2019-10-04 10:44:17 UTC - mgorelikov: @mgorelikov has joined the channel
----
2019-10-04 11:45:40 UTC - Roman: Hey Guys

I'm struggling with BookKeeper and Presto and hoped someone has a hint. I'm 
running a single BookKeeper node on Azure with 2 vCores, 4 GB of RAM, and two 
dedicated SSDs for the BK ledgers and journal. At the moment the system produces 
about 550-600 messages/s, which are stored in Pulsar for 24 hours. This 
accumulates to roughly 60 GB of data. Now I want to search through this data 
using Presto. However, a simple SELECT gives me very poor performance. For 
example, a query like `SELECT 1 FROM "topic-0" WHERE "session.id" = '1';` takes 
hours (even days), reading 300 messages/s with a throughput of about 400 kB/s. 
Reading from the same topic through a broker gives me much better throughput. I 
observed 100% CPU usage on the BookKeeper server while a query is running. The 
query is, of course, executed on a different machine.

I guess I misconfigured something here... Or is this the performance I can 
expect from a single BK node? Any help is appreciated.
----
2019-10-04 12:49:09 UTC - Tim Howard: @Tim Howard has joined the channel
----
2019-10-04 12:51:14 UTC - Tim Howard: Greetings. We would like to use delayed 
message delivery in the go client. But, unless I'm mistaken, it does not appear 
to be implemented/exposed yet. Is that accurate and if it is, is anyone working 
on it?
----
2019-10-04 15:33:45 UTC - David Kjerrumgaard: @Roman Currently, the data stored 
in BK is only indexed by the message timestamp in Presto, This means that the 
query you ran above has to do an entire table scan. Since you are using a 
single BK, the opportunity for parallelism is very limited. Therefore your 
performance will be slow.
----
2019-10-04 15:35:20 UTC - Roman: So basically all I can do for the moment is to 
add more bookies?
----
2019-10-04 15:38:47 UTC - David Kjerrumgaard: Unfortunately for those queries, 
yes. I would suggest running a time-based query to see how fast it performs in 
comparison to the previous query.  We are also working on adding more indexes 
for Presto to use.
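
For reference, a time-bounded scan might look like the sketch below. It is only 
an illustration: it assumes the `__publish_time__` metadata column that Pulsar 
SQL exposes, that the topic lives in the `public/default` namespace, and that 
`bin/pulsar sql` forwards the standard Presto CLI flags; adjust names and the 
time window to match your setup.
```
# Hedged sketch of a time-bounded Pulsar SQL query; the namespace and the
# one-hour window are assumptions, not taken from the thread.
bin/pulsar sql \
    --catalog pulsar \
    --schema "public/default" \
    --execute "SELECT count(*)
               FROM \"topic-0\"
               WHERE __publish_time__ BETWEEN timestamp '2019-10-04 00:00:00'
                                          AND timestamp '2019-10-04 01:00:00'"
```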
----
2019-10-04 15:44:31 UTC - Roman: Ok, I see. Guess I have to find another way to 
search through data as my use case requires to find all messages for a 
particular `session.id`. Maybe another job will do. Many thanks for the 
clarification.
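
In case it helps, a rough client-side workaround sketch, under the assumptions 
that the topic is in `public/default`, the messages are JSON containing a 
`session.id` field, and the `pulsar-client consume` flags behave as described 
by `--help` on this version: stream the topic once through a plain consumer and 
filter locally, since the broker read path was much faster than the Presto scan. 
A fresh subscription may start at the latest message, so it may need to be 
rewound to the earliest position first.
```
# Hypothetical one-off scan: consume the whole topic and grep for the session.
# Topic/namespace, subscription name and the exact JSON layout are assumptions.
bin/pulsar-client consume "persistent://public/default/topic-0" \
    --subscription-name session-scan \
    --num-messages 0 \
  | grep -F '"session.id":"1"'
```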
----
2019-10-04 15:55:13 UTC - Matteo Merli: Correct, it’s not exposed yet, though 
it would be easy to add
----
2019-10-04 17:09:48 UTC - Tim Howard: Agreed. I have basically started and 
wanted to verify it didn't exist and then engage about doing the work.
----
2019-10-04 17:09:56 UTC - Tim Howard: Thanks for confirming.
----
2019-10-04 17:11:28 UTC - Tim Howard: I will try to email the list a bit later 
announcing that I am starting the work in case someone else comes along to 
start it?
----
2019-10-04 17:58:19 UTC - Matteo Merli: I’m creating a PR to add the support in 
C++ lib
----
2019-10-04 17:58:46 UTC - Tim Howard: oh, you are?
----
2019-10-04 17:59:00 UTC - Tim Howard: well- that exponentially reduces the work 
I have to do then.
----
2019-10-04 18:02:05 UTC - Tim Howard: do you have a branch I can use as a basis?
----
2019-10-04 18:10:35 UTC - Matteo Merli: Coming. Adding tests and I’ll push the 
PR
+1 : Tim Howard
----
2019-10-04 19:53:44 UTC - alexk777: @alexk777 has joined the channel
----
2019-10-04 21:15:27 UTC - Matteo Merli: 
<https://github.com/apache/pulsar/pull/5317>
----
2019-10-04 21:16:12 UTC - Tim Howard: outstanding, thanks
----
2019-10-04 21:33:56 UTC - Chris DiGiovanni: My team updated from Pulsar 2.3.0 
to ~2.4.3~ 2.4.1.  Everything seemed to go well, though it looks like our 
brokers jetty server is randomly dying with the following error:

```
2019-10-04 13:58:12.292 [pulsar-web-29-5] WARN  org.eclipse.jetty.util.thread.strategy.EatWhatYouKill -
java.nio.channels.CancelledKeyException: null
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:71) ~[?:?]
        at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:130) ~[?:?]
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:473) ~[org.eclipse.jetty-jetty-io-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:351) ~[org.eclipse.jetty-jetty-io-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
```
----
2019-10-04 21:34:39 UTC - Ali Ahmed: there is no “2.4.3” released
----
2019-10-04 21:34:44 UTC - Chris DiGiovanni: Sorry 2.4.1
----
2019-10-04 21:34:50 UTC - Chris DiGiovanni: I mistyped
----
2019-10-04 21:35:16 UTC - Ali Ahmed: can you create a GitHub issue?
----
2019-10-04 21:35:21 UTC - Chris DiGiovanni: Sure
----
2019-10-04 21:41:57 UTC - Chris Bartholomew: @Chris DiGiovanni I was seeing a 
similar problem. I think in 2.4.1 the broker can hit the Jetty bug 
<https://github.com/eclipse/jetty.project/issues/3133>. The Jetty version 
didn't change between 2.3.0 and 2.4.1, but this bug is only exposed when 
debug logging is enabled. A change went into 2.4 that enabled debug logging by 
default: <https://github.com/apache/pulsar/pull/3661>. Since I manually 
reverted that change, I haven't seen that broker failure.
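
If you want to double-check whether your build ships that debug logging before 
patching anything, a quick look at the bundled Log4j2 config is usually enough. 
This is only a minimal sketch, assuming the stock `conf/` layout of a Pulsar 
binary release:
```
# Minimal sketch, assuming the stock conf/ layout of a Pulsar binary release.
# Any logger pinned to debug here (e.g. for the Jetty/web classes) is a
# candidate to set back to info, which is what the manual revert amounts to.
grep -n -i "debug" conf/log4j2.yaml
```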
----
2019-10-04 21:55:14 UTC - Chris DiGiovanni: Interesting... I'll look into those.
----
2019-10-04 21:55:48 UTC - Chris DiGiovanni: I created the issue: 
<https://github.com/apache/pulsar/issues/5318>
----
2019-10-04 21:59:25 UTC - Matteo Merli: Take a look at 
<https://github.com/apache/pulsar/pull/5079>
----
2019-10-04 22:05:20 UTC - Chris DiGiovanni: I just made the logging change on 
my brokers.  I'll comment on the issue if the broker survives over the weekend.
----
2019-10-04 22:20:53 UTC - Addison Higham: ```
6:08:22.204 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://restructure-test/cg-logs/iad-partition-1] [consume-test] Closed Consumer (not connected)
16:08:22.206 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://restructure-test/cg-logs/iad-partition-0] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://restructure-test/cg-logs/iad-partition-3] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://restructure-test/cg-logs/iad-partition-2] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://restructure-test/cg-logs/iad-partition-4] [consume-test] Closed Consumer (not connected)
16:08:22.208 [pulsar-external-listener-3-1] WARN  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - [persistent://restructure-test/cg-logs/iad] Failed to subscribe for topic [persistent://restructure-test/cg-logs/iad] in topics consumer, subscribe error: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
16:08:22.208 [pulsar-external-listener-3-1] WARN  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - Failed subscription for createPartitionedConsumer: persistent://restructure-test/cg-logs/iad 5, e:java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
16:08:22.209 [main] ERROR org.apache.pulsar.client.cli.PulsarClientTool - Error while consuming messages
16:08:22.210 [main] ERROR org.apache.pulsar.client.cli.PulsarClientTool - 7 lookup request timedout after ms 30000
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
        at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:247) ~[org.apache.pulsar-pulsar-client-api-2.4.1.jar:2.4.1]
        at org.apache.pulsar.client.impl.ConsumerBuilderImpl.subscribe(ConsumerBuilderImpl.java:91) ~[org.apache.pulsar-pulsar-client-original-2.4.1.jar:2.4.1]
...
```
not sure what is going on here... I can't create a subscription to a partitioned 
topic, and I also can't subscribe to a single partition. It appears to be a 
problem only in this namespace; it works fine in other namespaces. I confirmed my 
role has read/write permissions... the broker has log messages that say it is 
creating the subscription, it just never returns to the client
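
For anyone following along, a few hedged admin-side checks for a lookup timeout 
like this, using the topic and namespace names from the log above (the admin 
service URL is assumed to be configured in the client conf):
```
# Does the broker resolve ownership of a single partition at all?
bin/pulsar-admin topics lookup persistent://restructure-test/cg-logs/iad-partition-0

# Do the partitioned-topic stats come back, and does the subscription show up?
bin/pulsar-admin topics partitioned-stats persistent://restructure-test/cg-logs/iad

# What policies are actually set on the suspect namespace?
bin/pulsar-admin namespaces policies restructure-test/cg-logs
```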
----
2019-10-04 23:04:04 UTC - Addison Higham: okay, a little more down the rabbit 
hole: I can't delete the topic, when I try and delete it it gives me a 500 
error as it tries to delete all the topic partitions. I also tried manually 
deleting a a single partition of the topic and pointed directly at the assigned 
broker and it still failed. As best as I can tell, it is timing out somewhere 
and the exception is just swallowed so I get a 500
----
2019-10-04 23:10:02 UTC - Matteo Merli: can you take a jstack of the process?
----
2019-10-04 23:10:39 UTC - Matteo Merli: also, a heap dump will show the 
eventual exception, even if it gets eaten away
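
For reference, the standard JDK tooling for both captures looks roughly like 
this (replace `<broker-pid>` with the broker's process id):
```
# Thread dump and heap dump of the running broker; <broker-pid> is a placeholder.
jstack <broker-pid> > broker-threads.txt
jmap -dump:live,format=b,file=broker-heap.hprof <broker-pid>
```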
----
2019-10-04 23:13:53 UTC - Addison Higham: I have been periodically trying to 
see if I can figure out what is going on, as best as I can tell, it gets stuck 
here:
```
"pulsar-web-33-3" #86 prio=5 os_prio=0 tid=0x00007f339ea36000 nid=0xcb waiting on condition [0x00007f333a921000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000ccfabf60> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.getTopicReference(PersistentTopicsBase.java:1346)
        at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopicForcefully(PersistentTopicsBase.java:312)
        at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopic(PersistentTopicsBase.java:533)
        at org.apache.pulsar.broker.admin.v2.PersistentTopics.deleteTopic(PersistentTopics.java:360)
        at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
        ....
```
----
2019-10-04 23:14:05 UTC - Addison Higham: at least that is what consistently 
keeps showing up
----
2019-10-04 23:15:02 UTC - Addison Higham: 
----
2019-10-04 23:15:08 UTC - Addison Higham: ^^ full thread dump
----
2019-10-04 23:16:31 UTC - Addison Higham: so also... the reason I went all the 
down this path: I started a pulsar source (kinesis) and it output a few 
messages... and then stopped, it looked like the function was working just fine 
internally (i.e. filling up an internal buffer) but then read was just never 
getting called
----
2019-10-04 23:16:51 UTC - Addison Higham: then things got weirder with not 
being able to subscribe and now not being able to delete
----
2019-10-04 23:28:33 UTC - Addison Higham: a new topic in the same namespace has 
the same problem, but a new namespace has no issues. So I think there is 
something wrong the namespace. I created this namespace with some new terraform 
automation... and I am guessing something is weird there
----
2019-10-04 23:31:48 UTC - Addison Higham: okay yeah... all this is something 
weird with that namespace, I can't figure out what settings it doesn't like, we 
had used the `pulsar-admin` defaults of `0` for max-consumers/max-producers, 
which I am guess that default isn't correct and unset means something totally 
different
----
2019-10-04 23:33:27 UTC - Matteo Merli: That should mean unlimited 
----
2019-10-04 23:40:45 UTC - Addison Higham: :thinking_face: well I will try 
playing with settings and see if anything there can explain that behavior
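
A rough plan for that, sketched with an assumed name for the known-good 
namespace and illustrative limits (not values from this thread):
```
# Compare the full policies of the broken namespace against a working one.
# "restructure-test/cg-logs-cli" is a hypothetical name for the CLI-created namespace.
bin/pulsar-admin namespaces policies restructure-test/cg-logs > broken.json
bin/pulsar-admin namespaces policies restructure-test/cg-logs-cli > working.json
diff broken.json working.json

# If the 0 defaults for max producers/consumers are the culprit, setting
# explicit non-zero limits should make the symptoms disappear.
bin/pulsar-admin namespaces set-max-producers-per-topic restructure-test/cg-logs \
    --max-producers-per-topic 1000
bin/pulsar-admin namespaces set-max-consumers-per-topic restructure-test/cg-logs \
    --max-consumers-per-topic 1000
```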
----
2019-10-04 23:41:18 UTC - Addison Higham: but a new namespace created by the 
CLI (instead of by our terraform automation) has this totally working with the 
kinesis source, which is what sent me down this crazy path in the first place
----
2019-10-05 08:58:11 UTC - Poule: is the kubernetes version required to be 
1.10.5 or it's just an example in the doc:
```
minikube start --memory=8192 --cpus=4 \
    --kubernetes-version=v1.10.5
```
----
2019-10-05 09:00:59 UTC - Ali Ahmed: it’s just the example
hugging_face : Poule
----