2018-11-08 09:12:03 UTC - David Tinker: I have created a PR for this 
<https://github.com/apache/pulsar/pull/2961>
----
2018-11-08 11:17:20 UTC - Christophe Bornet: @Christophe Bornet has joined the 
channel
----
2018-11-08 12:52:41 UTC - Chris Miller: We have a project with dependencies on the 2.2.0 versions of pulsar-client, pulsar-client-admin, pulsar-client-flink, and pulsar-functions-api. This results in classpath collisions related to shaded vs. unshaded Netty code, e.g.:
```Exception in thread "PulsarListenerVerticle" java.lang.NoSuchMethodError: 
org.apache.pulsar.common.util.netty.EventLoopUtil.newEventLoopGroup(ILjava/util/concurrent/ThreadFactory;)Lorg/apache/pulsar/shade/io/netty/channel/EventLoopGroup;
        at 
org.apache.pulsar.client.impl.PulsarClientImpl.getEventLoopGroup(PulsarClientImpl.java:810)
        at 
org.apache.pulsar.client.impl.PulsarClientImpl.<init>(PulsarClientImpl.java:124)
        at 
org.apache.pulsar.client.impl.ClientBuilderImpl.build(ClientBuilderImpl.java:54)```
Seems like there's a fundamental problem with some Pulsar jars containing shaded classes while others don't. Any ideas?

[in our case the conflict seems to be from having both pulsar-common-2.2.0.jar 
and pulsar-client-2.2.0.jar on the classpath, where the client jar contains 
shaded versions of some of the contents of common]
----
2018-11-08 13:34:00 UTC - Ivan Kelly: have you tried with 
pulsar-client-original?
----
2018-11-08 13:34:10 UTC - Ivan Kelly: i.e. the non-shaded version?
----
2018-11-08 13:59:23 UTC - Chris Miller: Interesting, I didn't realise that's what it was for. Presumably that will be fine as long as I get everything shaded off the classpath (or vice versa). I just find it surprising that, e.g., the pulsar-common and pulsar-client jars don't play well together.
----
2018-11-08 13:59:58 UTC - Ivan Kelly: ya, that is surprising
----
2018-11-08 14:00:09 UTC - Ivan Kelly: doesn't pulsar-client pull common though?
----
2018-11-08 14:03:43 UTC - Chris Miller: Haven't finished investigating this yet, so I'm not 100% sure. Before I upgraded to 2.2.0, we weren't seeing pulsar-common-2.1.1.jar being pulled into our classpath. I updated our dependency to 2.2.0 and also added pulsar-flink; now we're seeing pulsar-common being pulled in, so I still need to check what's pulling it in. Also, <https://github.com/apache/pulsar/pull/2783> might be related somehow?
----
2018-11-08 14:31:45 UTC - Ivan Kelly: ah, it's probably pulsar-flink pulling it 
in
----
2018-11-08 14:32:21 UTC - Ivan Kelly: hmm, no
----
2018-11-08 14:33:21 UTC - Ivan Kelly: ah, could be something to do with 
<https://github.com/apache/pulsar/commit/e72ed35cc2b07e2ec39b84a23ae1819d94b152f4>
----
2018-11-08 14:43:05 UTC - Chris Miller: Yes it is pulsar-flink. If I take a 
shell project with just that as a dependency, I end up with 
pulsar-client-2.2.0.jar, pulsar-client-original-2.2.0.jar, 
pulsar-common-2.2.0.jar, pulsar-flink-2.2.0.jar (and a whole bunch of other 
deps). And if I turn off transitive dependencies for pulsar-flink in the real 
project, the problem seems to be solved
----
2018-11-08 15:56:55 UTC - tuan nguyen anh: @tuan nguyen anh has joined the 
channel
----
2018-11-08 16:00:59 UTC - tuan nguyen anh: Hello all, I will be giving a presentation on Pulsar, so I am looking for a tool or framework to test Pulsar's performance, but I haven't found one. If you know of one, please tell me. Thank you.
----
2018-11-08 16:03:10 UTC - Matteo Merli: 
<https://www.slideshare.net/mobile/merlimat/high-performance-messaging-with-apache-pulsar>
----
2018-11-08 16:05:55 UTC - tuan nguyen anh: Thank you, but I need to run the tests on my own PC and show them to my mentor.
----
2018-11-08 16:06:07 UTC - Matteo Merli: @Chris Miller it's a problem with the shade plugin: it is including the transitive dependencies of a sibling module which gets shaded and should not have these dependencies.

The commit that Ivan pointed out should fix the issue, since shading was not really needed in the pulsar-flink module itself.

We'll release it as part of the 2.2.1 release soon
----
2018-11-08 16:09:14 UTC - Chris Miller: OK great to hear, thanks for the 
explanation and quick fix. In the meantime I'm happy disabling the transitive 
dependencies for 2.2.0 in my build script
----
2018-11-08 16:11:29 UTC - Matteo Merli: Then you can use the `pulsar-perf` tool to send and receive traffic. More complex scenarios can be modeled through the OpenMessaging benchmark.
----
2018-11-08 16:20:15 UTC - tuan nguyen anh: I found it.
Thank you, and have a nice day.
----
2018-11-08 20:25:22 UTC - Beast in Black: Hi Guys (and @Matteo Merli), I am 
using the pulsar cpp client in a C++ application on which I'm currently 
working, and I have a couple of questions regarding some odd behavior I've seen.

Some background: The application publishes a *lot* of data on multiple non-persistent topics (about 0.75KB to 1KB per second per topic), and I am seeing some OOM issues and node/instance restarts (my app is on AWS along with the Pulsar infra - brokers, bookies, zookeeper, etc., all as K8s pods in AWS). My current understanding is that non-persistent topics are in-memory only and do not persist anything to disk.

My questions are:
1. Would the vast amount of data published to the non-persistent topics cause 
an OOM condition on my server (the AWS node/instance in this case)?
2. If the answer to the above is "yes", is there any way to mitigate this 
issue? Maybe by setting memory constraints on the JVM?
----
2018-11-08 20:32:14 UTC - Beast in Black: Additionally,
3. If I set memory constraints on the JVM, what would happen to messages that are published once the JVM hits the memory limit?
4. If I delete my application's K8s pod, then when the pod is restarted (it is a K8s StatefulSet) I notice that once the app comes back up, I see producer/consumer connect errors in the logs; these messages are from the pulsar client.

For (4) above I suspect that this is because when my app pod is restarted, it 
attempts to subscribe to the topics using the *same subscription name*. In this 
case, I'm wondering if the pulsar client errors are because from pulsar's 
perspective these subscriptions are still active, and so it errors out when I 
try to resubscribe using the same subscription name. When these errors happen, 
I've noticed that one way to fix it is to delete the broker pod first (which 
gets automatically restarted) and then delete my application pod (which also 
gets automatically restarted).

Doing this fixes the errors, which leads me to conclude that restarting the 
broker pod clears out existing subscriptions and so allows my application to 
come back up without the pulsar connect errors.

To fix this, my idea is to first unsubscribe the subscription name used by my app before I subscribe again to the topic. So if I do this, *will it be safe, i.e. will unsubscribing lead to loss of messages in the topic during the time between unsubscribing and then re-subscribing?*
----
2018-11-08 21:26:12 UTC - Emma Pollum: I'm having trouble restarting my pulsar nodes. When I do a service restart it no longer connects to zookeeper, with an exception saying that the znode is owned by another session. Anyone run into this before?
----
2018-11-08 22:03:28 UTC - Han Tang: Hi Experts, I am a newbie to Pulsar. Ignore 
me if I am asking a stupid question! Do any of you have interesting stories to 
share when your team was making the choice between Pulsar and Kafka?
----
2018-11-08 22:19:42 UTC - Beast in Black: @Han Tang IMNSHO this is a question 
best asked in the `random` slack channel. That said, here is some info for you 
on this matter:
<https://streaml.io/blog/pulsar-streaming-queuing>
<https://jack-vanlightly.com/blog/2018/9/14/how-to-lose-messages-on-a-kafka-cluster-part1>
 *along with* 
<https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster>

If you are a newbie, you may find this very informative and explanatory blog 
post extremely helpful (as I did): 
<https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works>
+1 : Matteo Merli, Ali Ahmed
----
2018-11-08 22:22:15 UTC - Beast in Black: @Han Tang some more informative posts:
<https://www.businesswire.com/news/home/20180306005633/en/Apache-Pulsar-Outperforms-Apache-Kafka-2.5x-OpenMessaging>
<http://www.jesse-anderson.com/2018/08/creating-work-queues-with-apache-kafka-and-apache-pulsar/>
+1 : Matteo Merli, Ali Ahmed
----
2018-11-08 22:34:44 UTC - Matteo Merli: That can happen with an abrupt shutdown. The session will automatically expire (e.g. after 30 sec) and the process will eventually succeed in restarting.

In any case, it's better to do a graceful shutdown of processes, by sending SIGTERM instead of SIGKILL. That will ensure that the ZK session is properly closed and also ensure a smoother topic failover.
----
2018-11-08 23:06:23 UTC - Christophe Bornet: Hi, I'm trying to add a dead letter policy to the websocket proxy.
I've added the lines
```
if (queryParams.containsKey("maxRedeliverCount")) {
    builder.deadLetterPolicy(DeadLetterPolicy.builder()
            .maxRedeliverCount(Integer.parseInt(queryParams.get("maxRedeliverCount")))
            .deadLetterTopic("DLQ")
            .build());
}
```
to ConsumerHandler::getConsumerConfiguration(), but it doesn't seem to work. If I don't ack the messages I still get them in a loop. Does someone know what I'm doing wrong?
----
2018-11-08 23:13:55 UTC - Christophe Bornet: Oh, stupid me: I forgot to set the subscription as shared... I'll do a PR to add DLQ support to the websocket proxy, if that's fine with you
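
(For context: the dead letter policy only takes effect on a shared subscription, which is what was missing above. Below is a minimal sketch of the equivalent setup with the Java client; the service URL, topic and subscription names, redelivery count, and ack timeout are illustrative, not taken from this thread.)
```
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class DlqSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // illustrative service URL
                .build();

        // The dead letter policy only kicks in on a Shared subscription:
        // once a message has been redelivered maxRedeliverCount times
        // (redelivery in this sketch is driven by the ack timeout), it is
        // routed to the dead letter topic instead of looping back forever.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")                        // illustrative names
                .subscriptionName("my-subscription")
                .subscriptionType(SubscriptionType.Shared)
                .ackTimeout(30, TimeUnit.SECONDS)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)
                        .deadLetterTopic("my-topic-DLQ")
                        .build())
                .subscribe();

        consumer.close();
        client.close();
    }
}
```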
----
2018-11-08 23:15:24 UTC - Matteo Merli: Nice! :+1:
----
2018-11-08 23:26:51 UTC - Beast in Black: @Matteo Merli any insights you could 
give me on the questions I posted earlier?
----
2018-11-08 23:28:22 UTC - Matteo Merli: Missed those, let me take a look
----
2018-11-08 23:34:32 UTC - Beast in Black: @Matteo Merli np, thank you!
----
2018-11-08 23:56:27 UTC - Matteo Merli: @Beast in Black

> 1. Would the vast amount of data published to the non-persistent topics cause an OOM condition on my server (the AWS node/instance in this case)?
> 2. If the answer to the above is "yes", is there any way to mitigate this issue? Maybe by setting memory constraints on the JVM?

Data for non-persistent topics is not "stored" in the broker either: messages are immediately dispatched to consumers, or dropped if a consumer is not ready. The Pulsar broker acts like a proxy in this case; the only thing to pay attention to is that all these in-flight messages are using memory (direct memory in this case).

There are ways to reduce the amount of memory used by Netty (through Netty-specific configs), and we have work in progress to make it simpler to configure that in Pulsar, and also to have a fallback strategy before giving up with an OOM (see: <https://github.com/apache/pulsar/wiki/PIP-24%3A-Simplify-memory-settings>)

For this specific problem, I'd suggest increasing the direct memory limit in the broker and monitoring the actual direct memory used.

> 3. If I set memory constraints on the JVM, what would happen to messages that are published once the JVM hits the memory limit?

There would be no change to the semantics. The problem is that if you hit the JVM memory limit, errors will be triggered.

> 4. If I delete my application's K8s pod, then when the pod is restarted (it is a K8s StatefulSet) I notice that once the app comes back up, I see producer/consumer connect errors in the logs; these messages are from the pulsar client.

The client library internally retries to reconnect until it succeeds; if you have multiple broker pods, the reconnection time should be very quick.

> For (4) above I suspect that this is because when my app pod is restarted, it attempts to subscribe to the topics using the *same subscription name*. In this case, I'm wondering if the pulsar client errors are because from pulsar's perspective these subscriptions are still active, and so it errors out when I try to resubscribe using the same subscription name. When these errors happen, I've noticed that one way to fix it is to delete the broker pod first (which gets automatically restarted) and then delete my application pod (which also gets automatically restarted).

I'm not sure I get what's happening there, but I don't think it is a problem of the subscription still being there. In any case the consumer will have disconnected, so the broker will let it re-subscribe to the same subscription again.

> To fix this, my idea is to first unsubscribe the subscription name used by my app before I subscribe again to the topic. So if I do this, *will it be safe, i.e. will unsubscribing lead to loss of messages in the topic during the time between unsubscribing and then re-subscribing?*

Keep in mind that, with non-persistent topics, data is never retained, so subscriptions are not persistent either.
----
2018-11-09 00:07:37 UTC - Beast in Black: @Matteo Merli that was very helpful, 
thank you very much.
----
2018-11-09 00:16:46 UTC - Beast in Black: @Matteo Merli a couple of follow-up questions:

> There would be no change to the semantics. The problem is that if you hit the JVM memory limit, errors will be triggered.
So would the messages (especially for non-persistent topics) still be published/received even if the JVM hits the limit? I imagine not, since that would - as you say - cause errors to be triggered.

> I'm not sure I get what's happening there, but I don't think it is a problem of the subscription still being there. In any case the consumer will have disconnected, so the broker will let it re-subscribe to the same subscription again.
What would happen if my app process which runs the consumer never cleanly disconnected, but experienced a crash or was terminated using `SIGKILL`? Would the re-subscription (to persistent topics) still work in this case using the same subscription name?
----
2018-11-09 00:22:33 UTC - Matteo Merli: 1. When getting OOMs the JVM process might be left in a bad state. That's why I was recommending setting the flags with a decent amount of memory if you expect large throughput. The defaults we have are conservative, meant to be reasonable on many deployments, but they should be increased for higher throughput.

2. If a consumer is abruptly killed, crashes, or is simply partitioned from the broker, the broker will evict it after a while. There is a health check in the pulsar protocol to detect these cases (both for clients and brokers)
----
2018-11-09 00:25:28 UTC - Beast in Black: @Matteo Merli For (1) I will experiment with different values higher than the defaults, and see what happens.

For (2), what would happen in the case that the consumer comes back and tries to re-subscribe (using the same name) *before* the broker eviction kicks in (say, within a minute of crashing or being killed)? Would the broker allow the re-subscription?
----
2018-11-09 00:32:34 UTC - Matteo Merli: It depends on the subscription type: for an exclusive subscription it will give an error. For failover and shared subscriptions it will allow the consumer to get attached
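
(To make the difference concrete, here is a minimal sketch using the Java client API; the C++ client has an analogous consumer-type setting. The service URL, topic, and subscription names are illustrative, not from this thread.)
```
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ResubscribeSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // illustrative service URL
                .build();

        // With a Failover (or Shared) subscription, a restarted consumer can
        // re-attach under the same subscription name even if the broker has not
        // yet evicted the previous, crashed consumer. With the default Exclusive
        // type, the broker rejects the new consumer until that eviction happens.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("non-persistent://public/default/my-topic")  // illustrative topic
                .subscriptionName("my-app-sub")       // same name as before the restart
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        consumer.close();
        client.close();
    }
}
```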
----
2018-11-09 00:33:56 UTC - Beast in Black: @Matteo Merli thanks!
+1 : Matteo Merli
----
2018-11-09 00:57:16 UTC - Beast in Black: @Matteo Merli just an FYI - I checked our application code, and it seems that on the app pod where I notice the consumer connect errors, the subscription type is *not* explicitly set to shared - it uses the default mode, which is exclusive mode as per `<https://pulsar.apache.org/docs/latest/getting-started/ConceptsAndArchitecture/#Exclusive-u84f1>`.

In light of the info you gave me above, it seems that this is most likely the cause of my woes: when that pod crashes, it recovers within a few minutes, so the broker does not have time to reap the previous subscription before the app pod tries to subscribe again with the same name. I will change the code to explicitly set the subscription mode to shared and see if my troubles melt away :smile:
----
2018-11-09 06:01:56 UTC - ytseng: @ytseng has joined the channel
----
2018-11-09 07:51:32 UTC - tuan nguyen anh: Hi all, I have a problem with Kafka and Pulsar: when I run the Kafka and Pulsar perf tools on my laptop with an HDD, Kafka gets ~60k msg/s and Pulsar 18k msg/s. Why the difference?
----
2018-11-09 07:54:05 UTC - Ali Ahmed: how are you running the kafka test?
----
2018-11-09 07:54:30 UTC - tuan nguyen anh: kafka-producer-perf-test.sh --topic 
TEST --num-records 18000000 --producer-props bootstrap.servers=localhost:9092 
--throughput 500000 --record-size 1024
----
2018-11-09 07:55:14 UTC - tuan nguyen anh: this is my command
----
2018-11-09 07:56:03 UTC - tuan nguyen anh: but if I run on an SSD, Pulsar gets 180k and Kafka 100k
----
2018-11-09 07:59:31 UTC - Ali Ahmed: I think the kafka producer is set up to batch
----
2018-11-09 08:04:46 UTC - Sijie Guo: @tuan nguyen anh how do you run pulsar 
perf?
----
2018-11-09 08:07:34 UTC - Karthik Ramasamy: @Sijie Guo - is this due to the 
non-separation of journal vs data?
----
2018-11-09 08:12:25 UTC - Sijie Guo: there are a couple of factors:

1) How many partitions does the Kafka topic `TEST` have, and how many partitions does the Pulsar topic in use have?
2) fsync is on by default in Pulsar, while Kafka never does fsync. If you want a better apples-to-apples comparison, you should consider setting `journalSyncData` to `false` in conf/bookkeeper.conf.
3) For Pulsar, if using SSD, configure multiple journal directories to improve I/O parallelism; if using HDD, separate the journal directory from the ledger directories.
----
2018-11-09 08:12:26 UTC - tuan nguyen anh: pulsar-perf produce -r 500000 -time 
30 my-topic
----
2018-11-09 08:14:26 UTC - Karthik Ramasamy: If you are using HDD, use a separate disk for the journal and a separate disk for data - otherwise you will get random seeks on the same disk, which affects the publish throughput
----
2018-11-09 08:16:57 UTC - tuan nguyen anh: I think Pulsar stores data in files and SSD is good for that, is that right?
----
2018-11-09 08:18:40 UTC - Sijie Guo: both Pulsar and Kafka store data in files, just with different storage formats. The difference is mainly in whether data is persisted to disk or not (Pulsar does fsync to ensure no data loss, but Kafka doesn't).
----
2018-11-09 08:23:43 UTC - tuan nguyen anh: I tried editing the fsync setting in Pulsar, but nothing changed
----
2018-11-09 08:24:35 UTC - tuan nguyen anh: Throughput produced:  12861.0  msg/s 
---    100.5 Mbit/s --- Latency: mean:  83.247 ms - med:   4.227 - 95pct:  
11.610 - 99pct:  23.696 - 99.9pct: 10051.391 - 99.99pct: 10051.519 - Max: 
10054.015
----
2018-11-09 08:25:14 UTC - Sijie Guo: did you restart pulsar after editing fsync?
----
2018-11-09 08:25:35 UTC - Sijie Guo: and this is hdd?
----
2018-11-09 08:25:47 UTC - tuan nguyen anh: yes, I set journalSyncData=false and restarted Pulsar, with an HDD disk
----
2018-11-09 08:30:08 UTC - Sijie Guo: are you running standalone, or...? are all settings default?
----
2018-11-09 08:31:04 UTC - tuan nguyen anh: I'm running pulsar standalone
----
2018-11-09 08:36:28 UTC - Sijie Guo: that's a bit weird. Your latency doesn't look like `fsync` is disabled: your mean latency is 83 ms, and the 99.9pct latency is up to 10 seconds
----
2018-11-09 08:37:20 UTC - Sijie Guo: oh
----
2018-11-09 08:37:41 UTC - Sijie Guo: if you are running standalone, then you should modify conf/standalone.conf
----
2018-11-09 08:40:04 UTC - Sijie Guo: nvm, I don’t think you should use 
standalone for perf
----
2018-11-09 08:40:28 UTC - tuan nguyen anh: I checked, and fsync is set to false in standalone.conf
----
2018-11-09 08:56:50 UTC - Sijie Guo: yeah, standalone is setting fsync to false. So the question would be why it takes 10 seconds to write to your HDD.

How much memory do you have? If you are on Linux, can you run `free` to see how your memory is being used?
----
2018-11-09 09:09:43 UTC - tuan nguyen anh: Here:
total        used        free      shared  buff/cache   available
Mem:        7860764     3417044     3336700      310032     1107020     3727636
Swap:       8075260      387584     7687676
----
