2019-08-30 12:48:14 UTC - geal: what is the difference between the 
`CommandSend` `num_messages` field and the message metadata’s 
`num_messages_in_batch` ?
----
2019-08-30 13:14:39 UTC - Rajiv Abraham: Hi, is there a FTP source connector 
that you are aware of? If not, is there a prescribed way of taking an existing 
Kafka FTP connector and converting that to Pulsar?
----
2019-08-30 14:08:36 UTC - Vladimir Shchur: `num_messages` looks to be not very 
important, but `num_messages_in_batch` is very important - its presence 
distinguishes whether a message is batched or not
----
2019-08-30 14:09:21 UTC - geal: both are apparently related to batching
----
2019-08-30 14:11:14 UTC - geal: maybe there are funny behaviours if both fields 
do not agree
----
2019-08-30 14:11:15 UTC - Vladimir Shchur: `num_messages` is optional, so as I 
remember everything will work correctly without it. As for 
`num_messages_in_batch`, its presence will make a message a batch even if it is 
equal to 1
----
2019-08-30 14:14:49 UTC - Matteo Merli: The reason for having both is that 
broker doesn’t necessarily parse the MessageMetadata (to avoid the CPU cost of 
it). So by repeating the batch size on the CommandSend, we can have the 
accurate publish rate stats
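
As a rough illustration of that point, here is a minimal Python sketch (not Pulsar's actual broker code) of counting published messages from the command header alone, leaving the payload undecoded. The `CommandSend` and `PublishStats` classes are hypothetical stand-ins, and the default of 1 for `num_messages` is an assumption for the case where a client omits the field.

```python
from dataclasses import dataclass


@dataclass
class CommandSend:
    """Hypothetical stand-in for the send command header."""
    producer_id: int
    sequence_id: int
    num_messages: int = 1  # assumed default when the client omits the field


@dataclass
class PublishStats:
    """Counts entries and messages without decoding the payload."""
    entries: int = 0
    messages: int = 0

    def record(self, cmd: CommandSend, payload: bytes) -> None:
        # The payload (metadata plus batched messages) stays opaque here:
        # the message count comes from the command header only.
        self.entries += 1
        self.messages += cmd.num_messages


stats = PublishStats()
stats.record(CommandSend(producer_id=1, sequence_id=0), b"opaque")
stats.record(CommandSend(producer_id=1, sequence_id=1, num_messages=50), b"opaque")
print(stats.entries, stats.messages)
```

One unbatched send plus one batch of 50 gives 2 entries and 51 messages, which is the figure a publish-rate counter needs.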
----
2019-08-30 14:16:17 UTC - Matteo Merli: I guess we need to fix the seek 
(probably the test for the seek on last message id was missing).
+1 : bsideup
----
2019-08-30 14:17:27 UTC - Matteo Merli: > but it doesn’t support 
per-partition reader, doesn’t it?

You can create a reader on each partition. What you won’t have is the failover 
mechanism to select an active consumer.
----
2019-08-30 14:19:07 UTC - geal: oh, this makes sense, thanks
----
2019-08-30 14:19:27 UTC - geal: I just looked at the java client, both fields 
will use the same value if it is batching: 
<https://github.com/apache/pulsar/blob/b45736ad0738116c9c3cae27ed18f1342b55139e/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java#L143-L144>
----
2019-08-30 14:20:16 UTC - Matteo Merli: Yes, the rationale for that check is 
that `2242:21` is not a valid entry and it would not be in the future, since 
the ledger is already closed and sealed at last entry 20.
----
2019-08-30 14:22:32 UTC - Matteo Merli: @Rowanto can you explain how you select 
`2242:21` as message id to seek? That shouldn’t be a message that the consumer 
has ever seen?
----
2019-08-30 14:33:58 UTC - Matteo Merli: the client is trying to send 100 msg 
each of 1MB, per second.

If you can send that traffic smoothly, you shouldn’t have any memory problem. 
Just assume avg latency * 100 MB/s, and you’ll get the max amount of memory 
required.

The main issue is that client is buffering messages until it gets a 
confirmation from broker. Producer has a finite queue of pending messages, 
which by default holds up to 1K messages. For 1MB messages that will represent 
a lot of memory.

For sending big messages, you should tune down the producer queue size:
----
2019-08-30 14:34:35 UTC - Matteo Merli: eg: `pulsar-perf produce .... 
--max-outstanding 10`
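
The arithmetic behind that advice can be sketched in a few lines of Python. The 50 ms average latency is purely an assumed figure for illustration; the 100 msg/s rate, the 1 MB message size, the 1K default queue and the queue size of 10 come from the messages above. The function names are made up for this sketch.

```python
MB = 1024 * 1024


def in_flight_bytes(avg_latency_s: float, throughput_bytes_per_s: float) -> float:
    """Bytes awaiting broker confirmation when traffic flows smoothly."""
    return avg_latency_s * throughput_bytes_per_s


def worst_case_queue_bytes(max_pending: int, msg_size_bytes: int) -> int:
    """Upper bound on buffered bytes when the pending queue fills up."""
    return max_pending * msg_size_bytes


# 100 msg/s of 1 MB each, with an ASSUMED 50 ms average publish latency.
smooth = in_flight_bytes(0.05, 100 * MB)
default_queue = worst_case_queue_bytes(1000, 1 * MB)  # default 1K pending
tuned_queue = worst_case_queue_bytes(10, 1 * MB)      # --max-outstanding 10

print(f"smooth traffic: about {smooth / MB:.0f} MB in flight")
print(f"full default queue: {default_queue // MB} MB")
print(f"full tuned queue: {tuned_queue // MB} MB")
```

Under these assumptions a full default queue holds roughly 1000 MB, while a queue of 10 caps the buffer at 10 MB.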
----
2019-08-30 14:37:47 UTC - Matteo Merli: It depends on the subscription types :
 * Shared -> It’s easy: round-robin across available consumers for each 
partition

 * Failover -> There will be 1 active consumer and N standby consumers for 
each partition.
  The brokers will pick the active such that each consumer will be active on 
the same number of partitions
----
2019-08-30 14:39:04 UTC - Matteo Merli: the way it’s implemented is using the 
“consumer name” (which by default is a random string). The consumers get 
sorted in a list by their name. Then for `partition-0`, brokers will pick the 
1st consumer, `partition-1` the 2nd and so on
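
A minimal Python sketch of the selection rule described here, not the broker's actual code: sort the consumer names, then give partition `i` the `i`-th name. Wrapping around with a modulo when there are more partitions than consumers is an assumption, and `pick_active_consumers` and the sample names are invented for the example.

```python
from collections import Counter


def pick_active_consumers(consumer_names, num_partitions):
    """Map each partition index to the consumer name chosen as active."""
    if not consumer_names:
        raise ValueError("at least one consumer is required")
    ordered = sorted(consumer_names)
    # Wrapping around with modulo is an assumption for the case where
    # there are more partitions than consumers.
    return {p: ordered[p % len(ordered)] for p in range(num_partitions)}


assignment = pick_active_consumers(["c-9f2", "c-1ab", "c-5d0"], 6)
for partition, name in assignment.items():
    print(f"partition-{partition} -> {name}")

# Number of partitions on which each consumer is active.
print(Counter(assignment.values()))
```

Because every broker can sort the same list of names independently, each one reaches the same answer for a given partition without talking to the others.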
----
2019-08-30 14:40:08 UTC - Matteo Merli: There’s no prescribed way, though it 
should be straightforward to convert an existing connector
----
2019-08-30 14:41:31 UTC - bsideup: since every consumer gets a random name (at 
least by default), how does it know that the selected consumers are not from 
the same “machine”?
----
2019-08-30 14:41:55 UTC - bsideup: Maybe it uses the client (shared between the 
consumers) name or something?
----
2019-08-30 14:42:38 UTC - Matteo Merli: yes, the change is on broker side and 
it should be a couple of lines
----
2019-08-30 14:45:30 UTC - Matteo Merli: No, it’s literally a random string. If 
there’s a conflict (2 consumers with same name), the effect is not 
catastrophic. Since they sort in unpredictable way, one of them might have a 
slightly different number of partitions than the other, but that’s it
----
2019-08-30 14:46:18 UTC - Matteo Merli: the intention was to have a mechanism 
to elect an “active” consumer without doing any coordination between brokers
----
2019-08-30 14:59:45 UTC - Rajiv Abraham: @Matteo Merli Thanks. Is there an 
existing connector that you would recommend I look at which is closest to FTP?
----
2019-08-30 15:29:49 UTC - Ali Ahmed: you can look at the pulsar-io-file 
connector to get started
+1 : David Kjerrumgaard
----
2019-08-30 15:50:51 UTC - Ming Fang: @David Kjerrumgaard You’re right, NiFi is 
not the right tool for processing.  I’ll look to use Flink instead.
+1 : David Kjerrumgaard
----
2019-08-30 16:04:24 UTC - Retardust: Is there any advice about choosing RAID for 
BookKeeper? With Kafka, people say that JBOD balances badly, so you should 
prefer RAID 10. But with the ledger architecture, is the data probably spread 
better?
----
2019-08-30 16:14:45 UTC - Rajiv Abraham: thanks @Ali Ahmed 
:slightly_smiling_face:
----
2019-08-30 16:56:16 UTC - Retardust: bookkeeper documentation says "To achieve 
optimal performance, BookKeeper requires each server to have at least two 
disks. " - is it about physical or logical drives?
----
2019-08-30 17:01:27 UTC - David Kjerrumgaard: @Retardust It is about having 
separate physical disks. To reduce the contention between random i/o and 
sequential write.
----
2019-08-30 17:01:36 UTC - David Kjerrumgaard: It is possible to run with a 
single disk, but performance will be significantly lower.
----
2019-08-30 17:04:54 UTC - David Kjerrumgaard: @Retardust As for your RAID 
questions, you won't need to implement hardware level redundancy, such as 
mirroring since BookKeeper already takes care of data replication at the 
software level, which is better as it protects you from both a single disk 
failure, like RAID does, as well as a single machine failure, which RAID cannot.
----
2019-08-30 17:08:57 UTC - Retardust: thanks.
And what about RAID0? Does BookKeeper spread load well enough? Do you know of any 
benchmarks with RAID0 vs JBOD vs two logical and physical disks?
Will BookKeeper be as slow with one logical disk as with one physical disk?
----
2019-08-30 17:12:44 UTC - Luke Lu: Quick question: if I have 100 inactive 
subscriptions on a topic, would they cause 100x write amplification on 
bookkeeper due to separate backlog for each subscription? I hope that 
subscriptions would just maintain cursors/caches for a shared ledger of the 
topic…
----
2019-08-30 17:40:26 UTC - David Kjerrumgaard: RAID0 will spread the load of a 
single "logical" disk across 2 physical disks. However, you don't want to use 
the same logical disk for both the journal and the ledgers, as you will 
ultimately end up with the same issue with contention as with a single disk.
----
2019-08-30 17:40:55 UTC - David Kjerrumgaard: I am not aware of any benchmark 
tests using different disk configurations, RAID0 vs. JBOD, etc.
----
2019-08-30 17:43:35 UTC - David Kjerrumgaard: In general the performance of 
RAID0 is highly dependent upon the workload, as there have been multiple tests 
conducted that show different results: some were significantly faster, while 
other tests showed decreased performance.
----
2019-08-30 17:49:01 UTC - David Kjerrumgaard: @Luke Lu All subscriptions are 
just offsets into the same topic, so an increase in the number of subscriptions 
does NOT correspond to an increase in storage size. The theoretical maximum 
amount of data you would retain for a given topic is equal to the number of 
entries between the oldest subscription's acked message and the most recent 
message (before data retention and TTL policies kick in). Let's say you have a 
really slow consumer on subscription A, and the last message acked on it was at 
offset 100. The most recently received message in the topic is at offset 1M, 
with all the other subscription offsets lying somewhere in between. Only 
entries 101 through 1M would be retained in the topic.
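
The example above can be checked with a small Python sketch. It uses plain integer offsets as in the explanation, ignores retention and TTL policies, and the `retained_range` helper and the offsets for subscriptions B and C are invented for illustration.

```python
def retained_range(acked_offsets, latest_offset):
    """Return (first, last) retained entry, or None if fully consumed.

    acked_offsets maps subscription name -> last acknowledged offset.
    """
    # The slowest subscription alone decides how far back data is kept.
    slowest = min(acked_offsets.values())
    if slowest >= latest_offset:
        return None
    return (slowest + 1, latest_offset)


# Subscription A is the slow one; B and C are hypothetical faster ones.
acked = {"A": 100, "B": 250_000, "C": 999_990}
print(retained_range(acked, 1_000_000))
```

Adding more subscriptions only adds entries to the dictionary; the retained range changes only if one of them falls behind the current slowest.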
----
2019-08-30 17:50:17 UTC - Luke Lu: Thanks! That’s exactly what I want to hear 
:slightly_smiling_face:
----
2019-08-30 17:51:51 UTC - David Kjerrumgaard: Bear in mind that subscription 
data is kept in zookeeper, so there is a cost for having a large number of 
subscriptions.
----
2019-08-30 17:54:24 UTC - Tarek Shaar: What's the max number of subscribers one 
can have for all topics combined?
----
2019-08-30 18:10:30 UTC - Matteo Merli: Up to millions 
----
2019-08-30 18:11:16 UTC - David Kjerrumgaard: I am not aware of a maximum. But 
millions can be handled provided you have a beefy enough ZK server.
----
2019-08-30 18:11:37 UTC - David Kjerrumgaard: just don't try to do it with a 
2GB RAM ZK instance.... :smiley:
----
2019-08-30 18:33:45 UTC - Retardust: OK, thanks!
----
2019-08-30 18:52:13 UTC - Tarek Shaar: That's good news for sure. So if I have 
100k topics and I need to have 100k separate subscriptions (one subscriber per 
topic), is it perfectly OK to create one (or two) PulsarClient objects and then 
spawn 100k subscribers to those topics from those PulsarClients?
----
2019-08-30 18:52:34 UTC - Matteo Merli: Yes
----
2019-08-30 18:53:24 UTC - Luke Lu: What’s the per subscription/topic overhead? 
How much ZK memory (Xmx and/or MaxDirectMemorySize) is needed for 1M 
topics/subscriptions?
----
2019-08-30 19:02:24 UTC - David Kjerrumgaard: It's not about the size of the 
data being retained for each subscription, which is minimal since it is just an 
offset, but rather the volume of transactions on ZK as each of these are 
updated frequently.....so in ZK-land each of these changes is a transaction 
that needs to be recorded and synced, etc.
----
2019-08-30 19:07:40 UTC - David Kjerrumgaard: So it really comes down to the 
throughput of the system. The higher the throughput, the higher the number of 
offset updates, which you will need to keep in ZK memory.
----
2019-08-30 19:08:22 UTC - David Kjerrumgaard: 1M subscriptions consuming 10 
msg/sec is much different than 1M subscriptions on 50K msg/sec  :smiley:
----
2019-08-30 19:10:10 UTC - Luke Lu: I see. How about 1M topics/subscriptions at 
5K msg/sec?
----
2019-08-30 19:14:18 UTC - David Kjerrumgaard: 5K msg/sec for each and every 
topic? Meaning (1M * 5K) = 5B msg/sec?
----
2019-08-30 19:16:57 UTC - Luke Lu: Let’s say 1000 active topics, 5M msg/sec 
total.
----
2019-08-30 19:48:20 UTC - Tarek Shaar: Can someone share some 
documentation/material on the role of cursors in keeping track of messages? Do 
I need to look at the book keeper documentation?
----
2019-08-30 19:59:35 UTC - David Kjerrumgaard: 
<https://streaml.io/blog/cursors-in-pulsar>
----
2019-08-30 20:03:15 UTC - Tarek Shaar: Thanks David
----
2019-08-30 20:28:53 UTC - Tarek Shaar: How are subscribers implemented 
internally? I know each pulsar client is a physical TCP connection. Is each 
subscriber a thread? If yes then how about consumers that attach to 
subscribers? How are consumers implemented?
----
2019-08-30 20:50:17 UTC - jialin liu: interesting, thanks much
----
2019-08-30 21:11:50 UTC - jialin liu: That solved my issue. @Matteo Merli
----
2019-08-30 23:55:34 UTC - Joe Francis: You can always scale out the ZK if 
needed .. <https://github.com/apache/pulsar/wiki/PIP-8:-Pulsar-beyond-1M-topics>
----
2019-08-31 00:00:09 UTC - Luke Lu: Understood. It just creates management 
overhead. I just want to understand reasonable single namespace/region limits
----
2019-08-31 06:45:41 UTC - Pasam Revanth kumar: @Pasam Revanth kumar has joined 
the channel
----
2019-08-31 06:53:56 UTC - Pasam Revanth kumar: Doesn't Pulsar support Scala?
----
2019-08-31 07:05:32 UTC - Bruno Bonnin: @Bruno Bonnin has joined the channel
----
2019-08-31 08:05:59 UTC - Ali Ahmed: @Pasam Revanth kumar here is the scala 
client <https://github.com/sksamuel/pulsar4s>
----
2019-08-31 08:07:30 UTC - Pasam Revanth kumar: But which language is used by 
the majority of the developers? :thinking_face: Is it Java?
----
2019-08-31 08:07:37 UTC - Ali Ahmed: yes
----
