Slack digest for #general - 2019-03-07

Apache Pulsar Slack Thu, 07 Mar 2019 01:11:21 -0800

2019-03-06 10:15:32 UTC - Vincent Ngan: :ok_hand:
----
2019-03-06 10:31:28 UTC - Byron: Ah ok, so clusters in an “instance” are 
assumed to share/replicate data. Understood thanks.
----
2019-03-06 12:34:09 UTC - Byron: Hi folks, I had a question about this 
statement in docs about increasing the number of partitions:
> Already created partitioned producers and consumers can’t see newly 
created partitions and it requires to recreate them at application so, newly 
created producers and consumers can connect to newly added partitions as well. 
Therefore, it can violate partition ordering at producers until all producers 
are restarted at application.


This statement seems to imply that downtime is required (disconnect producers 
and consumers) before/during a re-partitioning? So app is running, disconnect 
clients, update the number of partitions, reconnect clients.. to guarantee 
ordering. Is this correct? A related question is whether existing messages are 
re-partitioned when this happens? In other words, if a consumer was created to 
read from one partition (per this thread 
<https://github.com/apache/pulsar/issues/3098>) then the consumer would need to 
change the topic name to the new partition to consume from. I suppose this 
wouldn’t work in the case of a custom partitioning function? I am not 
suggesting I would do this on the consumer side, but I am curious of the 
behavior and edge cases that one could run into.
----
2019-03-06 12:55:57 UTC - Sébastien de Melo: It's solved.  The command args 
must be:
                    mkdir logs &&
                    bin/apply-config-from-env.py conf/broker.conf &&
                    bin/apply-config-from-env.py conf/client.conf &&
                    bin/gen-yml-from-env.py conf/functions_worker.yml &&
                    bin/apply-config-from-env.py conf/pulsar_env.sh &&
                    bin/pulsar broker
----
2019-03-06 12:56:06 UTC - Sébastien de Melo: I am not sure for client.conf
----
2019-03-06 12:56:10 UTC - Darragh: hi, we've managed to get pulsar running on 
ec2 instances and are seeing some nice mean latencies when doing a pulsar-perf 
with n=10000, but the tail latencies spike quite often.  Any ideas as to what 
we could tweak ?
----
2019-03-06 13:11:22 UTC - Maarten Tielemans: Some information about the setup:
- The ledgers and journal run on separate NVMe
- We are using XFS as filesystem for the NVMe
- The NVMe are EBS, type io1, size 128GB, 6400 iops
- journalDataSync=true (but we also see the spikes when set to false)
- Bookkeeper and the broker run on the same instances
- We tried with multiple settings of ensemble, quorum and ack, I believe we 
currently use 3 2 2
- For Zookeeper, pulsar_env.sh was set to 2GB of memory. For Bookkeeper and 
broker (same instance) it was set to 12GB
- We also see the spikes when we use a non-persistent topic (200+ms 99.9% 
latency)
----
2019-03-06 13:26:15 UTC - Byron: Based on that issue, it appears that 
partitions are just internal topics? So if there is a requirement for message 
keys to be sticky to a topic, it seems that managing this explicitly is a 
better strategy?
----
2019-03-06 13:26:35 UTC - Maarten Tielemans: ```
13:23:47.474 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:   9999.9  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.313 ms - med:   0.309 - 95pct:   0.378 - 99pct:   0.420 - 99.9pct:   0.495 - 
99.99pct:   1.512 - Max:   1.558
13:23:57.479 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:   9999.8  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.313 ms - med:   0.309 - 95pct:   0.378 - 99pct:   0.418 - 99.9pct:   0.483 - 
99.99pct:   0.821 - Max:   1.512
13:24:07.484 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.2  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.312 ms - med:   0.308 - 95pct:   0.378 - 99pct:   0.422 - 99.9pct:   0.539 - 
99.99pct:   0.697 - Max:   1.564
13:24:17.489 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.5  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.316 ms - med:   0.310 - 95pct:   0.385 - 99pct:   0.441 - 99.9pct:   0.777 - 
99.99pct:   1.004 - Max:   1.049
13:24:27.509 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.5  msg/s ---     78.1 Mbit/s --- Latency: mean:   
2.142 ms - med:   0.319 - 95pct:   0.419 - 99pct: 126.272 - 99.9pct: 203.900 - 
99.99pct: 206.503 - Max: 207.122
13:24:37.518 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.1  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.306 ms - med:   0.296 - 95pct:   0.381 - 99pct:   0.429 - 99.9pct:   0.565 - 
99.99pct:   0.745 - Max:   1.491
13:24:47.523 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.1  msg/s ---     78.1 Mbit/s --- Latency: mean:   
0.310 ms - med:   0.303 - 95pct:   0.385 - 99pct:   0.423 - 99.9pct:   0.570 - 
99.99pct:   0.781 - Max:   0.794
```
(This is non-persistent, producer latency)
----
2019-03-06 13:40:14 UTC - Maarten Tielemans: ```
13:39:09.016 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.5  msg/s ---     78.1 Mbit/s --- Latency: mean:  
20.742 ms - med:   5.251 - 95pct: 152.269 - 99pct: 247.212 - 99.9pct: 286.991 - 
99.99pct: 294.061 - Max: 295.105
13:39:19.031 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.6  msg/s ---     78.1 Mbit/s --- Latency: mean:   
9.640 ms - med:   5.184 - 95pct:  31.784 - 99pct: 112.054 - 99.9pct: 278.791 - 
99.99pct: 285.463 - Max: 286.473
13:39:29.043 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:   9999.9  msg/s ---     78.1 Mbit/s --- Latency: mean:  
23.945 ms - med:   5.237 - 95pct: 169.349 - 99pct: 198.288 - 99.9pct: 323.295 - 
99.99pct: 323.377 - Max: 323.435
13:39:39.059 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - 
Throughput produced:  10000.5  msg/s ---     78.1 Mbit/s --- Latency: mean:  
25.457 ms - med:   5.263 - 95pct: 168.663 - 99pct: 205.695 - 99.9pct: 259.992 - 
99.99pct: 265.065 - Max: 317.673
```
(This is persistent, producer latency)
----
2019-03-06 13:55:33 UTC - Wang Jinhong: @Wang Jinhong has joined the channel
----
2019-03-06 13:55:56 UTC - Valery: @Valery has joined the channel
----
2019-03-06 14:35:05 UTC - Matteo Merli: Are the NVMe disks locally attached?
----
2019-03-06 14:35:56 UTC - Matteo Merli: The latency numbers are way off for 
writes going to NVMe disks
----
2019-03-06 14:36:54 UTC - Matteo Merli: In any case, to reduce tail latency, 
the preferred config would be 3 / 3 / 2
----
2019-03-06 14:37:14 UTC - Matteo Merli: That is, write to 3 bookies and wait 
for 2 acks
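
A rough way to see why 3 / 3 / 2 helps: an add completes once the ack quorum has responded, so its latency is the response time of the ack-quorum-th fastest bookie in the write set. The sketch below is a toy model with made-up latencies. The function name and the numbers are illustrative assumptions, not BookKeeper code.

```python
def add_latency_ms(bookie_latencies_ms, ack_quorum):
    """Latency of one add: the time at which `ack_quorum` bookies have acked.

    `bookie_latencies_ms` holds one response time per bookie in the write
    quorum. Toy model only; real BookKeeper behavior has more moving parts.
    """
    if not 1 <= ack_quorum <= len(bookie_latencies_ms):
        raise ValueError("ack quorum must be between 1 and the write quorum")
    return sorted(bookie_latencies_ms)[ack_quorum - 1]


# Hypothetical response times: two healthy bookies and one that stalls.
fast_a, fast_b, stalled = 0.3, 0.4, 200.0

# Write quorum 2, ack quorum 2: if the stalled bookie is in the write set,
# the add has to wait for it.
print(add_latency_ms([fast_a, stalled], ack_quorum=2))

# Write quorum 3, ack quorum 2: the stalled bookie no longer gates the add.
print(add_latency_ms([fast_a, fast_b, stalled], ack_quorum=2))
```

With an ack quorum smaller than the write quorum, a single slow bookie does not set the latency of the add.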
----
2019-03-06 14:38:06 UTC - Chris DiGiovanni: After trying to setup Tiered 
Storage to an Internal Ceph Rados Gateway (S3 API) I ran into what I thought 
would be the issue, certs.  After waiting for the offload status to come back.  
I get this error from `pulsar-admin topics offload-status`

```
null

Reason: Error offloading: org.apache.bookkeeper.mledger.ManagedLedgerException: 
java.util.concurrent.CompletionException: 
org.jclouds.http.HttpResponseException: 
sun.security.validator.ValidatorException: PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target connecting to POST 
https://**-***.***.**/***-dev-pulsar-topic-offload/3243c6cf-b3d0-4fff-8e40-912c61793a64-ledger-10?uploads
 HTTP/1.1
```
----
2019-03-06 14:38:31 UTC - Matteo Merli: Can you verify with `iostat` that the 
writes are indeed going to the expected disks?
----
2019-03-06 14:40:38 UTC - Chris DiGiovanni: My first inclination is to present 
my own Java keystore with our internal CA certs.  Though not sure how I add 
these options to the startup.  Currently deploying via Kubernetes
----
2019-03-06 14:43:47 UTC - Darragh: we can confirm that the writes are going to 
the nvme's with iostat
----
2019-03-06 14:44:31 UTC - Maarten Tielemans: ```
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.43    0.00    3.51    4.86    0.81   88.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz 
avgqu-sz   await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     
0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     
0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00 1059.00     0.00    12.17    23.54     
0.92    0.88   0.85  90.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00     0.00     
0.00    0.00   0.00   0.00
```
----
2019-03-06 14:44:44 UTC - Darragh: ```[ec2-user@ip-10-0-2-47 ~]$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0    0 884.8G  0 disk 
xvda    202:0    0     8G  0 disk 
└─xvda1 202:1    0     8G  0 part /
xvdf    202:80   0   128G  0 disk /mnt/journal
xvdg    202:96   0   128G  0 disk /mnt/storage```
----
2019-03-06 14:46:16 UTC - Darragh: xvdf and xvdg are nvme ebs volumes
----
2019-03-06 14:52:10 UTC - Matteo Merli: So, from the above, xvdf looks like 90% 
busy
----
2019-03-06 14:53:25 UTC - Matteo Merli: That’s strange for 12 MB/s and 1K iops 
----
2019-03-06 14:54:28 UTC - Matteo Merli: Is the throughput tied to the volume 
size? It usually is on EC2
----
2019-03-06 14:55:46 UTC - Maarten Tielemans: Not sure. This is the 
configuration of the EBS volumes. I can change this if needed.
```
Size - 128 GiB
Encrypted - Not Encrypted
Volume type - io1
IOPS - 6400
```
----
2019-03-06 14:56:32 UTC - Maarten Tielemans: When we set 
`journalDataSync=false`, we still see similar tail latency numbers. The disk IO 
is different in those cases, and seems to happen in bursts of ~100-200MB (every 
10 seconds).
----
2019-03-06 14:59:43 UTC - Maarten Tielemans: Related to the throughput, the 
IOPS of the EBS volumes was limited to 50x size in GB (6400 = 50 * 128)
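
The arithmetic above as a small sketch. The ratio of 50 IOPS per GB is the limit quoted in this thread for io1 volumes. Treat it as an assumption and check the current EBS documentation before relying on it. The function names are illustrative.

```python
IO1_IOPS_PER_GIB = 50  # ratio quoted in this thread; may differ today


def max_provisioned_iops(size_gib, iops_per_gib=IO1_IOPS_PER_GIB):
    """Upper bound on provisioned IOPS for a volume of the given size."""
    return size_gib * iops_per_gib


def min_size_for_iops(target_iops, iops_per_gib=IO1_IOPS_PER_GIB):
    """Smallest whole-GiB volume size that allows `target_iops`."""
    return -(-target_iops // iops_per_gib)  # ceiling division


print(max_provisioned_iops(128))   # 6400, matching the volume above
print(min_size_for_iops(10000))    # 200
```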
----
2019-03-06 15:13:28 UTC - Chris Bartholomew: Hi Folks. It seems that when 
de-duplication is enabled the storageSize for the topic never goes to 0 even if 
all the messages have been acked. In fact, even if I delete all the 
subcriptions I am still seeing a non-zero value for storage. Here are the stats 
for my topic:  ```{
    "msgRateIn": 0,
    "msgThroughputIn": 0,
    "msgRateOut": 0,
    "msgThroughputOut": 0,
    "averageMsgSize": 0,
    "storageSize": 55313,
    "publishers": [],
    "subscriptions": {},
    "replication": {},
    "deduplicationStatus": "Enabled"
}``` This doesn't happen if I have de-duplication disabled. Looking at the 
internal stats for the topic, I noticed that there is a cursor for 
deduplication: ``` "cursors": {
        "pulsar.dedup": {
            "markDeletePosition": "532:8999",
            "readPosition": "532:9986",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -986,
            "cursorLedger": -1,
            "cursorLedgerLastEntry": -1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2019-03-06T14:20:37.876Z",
            "state": "NoLedger",
            "numberOfEntriesSinceFirstNotAckedMessage": 987,
            "totalNonContiguousDeletedMessagesRange": 0,
            "properties": {
                "useast1-gcp-31-1": 9011
            }
        }
    }``` Are messages stored for de-duplication purposes? I wouldn't expect 
that to be necessary (IDs yes, messages no).
----
2019-03-06 15:37:31 UTC - Sijie Guo: can you show the full stats and 
stats-internal output?
----
2019-03-06 15:38:31 UTC - Sijie Guo: /cc @jia zhai or @Ivan Kelly. they might 
have a quick idea on this.
----
2019-03-06 15:44:03 UTC - Sijie Guo: > This statement seems to imply that 
downtime is required (disconnect producers and consumers) before/during a 
re-partitioning?

the behavior has been changed in 2.3.0. partitions are automatically updated in 
producer and consumer when the number of partitions is changed. the 
documentation might need to be updated. (/cc @jia zhai for updating the 
documentation)

> A related question is whether existing messages are re-partitioned when 
this happens?

currently pulsar doesn’t handle this, since “repartitioning” is tied to how 
messages are routed to partitions. in order to support something like 
per-key-ordering, additional information might be required (e.g. the hashing 
rule and the number of partitions before rehashing), and consumers are required 
to consume messages in a certain order.
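
To illustrate why the hashing rule and the old partition count matter, here is a minimal sketch of hash-mod routing. It assumes a key is routed as hash(key) mod partition count. `zlib.crc32` is a stand-in hash chosen for the example and is not the hash the Pulsar client uses. The key names are made up.

```python
import zlib


def route(key, num_partitions):
    """Pick a partition as hash(key) mod partition count.

    zlib.crc32 is a stand-in hash for illustration; it is NOT the hash
    Pulsar's client uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(20)]
moved = [k for k in keys if route(k, 4) != route(k, 8)]
print(f"{len(moved)} of {len(keys)} keys change partition going from 4 to 8")
```

Messages published for a moved key before the change stay in the old partition, so a consumer that wants per-key ordering would need to know both the old and the new mapping.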
----
2019-03-06 15:45:42 UTC - Chris Bartholomew: Sure. What I previously posted was 
the full stats output. I had used the topic to send messages, but deleted all 
subscriptions, so I was expecting to see zeroes across the board:  ```{
    "msgRateIn": 0,
    "msgThroughputIn": 0,
    "msgRateOut": 0,
    "msgThroughputOut": 0,
    "averageMsgSize": 0,
    "storageSize": 55313,
    "publishers": [],
    "subscriptions": {},
    "replication": {},
    "deduplicationStatus": "Enabled"
}``` And internalStats: ```{
    "entriesAddedCounter": 0,
    "numberOfEntries": 9986,
    "totalSize": 559313,
    "currentLedgerEntries": 0,
    "currentLedgerSize": 0,
    "lastLedgerCreatedTimestamp": "2019-03-06T14:20:37.875Z",
    "waitingCursorsCount": 0,
    "pendingAddEntriesCount": 0,
    "lastConfirmedEntry": "532:9985",
    "state": "LedgerOpened",
    "ledgers": [
        {
            "ledgerId": 532,
            "entries": 9986,
            "size": 559313,
            "offloaded": false
        },
        {
            "ledgerId": 635,
            "entries": 0,
            "size": 0,
            "offloaded": false
        }
    ],
    "cursors": {
        "pulsar.dedup": {
            "markDeletePosition": "532:8999",
            "readPosition": "532:9986",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -986,
            "cursorLedger": -1,
            "cursorLedgerLastEntry": -1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2019-03-06T14:20:37.876Z",
            "state": "NoLedger",
            "numberOfEntriesSinceFirstNotAckedMessage": 987,
            "totalNonContiguousDeletedMessagesRange": 0,
            "properties": {
                "useast1-gcp-31-1": 9011
            }
        }
    }
}``` I probably should have mentioned that I have retention set on the 
namespace (2 days). But from what I can see, that doesn't usually affect the 
storageSize--that only tracks the unacked messages in the subscription 
backlog.
----
2019-03-06 15:52:36 UTC - Sijie Guo: so from internal stats, there are 2 
ledgers, and the size of the first ledger is 559313. the dedup cursor’s mark 
delete position is 532:8999 and the last position is 532:9986, which means the 
dedup cursor is holding about 987 entries, which is the `55313` bytes shown in 
`stats`. this would explain why the storage size is not zero.

then the question is simpler now: why is the dedup cursor holding those 987 entries?
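
The numbers above can be checked with a few lines. This is a sketch that assumes positions have the form `ledgerId:entryId` and that both positions fall in the same ledger. The helper names are made up for the example, and the byte estimate uses the average entry size of ledger 532, so it is only approximate.

```python
def parse_position(position):
    """Split a 'ledgerId:entryId' string into two ints."""
    ledger_id, entry_id = position.split(":")
    return int(ledger_id), int(entry_id)


def entries_held(mark_delete, read_position):
    """Entries between the two positions, assuming both are in one ledger."""
    md_ledger, md_entry = parse_position(mark_delete)
    rd_ledger, rd_entry = parse_position(read_position)
    if md_ledger != rd_ledger:
        raise ValueError("this sketch only handles a single ledger")
    return rd_entry - md_entry


# Values copied from the internal stats above.
held = entries_held("532:8999", "532:9986")
avg_entry_size = 559313 / 9986          # ledger size / ledger entries
print(held)                              # 987
print(round(held * avg_entry_size))      # close to the reported storageSize
```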
----
2019-03-06 15:59:02 UTC - Chris Bartholomew: If I remember correctly, that is 
the number of messages I published to the topic. Then I deleted the 
subscription. However, I am pretty sure I see this even if I ack all the 
messages with a client (ie the subscription backlog is 0 on all subscriptions)
----
2019-03-06 16:01:29 UTC - Sijie Guo: >  why dedup cursor is holding those 
987 entries?

to understand this, you might need to understand a bit how dedup works. I'll 
try to explain it in short. you can think about it this way: for the messages 
before the cursor, the last sequence id seen from each producer is snapshotted 
into a persistent map; the sequence ids of messages after the cursor are kept 
only in memory. the cursor is advanced only when a new snapshot is taken (this 
is to guarantee durability and no state loss).

currently the snapshot mechanism is based on the number of messages (basically 
snapshotting every x messages). so if a new snapshot has not been taken, those 
messages are kept.
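
A toy model of the mechanism described above, to show why entries linger on an idle topic. The class and its fields are invented for illustration and are not broker code. The snapshot interval of 1000 is an assumption, picked because it is consistent with the mark-delete position of 8999 in the stats.

```python
class DedupCursorModel:
    """Toy model: the cursor only advances when a snapshot is taken."""

    def __init__(self, snapshot_interval):
        self.snapshot_interval = snapshot_interval
        self.published = 0          # total entries written so far
        self.mark_delete = -1       # entry id covered by the last snapshot

    def publish(self, count=1):
        for _ in range(count):
            self.published += 1
            if self.published % self.snapshot_interval == 0:
                # Snapshot taken: everything up to the latest entry is covered.
                self.mark_delete = self.published - 1

    def entries_retained(self):
        """Entries after the mark-delete position, kept until the next snapshot."""
        return self.published - (self.mark_delete + 1)


model = DedupCursorModel(snapshot_interval=1000)  # interval is an assumption
model.publish(9986)
print(model.mark_delete)         # 8999
print(model.entries_retained())  # 986
```

In this model the retained entries are only released by the next snapshot, which needs more publishes to trigger. The count is close to the 987 reported in the stats.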
----
2019-03-06 16:05:21 UTC - Alexandre DUVAL: Hi, when I try to list topics on 
tenants/namespaces as a "normal user" I get:

```➜ kannar@pond  ~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ 
../bin/pulsar-admin topics list yo/logs
Don't have permission to administrate resources on this tenant

Reason: Don't have permission to administrate resources on this tenant
```
That's expected, but when I try as a super user, I get the following error:
```
➜ kannar@pond  ~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ 
../bin/pulsar-admin topics list yo/logs
HTTP 500 Server Error

Reason: HTTP 500 Server Error
```
When I check logs from my brokers I have 401 Authentication required. 
Interpreted as 500 by proxies I guess.
----
2019-03-06 16:06:28 UTC - Alexandre DUVAL: Also, I get random 401s. I ran 
these commands with the same configuration: ```➜ kannar@pond  
~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ ../bin/pulsar-admin 
topics offload-status persistent://yo/logs/full-partition-2
Offload has not been run for persistent://yo/logs/full-partition-2 since 
broker startup
➜ kannar@pond  ~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ 
../bin/pulsar-admin topics offload-status 
persistent://yo/logs/full-partition-2
HTTP 401 Unauthorized

Reason: HTTP 401 Unauthorized
➜ kannar@pond  ~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ 
../bin/pulsar-admin topics offload-status 
persistent://yo/logs/full-partition-2
Offload has not been run for persistent://yo/logs/full-partition-2 since 
broker startup
➜ kannar@pond  ~/pulsar/logstash-output-pulsar/pulsar/conf git:(master) ✗ 
../bin/pulsar-admin topics offload-status 
persistent://yo/logs/full-partition-2
HTTP 401 Unauthorized

Reason: HTTP 401 Unauthorized
```
----
2019-03-06 16:07:07 UTC - Alexandre DUVAL: The configurations are the same for 
all proxies and for all brokers.
----
2019-03-06 16:07:33 UTC - Matteo Merli: Regarding journalDataSync=false, the 
write spikes are happening when the OS is flushing the page cache
----
2019-03-06 16:08:29 UTC - Matteo Merli: In any case, I’d try with a bigger EBS 
to get more throughput 
----
2019-03-06 16:09:47 UTC - Chris DiGiovanni: I actually just got this working... 
 I had to create a keystore with my CAs in it.  Created a config map to present 
the cacerts file I created to the brokers.  I then needed to add this option to 
the broker.config

```
PULSAR_EXTRA_OPTS: '"-Djavax.net.ssl.trustStore=/certs/cacerts"'
```
----
2019-03-06 16:09:59 UTC - Alexandre DUVAL: (I use JWT authentication).
----
2019-03-06 16:10:02 UTC - Chris DiGiovanni: After this, everything seems to be 
working smoothly...
+1 : Sijie Guo, jia zhai
slightly_smiling_face : Sijie Guo
----
2019-03-06 16:10:11 UTC - Alexandre DUVAL: @Matteo Merli you can't imagine the 
impatience behind the *merlimat is typing* :stuck_out_tongue:.
rolling_on_the_floor_laughing : David Kjerrumgaard, Sébastien de Melo, Laurent 
Chriqui
----
2019-03-06 16:13:11 UTC - Chris Bartholomew: OK, I get why these messages are 
still in storage. They haven't been "snapshotted" yet and the only way to do 
that is to send more messages to the topic. There is no timer to run the 
snapshot even if the topic is idle. The confusing part (to me, anyway) is that 
my topic doesn't look empty even though there are no unacked messages in it.
----
2019-03-06 16:15:05 UTC - Maarten Tielemans: Do you have any recommendation for 
the EBS type? Should I use the same for journal/ledger?
----
2019-03-06 16:16:20 UTC - Byron: > partitions are automatically updated in 
producer and consumer when the number of partitions are changed.
So to be clear, any client connections that are established (producer or 
consumer) will get this info transparently? So if I am consuming a topic (so 
all partitions), it will be fully transparent? Likewise a producer will publish 
a message with a key that went to, say, partition 1 before the change, and 
after the partition update, it may get routed to partition 4?
----
2019-03-06 16:17:37 UTC - Chris Bartholomew: I am guessing I can calculate the 
unacked storage by subtracting the amount from the dedup cursor. However, it 
only looks like I can get the message count, not the total size of those 
messages.
----
2019-03-06 16:19:14 UTC - Darragh: additionally are there any other commands 
you could think of that could shed some extra light on what is causing these 
tail latencies ?
----
2019-03-06 16:26:16 UTC - Sijie Guo: “storageSize”: 55313 is the size.
----
2019-03-06 16:30:57 UTC - Matteo Merli: You could use io1 for journal, with a 
higher size and st1 for ledgers (if you want to have more storage capacity per 
cost)
----
2019-03-06 16:31:54 UTC - Matteo Merli: To understand more about the latency, 
the bookies are exporting a number of stats
----
2019-03-06 16:32:43 UTC - Matteo Merli: That includes the number of flushes in 
journal, the fsync latency and more
----
2019-03-06 16:32:51 UTC - Chris Bartholomew: @Sijie Guo Thanks for your help in 
explaining this. Much appreciated.
----
2019-03-06 16:36:06 UTC - Maarten Tielemans: These are the Prometheus stats? We 
could set that up. Any particular stats to track?
----
2019-03-06 16:59:09 UTC - Matteo Merli: * `bookkeeper_server_ADD_ENTRY_count` 
for write rate entries/s
 * `bookie_WRITE_BYTES` for MB/s rates
 * `bookkeeper_server_ADD_ENTRY_REQUEST` for rate and latencies
 * `bookie_journal_JOURNAL_SYNC_count` for sync rate
 * `bookie_journal_JOURNAL_SYNC` for fsync latencies
----
2019-03-06 17:00:05 UTC - Matteo Merli: :smile:
----
2019-03-06 17:02:24 UTC - Matteo Merli: is there any difference if you just hit 
the brokers instead of going through proxy?
----
2019-03-06 17:38:16 UTC - Vikas: hey @David Kjerrumgaard, hope you're having a 
great day!
The question is regarding providing connection settings in the 
StandardRestrictedSSLContextService for connecting the NiFi to SSL enabled 
pulsar.

The Pulsar is installed on the VMs, and the operations guy has provided me the 
`ca.cert.pem`. I am not sure what all to do with this file.

I need to provide the Keystore and Truststore's : Filename, Password and Type 
in the StandardRestrictedSSLContextService service in NiFi
----
2019-03-06 17:40:22 UTC - Vikas: 
----
2019-03-06 17:41:49 UTC - Grant Wu: @Matteo Merli Is there a way to subscribe 
to a topic with RFC 3986 Reserved characters through the Websocket API?
----
2019-03-06 17:43:07 UTC - Matteo Merli: You have to URL-encode the topic name
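
A minimal sketch of the client-side encoding step, using the consumer URL layout that appears later in this thread. The host name and subscription name are placeholders. This only shows how the path segments are percent-encoded. Whether the broker decodes the name again is the open question in the messages that follow.

```python
from urllib.parse import quote


def consumer_ws_url(host, tenant, namespace, topic, subscription, port=8080):
    """Build a WebSocket consumer URL with each path segment percent-encoded."""
    segments = [tenant, namespace, topic, subscription]
    encoded = "/".join(quote(s, safe="") for s in segments)
    return f"ws://{host}:{port}/ws/v2/consumer/persistent/{encoded}"


# "broker.example" and "my-sub" are placeholders.
print(consumer_ws_url("broker.example", "public", "default", "testing1[]", "my-sub"))
# ws://broker.example:8080/ws/v2/consumer/persistent/public/default/testing1%5B%5D/my-sub
```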
----
2019-03-06 17:44:10 UTC - Grant Wu: Hrm… I tried that and it seemed like the 
URL-encoded topic was getting subscribed to instead
----
2019-03-06 17:45:00 UTC - Grant Wu: Yeah, that appears to be the behavior I’m 
getting :confused:
----
2019-03-06 17:46:31 UTC - Vikas: hey @Matteo Merli, hope you're having a great 
day!
The question is regarding providing connection settings in the 
StandardRestrictedSSLContextService for connecting the NiFi to SSL enabled 
pulsar.

The Pulsar is installed on the VMs, and the operations guy has provided me the 
`ca.cert.pem`. I am not sure what all to do with this file to connect to the 
Pulsar hosts.

I need to provide the Keystore and Truststore's : Filename, Password and Type 
in the StandardRestrictedSSLContextService service in NiFi as below.
----
2019-03-06 17:47:17 UTC - Vikas: 
----
2019-03-06 17:48:00 UTC - Matteo Merli: I’m really not familiar with the NiFi 
side of things :confused:
----
2019-03-06 17:48:40 UTC - Vikas: no worries, thanks. I'll check with David K 
when he is online, thanks :slightly_smiling_face:
----
2019-03-06 17:48:43 UTC - Matteo Merli: Uhm, there should be some example or 
tests through the code that use that
----
2019-03-06 17:48:48 UTC - David Kjerrumgaard: @Vikas You will first need to 
obtain a copy of BOTH the keystore and truststore files and copy them onto the 
VM running NiFi.
----
2019-03-06 17:49:00 UTC - Matteo Merli: let me see if I can find that
----
2019-03-06 17:49:35 UTC - David Kjerrumgaard: Then you can configure the 
"filename" properties to point to those files on the local filesystem (VM's 
filesystem)
----
2019-03-06 17:50:12 UTC - David Kjerrumgaard: This blog post walks through the 
process in greater detail.
----
2019-03-06 17:50:13 UTC - David Kjerrumgaard: 
<http://www.treselle.com/blog/apache-nifi-data-crawling-from-https-websites/>
----
2019-03-06 17:52:19 UTC - Vikas: wonderful, thanks so much @David Kjerrumgaard
----
2019-03-06 17:52:22 UTC - David Kjerrumgaard: This blog post walks through 
setting up the SSL_Context_Service as well.
----
2019-03-06 17:52:23 UTC - David Kjerrumgaard: 
<https://bryanbende.com/development/2017/10/13/apache-nifi-tls-with-apache-solr>
----
2019-03-06 17:52:52 UTC - David Kjerrumgaard: Bottom line, I think you need to 
go back to your admin and get the proper files first
----
2019-03-06 17:53:18 UTC - Vikas: sure, I am struggling with this since 
yesterday. I was following this webpage:
<https://pulsar.apache.org/docs/en/security-tls-authentication/>
----
2019-03-06 17:53:31 UTC - Vikas: "Creating client certificates"
----
2019-03-06 17:56:24 UTC - Grant Wu: So I fired up Chrome Inspector - this is 
the URL I’m using for the websocket -
```
ws://pulsar-broker.petuum-system:8080/ws/v2/consumer/persistent/public/default/testing1%5B%5D/2d60de3c-2202-46e3-9229-0f214fb9ca75
```
----
2019-03-06 17:57:27 UTC - Alexandre DUVAL: I tried, now I think it's about 
authentication between brokers.
----
2019-03-06 17:57:39 UTC - Matteo Merli: yes, looks correct..
----
2019-03-06 17:57:57 UTC - Grant Wu: After running `bin/pulsar-admin topics list 
public/default` I get `persistent://public/default/testing1%5B%5D` as the new 
topic in the list
----
2019-03-06 17:58:53 UTC - Grant Wu: I haven’t verified that the `ws` module I’m 
using doesn’t do URLencoding on its own.  But I really doubt it, because I was 
causing exceptions in Pulsar when I didn’t URLencode by myself
----
2019-03-06 17:59:25 UTC - Alexandre DUVAL: It is. I disabled auth on the 
brokers and now it works. I will enable it again and try to figure out which 
conf field I got wrong.
----
2019-03-06 18:00:20 UTC - Grant Wu: Hrm.  They are using the URL constructor…
----
2019-03-06 18:01:32 UTC - Grant Wu: Let me try setting a breakpoint in the `ws` 
module internals
----
2019-03-06 18:02:53 UTC - Alexandre DUVAL: Does a custom role exist for 
broker-to-broker authorization?
----
2019-03-06 18:03:06 UTC - David Kjerrumgaard: @Vikas I would suggest 
following the steps in the second blog post I posted. It uses the nifi-toolkit 
to generate all the files you need in a single command, including the cert 
which you can then use to secure Pulsar as well
----
2019-03-06 18:05:16 UTC - David Kjerrumgaard: by updating the following 
property in `proxy.conf` file:
----
2019-03-06 18:05:17 UTC - David Kjerrumgaard: 
brokerClientAuthenticationParameters=tlsCertFile:/path/to/proxy.cert.pem,tlsKeyFile:/path/to/proxy.key-pk8.pem
----
2019-03-06 18:05:32 UTC - Matteo Merli: :+1:
----
2019-03-06 18:11:31 UTC - Matteo Merli: Broker will use these plugin and 
credentials

```
brokerClientAuthenticationPlugin=
brokerClientAuthenticationParameters=
```
----
2019-03-06 18:11:38 UTC - Matteo Merli: (when talking to other brokers)
----
2019-03-06 18:12:21 UTC - Alexandre DUVAL: Yes, but if I'm using JWT, do I need 
to put a super-user role token here?
----
2019-03-06 18:12:32 UTC - Alexandre DUVAL: Or does another role exist for 
this?
----
2019-03-06 18:13:18 UTC - Matteo Merli: There’s no pre-defined role. But broker 
should be using a token whose “subject” is listed as one of the “super-user” 
roles
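
A small sketch for checking which subject a token carries. It assumes the token is a standard three-part JWT and decodes the payload without verifying the signature, so it is only suitable for local inspection. The demo token is built in the example itself, and the role names are made up.

```python
import base64
import json


def jwt_subject(token):
    """Return the 'sub' claim of a JWT without verifying its signature."""
    header_b64, payload_b64, _signature = token.split(".")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore padding
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload.get("sub")


def is_super_user(token, super_user_roles):
    """True if the token's subject is one of the configured super-user roles."""
    return jwt_subject(token) in super_user_roles


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


# Build an unsigned demo token locally; "broker-admin" is a made-up role name.
demo = ".".join([
    _b64url(json.dumps({"alg": "none"}).encode()),
    _b64url(json.dumps({"sub": "broker-admin"}).encode()),
    "signature-placeholder",
])
print(jwt_subject(demo))
print(is_super_user(demo, {"broker-admin", "proxy-admin"}))
```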
----
2019-03-06 18:16:31 UTC - Vikas: sure @David Kjerrumgaard, I have created 
certificates using the NiFi toolkit
----
2019-03-06 18:16:37 UTC - Vikas: ```bash-3.2$ ls -al
total 32
drwx------  7 vsingh  2074273240   224 Mar  6 11:07 .
drwxr-xr-x@ 9 vsingh  2074273240   288 Mar  6 11:07 ..
-rw-------  1 vsingh  2074273240  3437 Mar  6 11:07 CN=bbende_OU=NIFI.p12
-rw-------  1 vsingh  2074273240    43 Mar  6 11:07 CN=bbende_OU=NIFI.password
drwx------  5 vsingh  2074273240   160 Mar  6 11:07 localhost
-rw-------  1 vsingh  2074273240  1200 Mar  6 11:07 nifi-cert.pem
-rw-------  1 vsingh  2074273240  1675 Mar  6 11:07 nifi-key.key
bash-3.2$ cd localhost/
bash-3.2$ ls -al
total 40
drwx------  5 vsingh  2074273240    160 Mar  6 11:07 .
drwx------  7 vsingh  2074273240    224 Mar  6 11:07 ..
-rw-------  1 vsingh  2074273240   3076 Mar  6 11:07 keystore.jks
-rw-------  1 vsingh  2074273240  11283 Mar  6 11:07 nifi.properties
-rw-------  1 vsingh  2074273240    911 Mar  6 11:07 truststore.jks```
----
2019-03-06 18:19:06 UTC - David Kjerrumgaard: Great, Now you need to use the 
nifi-cert.pem as the certificate on your Pulsar proxy, e.g. 
`brokerClientAuthenticationParameters=tlsCertFile:/path/to/nifi-cert.pem` and 
restart Pulsar
----
2019-03-06 18:19:41 UTC - Vikas: but where can I find the Keystore and 
Truststore password. Where do I need to provide the `ca.cert.pem` which I got 
from the Pulsar admin. Sorry for all the lame questions as I am doing and 
learning this for the first time :neutral_face:
----
2019-03-06 18:20:29 UTC - David Kjerrumgaard: No worries. Is this a test Pulsar 
cluster that you can access and modify as I am suggesting?
----
2019-03-06 18:20:59 UTC - Vikas: I can't access the Pulsar cluster
----
2019-03-06 18:23:03 UTC - David Kjerrumgaard: Ok, in THAT case you will need to 
contact the person in charge of securing that cluster, and ask them for the 
client keystore and truststore files in addition to the certificate file they 
already provided you.
----
2019-03-06 18:29:46 UTC - Vikas: oh ok sure :+1:
----
2019-03-06 18:30:09 UTC - Alexandre DUVAL: Should I url encode the token ? I 
defined 
`brokerClientAuthenticationParameters=file:///home/pulsar/apache-pulsar-2.3.0/conf/keys/broker.to.broker.token`
 but I got `Caused by: java.lang.IllegalArgumentException: Illegal character(s) 
in message header value: Bearer <TOKEN_VALUE>`.
----
2019-03-06 18:31:30 UTC - Matteo Merli: The token should already be in base64
----
2019-03-06 18:34:39 UTC - Alexandre DUVAL: `broker.to.broker.token` contains 
the exact output of its creation with `pulsar tokens create`.
----
2019-03-06 18:37:36 UTC - Alexandre DUVAL: `Illegal character(s) in message 
header value: Bearer 
eyJhbGczafzaUzI1NiJ9.kjozeajohfoaZAFAF.iKiHoo7J1Ge_G8JYau_4hUmBzSErTqhe3pye8BUrPg0
 ` (I randomly modified the chars in the token, except `.` and `_`).
----
2019-03-06 18:38:57 UTC - Matteo Merli: and where is the exception being thrown?
----
2019-03-06 18:39:21 UTC - Alexandre DUVAL: 
`java.util.concurrent.ExecutionException: 
org.apache.pulsar.client.admin.PulsarAdminException:`
----
2019-03-06 18:39:52 UTC - Alexandre DUVAL: Do you want all the stack?
----
2019-03-06 18:40:35 UTC - Matteo Merli: yes, that would help
----
2019-03-06 18:41:41 UTC - Alexandre DUVAL: @Matteo Merli more readable here.
----
2019-03-06 18:43:06 UTC - Matteo Merli: The strange thing is the `_` 
:slightly_smiling_face:
----
2019-03-06 18:43:18 UTC - Matteo Merli: I haven’t seen them when generating 
tokens
----
2019-03-06 18:43:51 UTC - Matteo Merli: Can you try removing them (just for the 
sake of seeing if they are the problem here) ?
----
2019-03-06 18:44:35 UTC - Alexandre DUVAL: They appear when you use `.` in your 
subject.
----
2019-03-06 18:44:38 UTC - Alexandre DUVAL: I'm trying.
----
2019-03-06 18:48:12 UTC - Alexandre DUVAL: Same issue without.
----
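On the `_` characters discussed above: JWT segments use the URL-safe base64 
alphabet (RFC 4648, section 5), in which `-` and `_` replace `+` and `/`, so 
an underscore can show up in any segment depending on the bytes being encoded. 
A minimal Python sketch of the difference follows. The two-byte input is an 
arbitrary value chosen for illustration and is not taken from any real token.

```python
import base64

# Two bytes whose 6-bit groups land on the last two alphabet positions (62, 63).
raw = bytes([0xFB, 0xFF])

# Standard alphabet: positions 62 and 63 are '+' and '/'.
standard = base64.b64encode(raw).decode("ascii")

# URL-safe alphabet: positions 62 and 63 are '-' and '_'.
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=
```
----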
2019-03-06 18:49:35 UTC - Matteo Merli: Ok, but client passing these tokens 
works, right?
----
2019-03-06 18:49:44 UTC - Alexandre DUVAL: Yes.
----
2019-03-06 18:49:50 UTC - Matteo Merli: Can you get a tcpdump of both cases?
----
2019-03-06 18:50:10 UTC - Matteo Merli: tcpdump -i any -w /tmp/test.pcap -s 0 
port 6650 -v
----
2019-03-06 18:50:24 UTC - Alexandre DUVAL: On the broker?
----
2019-03-06 18:52:18 UTC - Matteo Merli: Yes
----
2019-03-06 18:54:46 UTC - Alexandre DUVAL: Hum, I'm not a tcpdump master but 
I'm using wireguard so the dump will be encrypted :confused:.
----
2019-03-06 18:55:06 UTC - Alexandre DUVAL: I'll run it from the broker.
----
2019-03-06 18:55:54 UTC - Matteo Merli: Oh, is it going through TLS?
----
2019-03-06 18:56:42 UTC - Alexandre DUVAL: It is, yes.
----
2019-03-06 18:57:02 UTC - Matteo Merli: ok.. that makes it difficult then
----
2019-03-06 19:01:11 UTC - Alexandre DUVAL: You agree that the bearer token 
should be passed without the prefix `token:`, right?
----
2019-03-06 19:01:42 UTC - Matteo Merli: Correct
----
2019-03-06 19:02:58 UTC - Matteo Merli: it should be `Authorization: Bearer 
xxxx.aaaaa.zzzzzz`
----
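The thread does not establish what caused the `Illegal character(s) in message 
header value` exception. As a rough sketch only, the check below shows one way 
to confirm that a token read from a file has the shape described above (three 
base64url segments separated by dots, with no `token:` prefix and no stray 
whitespace) before it goes into the header. `bearer_header` and `JWT_COMPACT` 
are hypothetical names for illustration, not part of the Pulsar client.

```python
import re

# JWT compact form: three base64url segments separated by dots.
JWT_COMPACT = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def bearer_header(raw_token: str) -> str:
    """Return an Authorization header value, or raise if the token is malformed."""
    # Drop surrounding whitespace, e.g. a trailing newline left by a file read.
    token = raw_token.strip()
    if not JWT_COMPACT.fullmatch(token):
        raise ValueError("token is not three base64url segments separated by dots")
    return "Bearer " + token


print(bearer_header("xxxx.aaaaa.zzzzzz\n"))  # Bearer xxxx.aaaaa.zzzzzz
```

A value such as `token:xxxx.aaaaa.zzzzzz` is rejected by this check because 
`:` is outside the base64url alphabet.
----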
2019-03-06 19:06:30 UTC - Alexandre DUVAL: Sure.
----
2019-03-06 19:08:58 UTC - Alexandre DUVAL: What do you think? Did I get 
something wrong in the token generation, which would be very sad :confused:? 
Or is it something on the Pulsar side?
----
2019-03-06 19:11:39 UTC - Matteo Merli: Not sure. I’d try to see that without 
TLS to check the HTTP request
----
2019-03-06 19:12:02 UTC - Matteo Merli: in both cases, and understand the 
difference
----
2019-03-06 19:13:00 UTC - Grant Wu: It seems that `ws` isn’t doing any 
additional escaping
----
2019-03-06 19:13:11 UTC - Grant Wu: It’s just passing the URL literally to 
node’s `http.get`
----
2019-03-06 19:16:25 UTC - Grant Wu: Yeah, @Matteo Merli, I don’t think it’s 
working, this seems like a smoking gun to me:

```
19:14:58.299 [pulsar-client-io-53-6] INFO  
org.apache.pulsar.client.impl.ConsumerImpl - 
[<persistent://public/default/pusheennstormy%5B%5D>][3ab32f35-6c3f-4e8a-a120-409091bb3cea]
 Subscribing to topic on cnx [id: 0x190c12dc, L:/10.244.1.53:37080 - 
R:10.244.2.32/10.244.2.32:6650]
19:14:58.358 [pulsar-client-io-53-6] INFO  
org.apache.pulsar.client.impl.ConsumerImpl - 
[<persistent://public/default/pusheennstormy%5B%5D>][3ab32f35-6c3f-4e8a-a120-409091bb3cea]
 Subscribed to topic on 10.244.2.32/10.244.2.32:6650 -- consumer: 3
19:14:58.359 [pulsar-web-30-25] INFO  org.eclipse.jetty.server.RequestLog - 
10.244.0.61 - - [06/Mar/2019:19:14:58 +0000] "GET 
//pulsar-broker.petuum-system:8080/ws/v2/consumer/persistent/public/default/pusheennstormy%5B%5D/3ab32f35-6c3f-4e8a-a120-409091bb3cea
 HTTP/1.1" 101 0 "-" "-"  70
19:14:58.359 [pulsar-web-30-25] INFO  
org.apache.pulsar.websocket.AbstractWebSocketHandler - [/10.244.0.61:46224] New 
WebSocket session on topic <persistent://public/default/pusheennstormy%5B%5D>
```
----
2019-03-06 19:20:16 UTC - Matteo Merli: Ok. I don’t know if there’s an easy fix 
for that
----
2019-03-06 19:20:56 UTC - Matteo Merli: Possibly something in the WebSocket 
handler to ensure URL-encoded names are decoded
----
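A minimal Python sketch of the decoding step suggested above, using the topic 
name from the log. It only illustrates what percent-decoding does to 
`%5B%5D`. It is not the Pulsar WebSocket handler, and where such a step would 
belong in the broker is not settled in this thread.

```python
from urllib.parse import unquote

# Topic name exactly as it appears in the broker log above.
encoded_topic = "persistent://public/default/pusheennstormy%5B%5D"

# Percent-decoding turns %5B and %5D back into '[' and ']'.
decoded_topic = unquote(encoded_topic)

print(decoded_topic)  # persistent://public/default/pusheennstormy[]
```
----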
2019-03-06 19:22:05 UTC - Grant Wu: :disappointed:
----
2019-03-06 19:22:10 UTC - Grant Wu: I guess I should file a bug for this
----
2019-03-06 19:23:51 UTC - Grant Wu: 
<https://github.com/apache/pulsar/issues/3768> wait, is there a known deadlock 
issue with Pulsar Functions?
----
2019-03-06 19:23:58 UTC - Grant Wu: @Rajan Dhabalia?
----
2019-03-06 19:23:58 UTC - Alexandre DUVAL: I removed TLS to test, but 
WireGuard is still doing its job using UDP packets, so I can't provide it.
----
2019-03-06 19:42:35 UTC - Matteo Merli: UDP ?
----
2019-03-06 19:43:28 UTC - Matteo Merli: got it. But if you do the capture on 
broker host, you’ll have the clear text tcp stream
----
2019-03-06 19:47:52 UTC - Alexandre DUVAL: My bad, you are right, I was on the 
wrong port.
----
2019-03-06 19:47:54 UTC - Alexandre DUVAL: ```GET 
/admin/v2/non-persistent/yo/logs HTTP/1.1                                       
             
Authorization:.BearerieyJhbGrgargiJIUzI1NiJ9.eyJzdWgagJzdXBlciJ9.sqdsdf-70QuYZtvncbYY4M7oL0
User-Agent: Jersey/2.27 (HttpUrlConnection 1.8.0_192)                           
                )
Host: c1-pulsar-yo-customers.services.yo.com:2000
Accept: application/json
Via: http/1.1 yo-pulsar-c1-n4
X-Forwarded-For: 10.2.0.1
X-Forwarded-Proto: https
X-Forwarded-Host: c1-pulsar-yo-customers.services.yo.com:2000
X-Forwarded-Server: 10.2.1.4
X-Original-Principal: super   ```
----
2019-03-06 19:49:17 UTC - Alexandre DUVAL: It is not the same token Oo.
----
2019-03-06 19:49:36 UTC - Matteo Merli: This is through proxy though
----
2019-03-06 19:50:07 UTC - Matteo Merli: also the header looks messed up
----
2019-03-06 19:51:00 UTC - Alexandre DUVAL: Yes, I'll change my client.conf to 
hit the broker directly and retry.
----
2019-03-06 20:55:42 UTC - Ali Ahmed: @Grant Wu this should fix the issue
<https://github.com/apache/pulsar/pull/3772>
----
2019-03-06 20:56:20 UTC - Grant Wu: But is there an underlying deadlock in 
function-worker?
----
2019-03-06 20:56:35 UTC - Grant Wu: I’m wondering if it could be related to 
<https://github.com/apache/pulsar/issues/3715>
----
2019-03-06 20:57:43 UTC - Ali Ahmed: I don’t think that’s related
----
2019-03-06 21:04:08 UTC - Jerry Peng: @Grant Wu that only happens when running 
functions via ThreadRuntime, which is not the default. It's not a deadlock per 
se. It just occasionally takes extra long to stop a function instance running 
as a thread in the worker process, since in Java there isn’t a great way to 
just kill a thread.
----
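Python threads have the same constraint described above for Java: there is no 
supported way to kill a thread from outside, so stopping has to be 
cooperative. The sketch below shows that pattern in Python. It is not Pulsar's 
ThreadRuntime code, and `run_instance` and `stop_instance` are hypothetical 
names used only for illustration.

```python
import threading


def run_instance(stop_requested: threading.Event) -> None:
    """Stand-in for a function instance: loops until asked to stop."""
    while not stop_requested.is_set():
        # One unit of work; wait() returns early once the event is set.
        stop_requested.wait(timeout=0.01)


def stop_instance(worker: threading.Thread, stop_requested: threading.Event,
                  timeout: float = 5.0) -> bool:
    """Request a stop and report whether the thread exited within the timeout."""
    stop_requested.set()
    worker.join(timeout)
    return not worker.is_alive()


stop_requested = threading.Event()
worker = threading.Thread(target=run_instance, args=(stop_requested,))
worker.start()
stopped = stop_instance(worker, stop_requested)
print(stopped)
```

If the worker is busy in a long call that never checks the flag, the stop 
takes as long as that call does, which matches the slow shutdown described 
above.
----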
2019-03-06 23:39:03 UTC - jia zhai: will update the doc by issue #3773
----
2019-03-07 00:34:34 UTC - Ali Ahmed: I am experimenting with supporting Windows 
builds for Pulsar and have made some progress
<https://github.com/aahmed-se/incubator-pulsar/blob/win1/.appveyor.yml>
----
2019-03-07 00:34:49 UTC - Ali Ahmed: 
<https://ci.appveyor.com/project/aahmed-se/incubator-pulsar>
----
2019-03-07 00:35:23 UTC - Ali Ahmed: I need to find a boost-python package for 
Windows; I am not able to locate one in MSYS2
----
2019-03-07 00:35:32 UTC - Ali Ahmed: if anyone knows of one, let me know
----
