2019-05-08 10:03:11 UTC - James Creedy: I'm trying to get the Go client 
working. I'm not primarily a Go engineer, so it might be something Go-related.

```
brew list libpulsar
/usr/local/Cellar/libpulsar/2.3.1/include/pulsar/ (41 files)
/usr/local/Cellar/libpulsar/2.3.1/lib/libpulsar.2.3.1.dylib
/usr/local/Cellar/libpulsar/2.3.1/lib/ (2 other files)
```

```
dep init
  Using ^2.3.1 as constraint for direct dep github.com/apache/pulsar
  Locking in v2.3.1 (7a6f2cc) for direct dep github.com/apache/pulsar
  Locking in v2.1 (a96e638) for transitive dep gopkg.in/natefinch/lumberjack.v2
  Locking in v1.4.1 (8bdbc7b) for transitive dep github.com/sirupsen/logrus
  Locking in v1.0.2 (f55edac) for transitive dep github.com/konsorten/go-windows-terminal-sequences
  Locking in master (ecd444e) for transitive dep golang.org/x/sys
```

```
go run main.go
# github.com/apache/pulsar/pulsar-client-go/pulsar
../github.com/apache/pulsar/pulsar-client-go/pulsar/c_consumer.go:104:3: could not determine kind of name for C.pulsar_configure_set_negative_ack_redelivery_delay_ms
../github.com/apache/pulsar/pulsar-client-go/pulsar/c_consumer.go:133:4: could not determine kind of name for C.pulsar_consumer_configuration_set_schema_info
../github.com/apache/pulsar/pulsar-client-go/pulsar/c_consumer.go:313:2: could not determine kind of name for C.pulsar_consumer_negative_acknowledge
../github.com/apache/pulsar/pulsar-client-go/pulsar/c_consumer.go:318:2: could not determine kind of name for C.pulsar_consumer_negative_acknowledge_id
../github.com/apache/pulsar/pulsar-client-go/pulsar/c_consumer.go:133:58: could not determine kind of name for C.pulsar_schema_type
```
----
2019-05-08 10:03:56 UTC - James Creedy: Seems like there is some mismatch 
between the C++ library and the Go wrapper.
----
2019-05-08 15:29:00 UTC - Grant Wu: What version of the C++ client do you have 
installed?
----
2019-05-08 15:29:07 UTC - Grant Wu: IIRC, making sure they match is fully 
manual at this point
----
2019-05-08 18:09:17 UTC - Ryan Samo: Hey guys, we have a Spark job that uses 
the PulsarKafkaProducer to produce messages. Our problem is that if we take a 
broker node down for patching etc., the Spark job does not attempt to 
reconnect to another broker that is up and waiting for connections; instead, 
the jobs just sit there and retry the connection to the broker they were 
originally connected to. Is there a way to break this retry behavior and have 
them immediately find another broker?

Here is the exception we see over and over, with increasing times on the 
retries, adding about 5 seconds on each retry:

```
2019/05/07 13:26:14 WARN ConnectionHandler pulsar-client-io-1-1: Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused:127.0.0.1:16651 -- Will try again in 11.638 s

2019/05/07 13:26:18 ERROR PulsarWriter Executor task launch worker for task 2002016: Error writing Pulsar
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at com.pulsar.PulsarWriter.writePartition(PulsarWriter.java:56)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2019/05/07 13:26:23 ERROR PulsarWriter Executor task launch worker for task 2002016: Error writing Kafka
java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at com.pulsar.PulsarWriter.writePartition(PulsarWriter.java:56)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```
----
2019-05-08 18:11:28 UTC - Matteo Merli: @Ryan Samo What is the “serviceUrl” set 
in the client?
----
2019-05-08 18:11:41 UTC - Matteo Merli: that’s what gets used for service 
discovery
----
2019-05-08 18:13:13 UTC - Ryan Samo: It is a DNS entry that has 6 brokers 
behind it
----
2019-05-08 18:15:48 UTC - Matteo Merli: Ok, that should work. Is there an 
unusually long timeout on the DNS?

The client behavior should be to try all the IPs returned in the DNS record, in 
sequence.

Can you share the complete client logs when the reconnections are attempted?
----
2019-05-08 18:16:10 UTC - Matteo Merli: Or tcpdump on port 53 and port 6650
----
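In the same spirit, a quick way to check what that DNS record returns from the JVM's point of view (a sketch; the hostname below is a placeholder):

```java
import java.net.InetAddress;

public class DnsCheck {
    public static void main(String[] args) throws Exception {
        // Print every A record the resolver returns for the broker hostname;
        // per the above, the client is expected to try these in sequence.
        for (InetAddress addr : InetAddress.getAllByName("brokers.example.com")) {
            System.out.println(addr.getHostAddress());
        }
    }
}
```
----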
2019-05-08 18:16:37 UTC - Ryan Samo: Let me see what I can come up with, thanks 
for the help
----
2019-05-08 18:17:04 UTC - Ryan Samo: Just the Spark stuff has this issue, by 
the way; normal clients have been fine
----
2019-05-08 18:19:41 UTC - Matteo Merli: Under the hood it would always be the 
same code :slightly_smiling_face:
----
2019-05-08 18:20:23 UTC - Ryan Samo: Yeah, that’s what I saw too. The DNS is 
set to a standard timeout as well
----
2019-05-08 18:21:32 UTC - Ryan Samo: What about 
maxNumberOfRejectedRequestPerConnection? Would that come into play here?
----
2019-05-08 18:22:01 UTC - Matteo Merli: No, that would just be a way to slow 
down reconnections
----
2019-05-08 18:22:19 UTC - Ryan Samo: Ok thanks
----
2019-05-08 18:31:09 UTC - Ryan Samo: I do see that they are using the 
bootstrap.servers setting in Spark, pointing it at the DNS name, then handing 
that to the Pulsar client. I wonder if it’s caching it somehow?
----
2019-05-08 18:40:31 UTC - Matteo Merli: That shouldn’t be resolved at that 
point. That should be passed down as “serviceUrl” to the client library
----
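For reference, a minimal sketch of the intended wiring, assuming the pulsar-client-kafka drop-in (the constructor and property names mirror the Kafka producer API; the service URL and topic are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.PulsarKafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SparkSideProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Keep this a hostname, never a pre-resolved IP: it is passed straight
        // down as the Pulsar serviceUrl and resolved again on each reconnect.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pulsar://brokers.example.com:6650");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new PulsarKafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("persistent://public/default/my-topic", "key", "value"));
        }
    }
}
```
----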
2019-05-08 18:41:33 UTC - Ryan Samo: Right... I bet this is their problem. 
Thanks
----
2019-05-08 18:54:27 UTC - Matteo Merli: Check if the string is converted into 
an `InetAddress` at any point
----
2019-05-08 18:54:54 UTC - Matteo Merli: that would trigger the DNS resolution 
and it would stick with that
----
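In code, the failure mode being described looks roughly like this (a sketch with placeholder hostnames, not the actual Spark job):

```java
import java.net.InetAddress;

public class ServiceUrlPitfall {
    public static void main(String[] args) throws Exception {
        // Anti-pattern: resolving the hostname up front pins a single A record,
        // so after that broker goes down every retry targets the same dead IP.
        InetAddress pinned = InetAddress.getByName("brokers.example.com");
        String badServiceUrl = "pulsar://" + pinned.getHostAddress() + ":6650";

        // Correct: pass the hostname through unresolved and let the client
        // re-resolve it on each reconnect, failing over across all A records.
        String goodServiceUrl = "pulsar://brokers.example.com:6650";

        System.out.println("pinned (bad):      " + badServiceUrl);
        System.out.println("unresolved (good): " + goodServiceUrl);
    }
}
```
----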
2019-05-08 18:55:52 UTC - Ryan Samo: Ok will do
----
2019-05-08 19:00:20 UTC - Ryan Samo: Yup, you’re right. That’s exactly what’s 
going on. Thanks dude!
----
2019-05-08 19:19:03 UTC - Matteo Merli: Did you get any exception on the 
broker side?
----
2019-05-08 19:19:22 UTC - Alexandre DUVAL: @Matteo Merli that’s the only error 
log on the broker
----
2019-05-08 19:19:35 UTC - Alexandre DUVAL: no stacktrace
----
2019-05-08 19:20:48 UTC - Matteo Merli: :confused:
----
2019-05-08 19:20:58 UTC - Matteo Merli: there should be a log.error line before 
the `throw new RestException(Status.INTERNAL_SERVER_ERROR, e.getMessage());`
----
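Roughly the pattern being suggested, as a sketch with hypothetical names (the broker's real schema resource class differs; the point is only the log.error before the rethrow):

```java
import javax.ws.rs.core.Response.Status;
import org.apache.pulsar.broker.web.RestException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SchemasResourceSketch {
    private static final Logger log = LoggerFactory.getLogger(SchemasResourceSketch.class);

    void postSchema(String topic, Runnable applyUpdate) {
        try {
            applyUpdate.run();
        } catch (Exception e) {
            // Log before rethrowing, so every 500 leaves a stacktrace in the
            // broker log rather than just a RequestLog access-log line.
            log.error("[{}] Failed to update schema", topic, e);
            throw new RestException(Status.INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```
----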
2019-05-08 19:22:13 UTC - Matteo Merli: So is this simply updating the schema 
with auth on?
----
2019-05-08 19:23:05 UTC - Alexandre DUVAL: yes
----
2019-05-08 19:23:15 UTC - Alexandre DUVAL: I can assure you there is no 
stacktrace on the broker
----
2019-05-08 19:23:39 UTC - Alexandre DUVAL: ```
14:41:16.961 [ForkJoinPool.commonPool-worker-3] INFO  org.eclipse.jetty.server.RequestLog - 192.168.10.48 - - [07/May/2019:14:41:16 +0000] "POST /admin/v2/schemas/clevercloud/logs/full/schema HTTP/1.1" 500 386 "-" "PostmanRuntime/7.11.0" 17
14:41:28.654 [ForkJoinPool.commonPool-worker-2] INFO  org.eclipse.jetty.server.RequestLog - 192.168.10.48 - - [07/May/2019:14:41:28 +0000] "POST /admin/v2/schemas/clevercloud/logs/full/schema HTTP/1.1" 500 386 "-" "PostmanRuntime/7.11.0" 13
14:41:29.009 [ForkJoinPool.commonPool-worker-2] INFO  org.eclipse.jetty.server.RequestLog - 192.168.10.48 - - [07/May/2019:14:41:28 +0000] "POST /admin/v2/schemas/clevercloud/logs/full/schema HTTP/1.1" 500 386 "-" "PostmanRuntime/7.11.0" 14
14:41:29.353 [ForkJoinPool.commonPool-worker-2] INFO  org.eclipse.jetty.server.RequestLog - 192.168.10.48 - - [07/May/2019:14:41:29 +0000] "POST /admin/v2/schemas/clevercloud/logs/full/schema HTTP/1.1" 500 386 "-" "PostmanRuntime/7.11.0" 15
```
----
2019-05-08 19:23:42 UTC - Alexandre DUVAL: for 4 requests
----
2019-05-08 19:23:52 UTC - Alexandre DUVAL: (no log between each line)
----
2019-05-08 19:24:21 UTC - Matteo Merli: Ok, that should be easy to reproduce
----
2019-05-08 19:24:36 UTC - Matteo Merli: Could it be an error in the POST body?
----
2019-05-08 19:25:02 UTC - Alexandre DUVAL: it could
----
2019-05-08 19:25:33 UTC - Alexandre DUVAL: the body: ```{
  "type":"AVRO",
  
"schema":"{\"type\":\"record\",\"namespace\":\"com.clevercloud\",\"name\":\"ApplicationAddonLog\",\"fields\":[{\"name\":\"@source\",\"type\":\"string\"},{\"name\":\"@source_host\",\"type\":\"string\"},{\"name\":\"@timestamp\",\"type\":\"string\"},{\"name\":\"@version\",\"type\":\"string\"},{\"name\":\"ownerId\",\"type\":\"string\"},{\"name\":\"appId\",\"type\":\"string\"},{\"name\":\"deploymentId\",\"type\":[\"null\",\"string\"]},{\"name\":\"instanceId\",\"type\":\"string\"},{\"name\":\"zone\",\"type\":\"string\"},{\"name\":\"type\",\"type\":\"string\"},{\"name\":\"tags\",\"type\":\"array\",\"items\":\"string\"},{\"name\":\"syslog_pri\",\"type\":\"string\"},{\"name\":\"syslog_program\",\"type\":\"string\"},{\"name\":\"syslog_severity_code\",\"type\":\"double\"},{\"name\":\"syslog_facility_code\",\"type\":\"double\"},{\"name\":\"syslog_facility\",\"type\":\"string\"},{\"name\":\"syslog_severity\",\"type\":\"string\"},{\"name\":\"syslog_program\",\"type\":\"string\"},{\"name\":\"message\",\"type\":\"string\"}]}"
}
```
----
2019-05-08 19:26:09 UTC - Alexandre DUVAL: But when I’ve had an error in the 
body before, the response said so
----
2019-05-08 19:26:23 UTC - Matteo Merli: Ok, got it
----
2019-05-08 19:26:26 UTC - Alexandre DUVAL: like a name wasn’t specified, or 
something like that
----
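Two things in the pasted body do look suspect: `syslog_program` is declared twice, and the `tags` field uses a bare `"type":"array"` with a sibling `"items"` key instead of a nested `{"type":"array","items":"string"}` object. A quick local check with the Avro library would surface either problem before the POST (a sketch; the inlined schema is a trimmed, corrected fragment):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaParseException;

public class AvroBodyCheck {
    public static void main(String[] args) {
        // Trimmed fragment of the posted schema, with the array type nested
        // and each field name appearing exactly once.
        String schemaJson = "{\"type\":\"record\",\"namespace\":\"com.clevercloud\","
                + "\"name\":\"ApplicationAddonLog\",\"fields\":["
                + "{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
                + "{\"name\":\"syslog_program\",\"type\":\"string\"},"
                + "{\"name\":\"message\",\"type\":\"string\"}]}";
        try {
            Schema parsed = new Schema.Parser().parse(schemaJson);
            System.out.println("Schema OK: " + parsed.getFullName());
        } catch (SchemaParseException e) {
            // A duplicate field name or a bare "array" type lands here.
            System.err.println("Invalid Avro schema: " + e.getMessage());
        }
    }
}
```
----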
2019-05-08 19:26:52 UTC - Matteo Merli: Can you try adding the `log.error` 
lines before rethrowing the exception?
----
2019-05-08 19:27:40 UTC - Matteo Merli: These should always be there whenever 
we rethrow the 500 errors, otherwise it’s very difficult to figure out the issue
----
2019-05-08 19:27:55 UTC - Ryan Samo: So BOOTSTRAP_SERVERS_CONFIG is being set 
for the PulsarKafkaProducer to our DNS. Looking at the Pulsar code, I don’t 
think that’s being cached
----
2019-05-08 19:28:37 UTC - Ryan Samo: But your code only targets localhost:6650; 
ours has many brokers behind it
----
2019-05-08 19:28:48 UTC - Matteo Merli: Correct, it’s passed down as a string
----
2019-05-08 19:28:56 UTC - Alexandre DUVAL: You mean add the log.error lines, 
like classic debugging? I thought it was a bug or a permission issue, but OK, 
I will do it tomorrow and keep you informed.
----
2019-05-08 19:29:31 UTC - Ryan Samo: So if we pass our DNS into that, it should 
pick up all the servers behind it, right? Not just the first one
----
2019-05-08 19:29:40 UTC - Matteo Merli: Yes, we should always print the error 
log there
----
2019-05-08 19:29:54 UTC - Matteo Merli: Correct
----
2019-05-08 19:30:31 UTC - Matteo Merli: The client should try all the returned 
IPs in sequence
----
2019-05-08 19:31:08 UTC - Matteo Merli: (Checking if there’s any debug logs to 
turn on there)
----
2019-05-08 19:31:42 UTC - Ryan Samo: Weird. So in this case, if it was writing 
to Broker 1 and we pulled the plug, you’d expect it to move to Broker 2, 
etc. 
----
2019-05-08 19:31:45 UTC - Ryan Samo: Ok thanks
----
2019-05-08 19:34:43 UTC - Matteo Merli: Uhm, I’m not seeing an easy way. A 
capture on ports 53 and 6650 would be more helpful
----
2019-05-08 19:35:31 UTC - Ryan Samo: From the client side?
----
2019-05-08 19:35:38 UTC - Matteo Merli: Yes
----
2019-05-08 19:35:46 UTC - Ryan Samo: K thanks
----
2019-05-08 19:35:53 UTC - Matteo Merli: to see what the DNS response looks like 
and where the client attempts to connect
----
2019-05-08 19:36:07 UTC - Ryan Samo: Makes sense
----
2019-05-08 19:37:01 UTC - Alexandre DUVAL: And about the 401 on nodes which 
don’t have the topic?
----
2019-05-08 19:37:42 UTC - Matteo Merli: that doesn’t look good either..
----
2019-05-08 21:20:12 UTC - Alexandre DUVAL: Just tested: I got a 202 the first 
time I defined the schema, and a 500 on an update with a totally different 
schema. Is there backward compatibility to maintain on the schema, or 
something like that?
----
2019-05-08 21:20:14 UTC - Alexandre DUVAL: @Matteo Merli
----
2019-05-08 21:26:58 UTC - Alexandre DUVAL: even with a backward-compatible 
change (i.e. a new field), it's getting a 500
----
2019-05-08 21:29:58 UTC - Alexandre DUVAL: Oh, a compat problem should return 
IncompatibleSchemaException
----
2019-05-09 02:29:55 UTC - William Hammond: @William Hammond has joined the 
channel
----
2019-05-09 03:25:54 UTC - Andrew Ernst: @Andrew Ernst has joined the channel
----
2019-05-09 07:28:10 UTC - Bhavya Jindal: @Bhavya Jindal has joined the channel
----
2019-05-09 08:13:18 UTC - Shivji Kumar Jha: @Sijie Guo Can you please merge 
<https://github.com/apache/pulsar/pull/4232>? :slightly_smiling_face:
----
2019-05-09 08:32:25 UTC - jia zhai: @Shivji Kumar Jha merged it
----
2019-05-09 08:32:48 UTC - Shivji Kumar Jha: Thanks @jia zhai :blush:
----
2019-05-09 08:33:28 UTC - jia zhai: Welcome, and thanks for the great work.
----
2019-05-09 08:33:32 UTC - jia zhai: :slightly_smiling_face:
----
2019-05-09 09:08:49 UTC - Alexandre DUVAL: How do you set the AlwaysCompatible 
strategy? Backward, Forward, and Full work, but not this one:
```
bin/pulsar-admin namespaces set-schema-autoupdate-strategy -c "AlwaysCompatible" yo/logs
null

Reason: Either --compatibility or --disabled must be specified
```
----
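A possible workaround while the CLI flag handling is sorted out: set the strategy through the Java admin API (a sketch; the admin URL is a placeholder, and the enum and method names are from the 2.3.x admin client, so double-check them against your version):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.SchemaAutoUpdateCompatibilityStrategy;

public class SetAlwaysCompatible {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Apply the AlwaysCompatible auto-update strategy to the namespace.
            admin.namespaces().setSchemaAutoUpdateCompatibilityStrategy(
                    "yo/logs", SchemaAutoUpdateCompatibilityStrategy.AlwaysCompatible);
        }
    }
}
```
----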
