Why is Spark getting Kafka data out from port 2181 ?
I notice that some Spark programs contact something like 'zoo1:2181' when trying to pull data out of Kafka. Is the Kafka data actually transported over this port ? Typically ZooKeeper uses 2218 for SSL. If my Spark program were to use 2218, how would I specify a ZooKeeper-specific truststore in my Spark config ? Do I just pass -D flags via JAVA_OPTS ? Thx -- -eric ho
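For what it's worth, 'zoo1:2181' is only the ZooKeeper quorum: with the old receiver-based Kafka stream in Spark 1.x, ZooKeeper carries consumer-group bookkeeping and broker discovery, while the messages themselves come from the Kafka brokers over their own listener port (9092 by default). The direct stream skips ZooKeeper entirely. A minimal PySpark sketch of the two call styles, with made-up hostnames and topic name:

    # Receiver-based stream: ZooKeeper ("zoo1:2181") only carries metadata;
    # the message bytes still come from the brokers on their listener port.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-port-demo")
    ssc = StreamingContext(sc, 10)

    zk_stream = KafkaUtils.createStream(ssc, "zoo1:2181", "my-group", {"my_topic": 1})

    # Direct stream: no ZooKeeper at all, only the broker list.
    direct_stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "kafka1:9092"})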
how to pass trustStore path into pyspark ?
I'm trying to pass a trustStore pathname into pyspark. Which env variable, config file, or script do I need to change to do this ? I've tried setting the JAVA_OPTS env var but to no avail... any pointers much appreciated... thx -- -eric ho
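One route that is documented in the Spark configuration guide is spark.driver.extraJavaOptions / spark.executor.extraJavaOptions (or the --driver-java-options flag of spark-submit); a plain JAVA_OPTS shell variable is not among the documented settings, which may be why it had no effect. A minimal PySpark sketch, with a placeholder truststore path and password:

    # Sketch only: the truststore path and password below are placeholders.
    from pyspark import SparkConf, SparkContext

    TRUSTSTORE_FLAGS = ("-Djavax.net.ssl.trustStore=/path/to/truststore.jks "
                        "-Djavax.net.ssl.trustStorePassword=changeit")

    conf = (SparkConf()
            .setAppName("pyspark-truststore-demo")
            # Executor JVMs pick up the JSSE system properties from this setting.
            .set("spark.executor.extraJavaOptions", TRUSTSTORE_FLAGS))

    # For the driver JVM, the docs recommend passing the same flags via
    # 'spark-submit --driver-java-options "..."' or spark-defaults.conf,
    # because in client mode the driver JVM is already running by the time
    # this SparkConf is read.
    sc = SparkContext(conf=conf)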
Re: how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?
I'm interested in what I should put into the trustStore file, not just for Spark but also for the Kafka and Cassandra sides... The ways I generated the self-signed certs for the Kafka and Cassandra sides are slightly different...

On Thu, Sep 1, 2016 at 1:09 AM, Eric Ho <e...@analyticsmd.com> wrote:
> A working example would be great...
> Thx
>
> --
>
> -eric ho

-- -eric ho
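For the Cassandra side specifically, the DataStax spark-cassandra-connector exposes its own SSL settings, so a truststore holding the self-signed Cassandra cert can be configured without touching global JVM flags. A sketch with placeholder host, path and password; the exact property names should be checked against the connector version in use:

    # Assumes the DataStax spark-cassandra-connector is on the classpath.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("cassandra-ssl-demo")
            .set("spark.cassandra.connection.host", "cas1.example.com")
            .set("spark.cassandra.connection.ssl.enabled", "true")
            # Truststore holding the CA / self-signed cert the Cassandra nodes present.
            .set("spark.cassandra.connection.ssl.trustStore.path", "/path/to/cassandra-truststore.jks")
            .set("spark.cassandra.connection.ssl.trustStore.password", "changeit"))

    sc = SparkContext(conf=conf)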
how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?
A working example would be great... Thx -- -eric ho
KeyManager exception in Spark 1.6.2
I was trying to enable SSL in Spark 1.6.2 and got this exception. Not sure if I'm missing something or whether my keystore / truststore files went bad, although keytool shows that both files are fine...

16/09/01 04:01:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.security.KeyManagementException: Default SSLContext is initialized automatically
        at sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
        at javax.net.ssl.SSLContext.init(SSLContext.java:282)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:284)
        at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1121)
        at org.apache.spark.deploy.master.Master$.main(Master.scala:1106)
        at org.apache.spark.deploy.master.Master.main(Master.scala)

-- -eric ho
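For reference, the Spark 1.6 security docs drive this SecurityManager code path through the spark.ssl.* keys; one thing worth checking is whether spark.ssl.protocol is set, since this particular exception comes from the JVM's default SSLContext being re-initialised. A sketch of the relevant keys with placeholder values; for the standalone master and workers they normally go into conf/spark-defaults.conf rather than application code:

    # Placeholder values throughout; keys are from the Spark 1.6 security docs.
    from pyspark import SparkConf

    ssl_conf = (SparkConf()
                .set("spark.ssl.enabled", "true")
                .set("spark.ssl.protocol", "TLSv1.2")
                .set("spark.ssl.keyStore", "/path/to/keystore.jks")
                .set("spark.ssl.keyStorePassword", "changeit")
                .set("spark.ssl.keyPassword", "changeit")
                .set("spark.ssl.trustStore", "/path/to/truststore.jks")
                .set("spark.ssl.trustStorePassword", "changeit"))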
Spark to Kafka communication encrypted ?
I can't find anything in Spark 1.6.2's docs on how to turn on encryption for Spark-to-Kafka communication... I think the Spark docs only tell you how to turn on encryption for inter-node Spark communication... Am I wrong ? Thanks. -- -eric ho
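As far as I can tell, the Kafka integration that ships with Spark 1.6 targets the Kafka 0.8 consumer APIs, and broker-side SSL only arrived with Kafka 0.9, so the 1.6 docs indeed only cover encrypting Spark's own channels. For reference, the Kafka source in later Spark releases (2.x, via the spark-sql-kafka package) passes any option prefixed with "kafka." straight to the Kafka consumer, which is how SSL is switched on there. A sketch with placeholder broker, topic and truststore values:

    # Requires a Spark version that ships the "kafka" source plus the
    # spark-sql-kafka package; all values below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-ssl-demo").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka1:9093")
          .option("subscribe", "my_topic")
          .option("kafka.security.protocol", "SSL")
          .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
          .option("kafka.ssl.truststore.password", "changeit")
          .load())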
Do we still need to use Kryo serializer in Spark 1.6.2 ?
I heard that Kryo will be phased out at some point, but I'm not sure in which Spark release. I'm using PySpark; does anyone have any docs on how to enable / use the Kryo serializer in PySpark ? Thanks. -- -eric ho
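The standard switch is the spark.serializer setting, which works from PySpark too; note that records crossing the Python/JVM boundary are pickled regardless, so Kryo mainly affects JVM-side serialization (shuffles, broadcasts, cached objects). A minimal sketch:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("kryo-demo")
            # JVM-side serializer; Python objects are still pickled on the Python side.
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)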
Re: How to do nested for-each loops across RDDs ?
Thanks Daniel. Do you have any code fragments on using CoGroup or Join across 2 RDDs ? I don't think that the index would help much because this is an N x M operation, examining each cell of each RDD. Each comparison is complex as it needs to peer into a complex JSON structure.

On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:
> There's no real way of doing nested for-loops with RDDs because the whole
> idea is that you could have so much data in the RDD that it would be really
> ugly to store it all in one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like CoGroup or Join between the two
> RDDs. If index matters, you can use ZipWithIndex on both before you join
> and then see which indexes match up.
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote:
>
>> I've nested foreach loops like this:
>>
>> for i in A[i] do:
>>   for j in B[j] do:
>>     append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>   some complex JSON,
>>   some number
>> )
>>
>> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
>> how would my code look ?
>> Are there any RDD operators that would allow me to loop thru both RDDs
>> like the above procedural code ?
>> I can't find any RDD operators nor any code fragments that would allow me
>> to do this.
>>
>> Thing is: by the time I composed RDD(A), this RDD would already contain
>> elements of array B as well as array A.
>> Same argument for RDD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>> --
>>
>> -eric ho

-- -eric ho
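For the full N x M case described here (every element of A checked against every element of B with a custom predicate), cartesian() plus a filter is the closest RDD equivalent, though it is expensive because every pair gets materialised. A sketch with made-up sample data and a placeholder matches() predicate:

    from pyspark import SparkContext

    sc = SparkContext(appName="nested-loop-demo")

    # Each element is (some JSON-like dict, some number), as in the post.
    rdd_a = sc.parallelize([({"id": 1}, 10), ({"id": 2}, 20)])
    rdd_b = sc.parallelize([({"id": 2}, 99), ({"id": 3}, 7)])

    def matches(a, b):
        # Placeholder for the real comparison that peers into both JSON structures.
        return a[0]["id"] == b[0]["id"]

    # cartesian() forms every (a, b) pair -- the RDD version of the nested loop.
    matched_pairs = rdd_a.cartesian(rdd_b).filter(lambda p: matches(p[0], p[1]))
    print(matched_pairs.collect())

If the 'match' can be reduced to a key derived from each element, a keyBy + join (as Daniel suggests) avoids shipping every pair across the network.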
How to do nested for-each loops across RDDs ?
I've nested foreach loops like this:

for i in A[i] do:
  for j in B[j] do:
    append B[j] to some list if B[j] 'matches' A[i] in some fashion.

Each element in A or B is some complex structure like:
(
  some complex JSON,
  some number
)

Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)), how would my code look ? Are there any RDD operators that would allow me to loop thru both RDDs like the above procedural code ? I can't find any RDD operators nor any code fragments that would allow me to do this. Thing is: by the time I composed RDD(A), this RDD would already contain elements of array B as well as array A. Same argument for RDD(B). Any pointers much appreciated. Thanks. -- -eric ho
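Following the zipWithIndex suggestion from the reply above: if what matters is positional alignment between A and B, each RDD can be keyed by its element index and then joined. A sketch with made-up elements:

    from pyspark import SparkContext

    sc = SparkContext(appName="zip-with-index-demo")

    rdd_a = sc.parallelize([({"id": 1}, 10), ({"id": 2}, 20)])
    rdd_b = sc.parallelize([({"id": 9}, 30), ({"id": 8}, 40)])

    # zipWithIndex yields (element, index); flip it so the index becomes the join key.
    a_by_pos = rdd_a.zipWithIndex().map(lambda x: (x[1], x[0]))
    b_by_pos = rdd_b.zipWithIndex().map(lambda x: (x[1], x[0]))

    # Each row of the result is (index, (a_element, b_element)).
    aligned = a_by_pos.join(b_by_pos)
    print(aligned.collect())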
how to do nested loops over 2 arrays but use Two RDDs instead ?
Hi, I've two nested for loops like this:

for all elements in Array A do:
  for all elements in Array B do:
    compare a[3] with b[4]; see if they 'match' and, if they match, return that element;

If I were to represent Arrays A and B as 2 separate RDDs, what would my code look like ? I couldn't find any RDD functions that would do this for me efficiently. I don't really want elements of RDD(A) and RDD(B) flying all over the network piecemeal... Thanks. -- -eric ho
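If 'match' means a[3] equals b[4], the comparison can be turned into a join key, so Spark only shuffles elements with equal keys to the same place instead of sending every pair around the network. A sketch with made-up five-field tuples:

    from pyspark import SparkContext

    sc = SparkContext(appName="keyed-join-demo")

    rdd_a = sc.parallelize([(0, 0, 0, "k1", 0), (0, 0, 0, "k2", 0)])
    rdd_b = sc.parallelize([(0, 0, 0, 0, "k2"), (0, 0, 0, 0, "k3")])

    # Key each RDD on the field used for matching.
    a_keyed = rdd_a.keyBy(lambda a: a[3])
    b_keyed = rdd_b.keyBy(lambda b: b[4])

    # One row per matching pair: (key, (a_element, b_element)).
    matches = a_keyed.join(b_keyed)
    print(matches.collect())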
com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??
Can I specify this in my build file ? Thanks.
No logs from my cluster / worker ... (running DSE 4.6.1)
I'm submitting this via 'dse spark-submit' but somehow I don't see any logging on my cluster or worker machines... How can I find out ? My cluster is running DSE 4.6.1 with Spark enabled. My source is running Kafka 0.8.2.0. I'm launching my program on one of my DSE machines. Any insights much appreciated. Thanks.

cas1.dev% dse spark-submit --verbose --deploy-mode cluster --master spark://cas1.dev.kno.com:7077 --class com.kno.highlights.counting.service.HighlightConsumer --driver-class-path /opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib --driver-library-path /opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib --properties-file /tmp/highlights-counting.properties /opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib/kno-highlights-counting-service.kno-highlights-counting-service-0.1.jar --name HighlightConsumer

Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property: checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property: default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property: kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: app_name=HighlightConsumer
Warning: Ignoring non-spark config property: cassandra_keyspace=bookevents
Warning: Ignoring non-spark config property: scheduler_mode=FIFO
Warning: Ignoring non-spark config property: highlight_topic=highlight_topic
Warning: Ignoring non-spark config property: cassandra_host=cas1.dev.kno.com
Warning: Ignoring non-spark config property: checkpoint_interval=3
Warning: Ignoring non-spark config property: zookeeper_host=cas1.dev.kno.com
Adding default property: spark_master=spark://cas1.dev.kno.com:7077
Warning: Ignoring non-spark config property: streaming_window=10
Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property: checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property: default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property: kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: app_name=HighlightConsumer
Warning: Ignoring non-spark config property: cassandra_keyspace=bookevents
Warning: Ignoring non-spark config property: scheduler_mode=FIFO
Warning: Ignoring non-spark config property: highlight_topic=highlight_topic
Warning: Ignoring non-spark config property: cassandra_host=cas1.dev.kno.com
Warning: Ignoring non-spark config property: checkpoint_interval=3
Warning: Ignoring non-spark config property: zookeeper_host=cas1.dev.kno.com
Adding default property: spark_master=spark://cas1.dev.kno.com:7077
Warning: Ignoring non-spark config property: streaming_window=10
Parsed arguments:
  master                  spark://cas1.dev.kno.com:7077
  deployMode              cluster
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /tmp/highlights-counting.properties
  extraSparkProperties    Map()
  driverMemory            null
  driverCores             null
  driverExtraClassPath    /opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
  driverExtraLibraryPath  /opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.kno.highlights.counting.service.HighlightConsumer
  primaryResource         file:/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib/kno-highlights-counting-service.kno-highlights-counting-service-0.1.jar
  name                    com.kno.highlights.counting.service.HighlightConsumer
  childArgs               [--name HighlightConsumer]
  jars                    null
  verbose                 true
Default properties from /tmp/highlights-counting.properties:
  spark_master -> spark://cas1.dev.kno.com:7077
Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property: checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property: default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property: kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: