Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Eric Ho
I notice that some Spark programs would contact something like 'zoo1:2181'
when trying to pull data out of Kafka.

Is the Kafka data actually transported over this port?

Typically ZooKeeper uses 2218 for SSL.
If my Spark program were to use 2218, how would I specify a ZooKeeper-specific
truststore in my Spark config?  Do I just pass -D flags via JAVA_OPTS?
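
For what it's worth, a minimal PySpark sketch of pushing the standard
javax.net.ssl system properties through Spark's extraJavaOptions settings;
the path and password below are placeholders, and this assumes the
ZooKeeper/Kafka client actually honors those JVM properties:

from pyspark import SparkConf, SparkContext

# Placeholder trust-store path/password; assumes the client library reads
# the standard javax.net.ssl system properties.
jvm_opts = ("-Djavax.net.ssl.trustStore=/path/to/zk-truststore.jks "
            "-Djavax.net.ssl.trustStorePassword=changeit")

conf = (SparkConf()
        .setAppName("kafka-ssl-sketch")
        .set("spark.driver.extraJavaOptions", jvm_opts)
        .set("spark.executor.extraJavaOptions", jvm_opts))
sc = SparkContext(conf=conf)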

Thx

-- 

-eric ho


how to pass trustStore path into pyspark ?

2016-09-02 Thread Eric Ho
I'm trying to pass a trustStore pathname into PySpark.
What env variable and/or config file or script do I need to change to do this?
I've tried setting the JAVA_OPTS env var but to no avail...
Any pointer much appreciated...  Thx
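
For reference, one hedged sketch: Spark 1.6 has spark.ssl.* keys for its own
SSL, which can be set from the driver (placeholder paths below; these keys may
need to live in spark-defaults.conf rather than application code):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("truststore-sketch")
        # Placeholder values; spark.ssl.* configures Spark's own SSL,
        # not any client library's.
        .set("spark.ssl.enabled", "true")
        .set("spark.ssl.trustStore", "/path/to/truststore.jks")
        .set("spark.ssl.trustStorePassword", "changeit"))
sc = SparkContext(conf=conf)

The same keys can also be passed on the command line via spark-submit
--conf key=value.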

-- 

-eric ho


Re: how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Eric Ho
I'm interested in what I should put into the trustStore file, not just for
Spark but also for the Kafka and Cassandra sides.
The ways I generated self-signed certs for the Kafka and Cassandra sides are
slightly different...

On Thu, Sep 1, 2016 at 1:09 AM, Eric Ho <e...@analyticsmd.com> wrote:

> A working example would be great...
> Thx
>
> --
>
> -eric ho
>
>


-- 

-eric ho


how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Eric Ho
A working example would be great...
Thx

-- 

-eric ho


KeyManager exception in Spark 1.6.2

2016-08-31 Thread Eric Ho
I was trying to enable SSL in Spark 1.6.2 and got this exception.
Not sure if I'm missing something or my keystore / truststore files are bad,
although keytool shows that both files are fine...

=

16/09/01 04:01:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Exception in thread "main" java.security.KeyManagementException: Default SSLContext is initialized automatically
    at sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
    at javax.net.ssl.SSLContext.init(SSLContext.java:282)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:284)
    at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1121)
    at org.apache.spark.deploy.master.Master$.main(Master.scala:1106)
    at org.apache.spark.deploy.master.Master.main(Master.scala)
=


-- 

-eric ho


Spark to Kafka communication encrypted ?

2016-08-31 Thread Eric Ho
I can't find anything in Spark 1.6.2's docs on how to turn on encryption for
Spark-to-Kafka communication...  I think the Spark docs only tell you how to
turn on encryption for inter-Spark-node communication...  Am I wrong?

Thanks.

-- 

-eric ho


Do we still need to use Kryo serializer in Spark 1.6.2 ?

2016-08-22 Thread Eric Ho
I heard that Kryo will get phased out at some point, but I'm not sure in which
Spark release.
I'm using PySpark; does anyone have any docs on how to use the Kryo serializer
in PySpark?
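
For reference, a minimal sketch of switching on Kryo from PySpark; note this
only affects JVM-side serialization, since PySpark records themselves are
pickled in Python:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-sketch")
        # Kryo handles serialization on the JVM side (shuffles, caching of
        # JVM objects); Python records are still pickled.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)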

Thanks.

-- 

-eric ho


Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
Thanks Daniel.
Do you have any code fragments on using CoGroup or Join across 2 RDDs?
I don't think the index would help much, because this is an N x M operation,
examining each cell of each RDD.  Each comparison is complex, as it needs to
peer into a complex JSON.
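
For reference, a minimal PySpark sketch of the cogroup approach Daniel
describes; the toy records and the choice of "id" as the match key are purely
illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="cogroup-sketch")

# Toy stand-ins for the real (complex JSON, number) elements.
rdd_a = sc.parallelize([({"id": 1}, 10), ({"id": 2}, 20)])
rdd_b = sc.parallelize([({"id": 1}, 99), ({"id": 3}, 30)])

# Key both RDDs on whatever field drives the 'match' (here the JSON "id"),
# then cogroup so comparisons happen per key instead of N x M across the cluster.
keyed_a = rdd_a.map(lambda rec: (rec[0]["id"], rec))
keyed_b = rdd_b.map(lambda rec: (rec[0]["id"], rec))

pairs = (keyed_a.cogroup(keyed_b)
         .flatMap(lambda kv: [(a, b) for a in kv[1][0] for b in kv[1][1]]))
print(pairs.collect())  # matching (A-element, B-element) pairs, here for id 1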


On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> There's no real way of doing nested for-loops with RDD's because the whole
> idea is that you could have so much data in the RDD that it would be really
> ugly to store it all in one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like CoGroup or Join between the two
> RDDs. if index matters, you can use ZipWithIndex on both before you join
> and then see which indexes match up.
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote:
>
>> I've nested foreach loops like this:
>>
>>   for a in A do:
>>     for b in B do:
>>       append b to some list if b 'matches' a in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>   some complex JSON,
>>   some number
>> )
>>
>> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
>> how would my code look?
>> Are there any RDD operators that would allow me to loop through both RDDs
>> like the above procedural code?
>> I can't find any RDD operators nor any code fragments that would allow me
>> to do this.
>>
>> Thing is: by the time I composed RDD(A), this RDD would contain
>> elements of array B as well as array A.
>> Same argument for RDD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>>
>> --
>>
>> -eric ho
>>
>>


-- 

-eric ho


How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
I've nested foreach loops like this:

  for a in A do:
    for b in B do:
      append b to some list if b 'matches' a in some fashion.

Each element in A or B is some complex structure like:
(
  some complex JSON,
  some number
)

Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)), how
would my code look?
Are there any RDD operators that would allow me to loop through both RDDs like
the above procedural code?
I can't find any RDD operators nor any code fragments that would allow me
to do this.

Thing is: by the time I composed RDD(A), this RDD would contain
elements of array B as well as array A.
Same argument for RDD(B).

Any pointers much appreciated.

Thanks.


-- 

-eric ho


how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
Hi,

I've two nested-for loops like this:

for all elements in Array A do:
  for all elements in Array B do:
    compare a[3] with b[4] to see if they 'match', and if they match, return
    that element;

If I were to represent Arrays A and B as 2 separate RDDs, what would my code
look like?

I couldn't find any RDD functions that would do this for me efficiently. I
don't really want elements of RDD(A) and RDD(B) flying all over the network
piecemeal...
Thanks.
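
For reference, a minimal PySpark sketch of the keyBy + join route; the toy
records and the a[3] / b[4] positions are only illustrative, and the join
shuffles by key so only records sharing a key ever meet:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-join-sketch")

# Toy stand-ins: a[3] and b[4] carry the field being compared.
rdd_a = sc.parallelize([[0, 0, 0, "x"], [0, 0, 0, "y"]])
rdd_b = sc.parallelize([[0, 0, 0, 0, "x"], [0, 0, 0, 0, "z"]])

# Key each RDD on the compared field, then join; only records sharing a key
# are brought together, instead of a full N x M comparison.
matched = (rdd_a.keyBy(lambda a: a[3])
           .join(rdd_b.keyBy(lambda b: b[4]))
           .values())  # (a_record, b_record) pairs that match
print(matched.collect())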

-- 

-eric ho


com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??

2015-05-04 Thread Eric Ho
Can I specify this in my build file ?

Thanks.







No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Eric Ho
I'm submitting this via 'dse spark-submit' but somehow I don't see any
logging on my cluster or worker machines...

How can I find out?

My cluster is running DSE 4.6.1 with Spark enabled.
My source is running Kafka 0.8.2.0

I'm launching my program on one of my DSE machines.

Any insights much appreciated.

Thanks.

-
cas1.dev% dse spark-submit --verbose --deploy-mode cluster --master
spark://cas1.dev.kno.com:7077 --class
com.kno.highlights.counting.service.HighlightConsumer --driver-class-path
/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
--driver-library-path
/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
--properties-file /tmp/highlights-counting.properties
/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib/kno-highlights-counting-service.kno-highlights-counting-service-0.1.jar
--name HighlightConsumer
Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property:
checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property:
default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property:
kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: app_name=HighlightConsumer
Warning: Ignoring non-spark config property: cassandra_keyspace=bookevents
Warning: Ignoring non-spark config property: scheduler_mode=FIFO
Warning: Ignoring non-spark config property: highlight_topic=highlight_topic
Warning: Ignoring non-spark config property: cassandra_host=cas1.dev.kno.com
Warning: Ignoring non-spark config property: checkpoint_interval=3
Warning: Ignoring non-spark config property: zookeeper_host=cas1.dev.kno.com
Adding default property: spark_master=spark://cas1.dev.kno.com:7077
Warning: Ignoring non-spark config property: streaming_window=10
Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property:
checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property:
default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property:
kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: app_name=HighlightConsumer
Warning: Ignoring non-spark config property: cassandra_keyspace=bookevents
Warning: Ignoring non-spark config property: scheduler_mode=FIFO
Warning: Ignoring non-spark config property: highlight_topic=highlight_topic
Warning: Ignoring non-spark config property: cassandra_host=cas1.dev.kno.com
Warning: Ignoring non-spark config property: checkpoint_interval=3
Warning: Ignoring non-spark config property: zookeeper_host=cas1.dev.kno.com
Adding default property: spark_master=spark://cas1.dev.kno.com:7077
Warning: Ignoring non-spark config property: streaming_window=10
Parsed arguments:
  master  spark://cas1.dev.kno.com:7077
  deployMode  cluster
  executorMemory  null
  executorCores   null
  totalExecutorCores  null
  propertiesFile  /tmp/highlights-counting.properties
  extraSparkPropertiesMap()
  driverMemorynull
  driverCores null
  driverExtraClassPath   
/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
  driverExtraLibraryPath 
/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass  
com.kno.highlights.counting.service.HighlightConsumer
  primaryResource
file:/opt/kno/kno-highlights-counting-service/kno-highlights-counting-service-0.1/lib/kno-highlights-counting-service.kno-highlights-counting-service-0.1.jar
  name   
com.kno.highlights.counting.service.HighlightConsumer
  childArgs   [--name HighlightConsumer]
  jarsnull
  verbose true

Default properties from /tmp/highlights-counting.properties:
  spark_master - spark://cas1.dev.kno.com:7077


Using properties file: /tmp/highlights-counting.properties
Warning: Ignoring non-spark config property:
checkpoint_directory=checkpointForHighlights
Warning: Ignoring non-spark config property: zookeeper_port=2181
Warning: Ignoring non-spark config property:
default_num_of_cores_per_topic=1
Warning: Ignoring non-spark config property: num_of_concurrent_streams=2
Warning: Ignoring non-spark config property:
kafka_consumer_group=highlight_consumer_group
Warning: Ignoring non-spark config property: