spark-streaming stopping

2017-03-12 Thread sathyanarayanan mudhaliyar
I am not able to stop my Spark Streaming job. Let me explain briefly:
* getting data from a Kafka topic
* splitting the data to create a JavaRDD
* mapping the JavaRDD to a JavaPairRDD to do a reduceByKey transformation
* writing the JavaPairRDD into the C* DB // something going wrong here
the message

Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-12 Thread Frank Astier
(This was also posted to Stack Overflow on 03/10.) I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the

Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin
I am not an expert on this, but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your DataFrame is the result of an order by, Catalyst will skip the sorting when it does a sort-merge join. If your DataFrame is created from storage, for instance ParquetRelation,
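
To illustrate why a known ordering lets the sort step be skipped, here is a minimal pure-Python sketch of the merge phase of a sort-merge join. It is not Catalyst code; the function name `merge_join` and the sample rows are made up for illustration, and it assumes both inputs are already sorted by key.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are ALREADY sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key is behind, advance left
        elif lk > rk:
            j += 1          # right key is behind, advance right
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


# Hypothetical sample data, sorted by key on both sides.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = merge_join(left, right)
print(joined)
```

Each input is scanned once from front to back, so no sorting work is needed when the order is already guaranteed. If the inputs were not sorted, the same function would silently miss matches, which is why a planner has to know the ordering before it can drop the sort.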

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-12 Thread Li Jin
Yeoul, I think one way to run a microbenchmark for PySpark serialization/deserialization would be to run a withColumn plus a Python UDF that returns a constant, and compare that with similar code in Scala. I am not sure if there is a way to measure just the serialization code, because the PySpark API only
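
As a rough lower-level proxy, the cost of pickling and unpickling a batch of rows can be timed in plain Python. This sketch measures only the standard `pickle` module, not the JVM-to-Python transfer or anything Spark does around it; the row shape, batch size, and run count are arbitrary assumptions.

```python
import pickle
import timeit


def roundtrip(rows):
    """Serialize and immediately deserialize a batch of rows with pickle."""
    return pickle.loads(pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL))


# Hypothetical batch: 10,000 small tuples standing in for rows.
rows = [(i, float(i), "row-%d" % i) for i in range(10_000)]

n_runs = 20
total_seconds = timeit.timeit(lambda: roundtrip(rows), number=n_runs)
per_run_ms = total_seconds / n_runs * 1000

print(f"pickle round trip of {len(rows)} rows: {per_run_ms:.3f} ms per run")
```

The number this prints depends on the machine and the row contents, so it is only useful for comparing alternatives on the same setup, for example different row shapes or batch sizes.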

Re: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

2017-03-12 Thread Lysiane Bouchard
Hi, the error message indicates that a StreamingContext object ends up in the fields of the closure that Spark tries to serialize. Could you show us the enclosing function and component? The workarounds proposed in the following Stack Overflow reply might help you fix the problem:
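
The same kind of failure can be reproduced in plain Python with `pickle`, as an analogy only: the original error comes from Java serialization, and the classes `Context` and `Job` below are hypothetical stand-ins. The sketch shows a method dragging its whole enclosing object into serialization, and one common workaround of copying out just the value the task needs.

```python
import functools
import operator
import pickle
import threading


class Context:
    """Hypothetical stand-in for a non-serializable object such as a StreamingContext."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects cannot be pickled


class Job:
    def __init__(self, ctx, factor):
        self.ctx = ctx
        self.factor = factor

    def scale(self, x):
        # Refers to self, so serializing this bound method also serializes self.ctx.
        return x * self.factor


job = Job(Context(), 3)

# Attempt 1: serialize the bound method. This pulls in the whole Job, including ctx.
try:
    pickle.dumps(job.scale)
    captured_ok = True
except (TypeError, AttributeError, pickle.PicklingError):
    captured_ok = False

# Attempt 2: copy the plain value into a local first, then build a function
# that depends only on that value and not on the enclosing object.
factor = job.factor
safe_scale = functools.partial(operator.mul, factor)
restored = pickle.loads(pickle.dumps(safe_scale))

print(captured_ok, restored(5))
```

In JVM code the usual equivalents are copying the needed field into a local variable before the closure, or marking the context field as transient, so that the closure no longer references the enclosing instance.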

Spark Thrift server hiveStatement.getQueryLog returns empty

2017-03-12 Thread 李斌松
Spark Thrift server hiveStatement.getQueryLog returns empty?