date:20170312

spark-streaming stopping

2017-03-12 Thread sathyanarayanan mudhaliyar

I am not able to stop my Spark Streaming job. Let me explain briefly:
* getting data from Kafka topic
* splitting data to create a JavaRDD
* mapping the JavaRDD to JavaPairRDD to do a reduceByKey transformation
* writing the JavaPairRDD into the C* DB // something going wrong here
the message

Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-12 Thread Frank Astier

(this was also posted to stackoverflow on 03/10) I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the

Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin

I am not an expert on this but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your dataframe is the result of an order by, Catalyst will skip the sorting when it does a sort-merge join. If your dataframe is created from storage, for instance ParquetRelation,

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-12 Thread Li Jin

Yeoul, I think a microbenchmark for PySpark serialization/deserialization would be to run a withColumn + a Python UDF that returns a constant and compare that with similar code in Scala. I am not sure if there is a way to measure just the serialization code, because pyspark API only
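
A rough sketch of the microbenchmark suggested above, for illustration only. The names `time_call` and `run_benchmark`, the row count, and the local master setting are assumptions, not part of the original message. The baseline here is a JVM-only `lit(1)` column rather than the Scala UDF the message proposes, so the difference covers the whole Python worker round trip, not pickling alone.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmark(num_rows=1000000):
    """Compare a constant Python UDF against a constant built-in column.

    Hypothetical helper: requires a local PySpark installation. The import
    lives inside the function so time_call stays usable without Spark.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, udf
    from pyspark.sql.types import IntegerType

    spark = (SparkSession.builder
             .master("local[*]")          # assumption: single-machine run
             .appName("udf-overhead")
             .getOrCreate())
    df = spark.range(num_rows)

    # Python UDF that ignores its input and returns a constant.
    const_udf = udf(lambda _: 1, IntegerType())

    def with_python_udf():
        # Every row is shipped to a Python worker and back.
        df.withColumn("c", const_udf(df["id"])).agg({"c": "sum"}).collect()

    def with_builtin():
        # Same shape of query, evaluated entirely in the JVM.
        df.withColumn("c", lit(1)).agg({"c": "sum"}).collect()

    results = {
        "python_udf": time_call(with_python_udf),
        "builtin_lit": time_call(with_builtin),
    }
    spark.stop()
    return results
```

Calling `run_benchmark()` on a machine with PySpark installed returns the two timings as a dict; the gap between them is only an upper bound on serialization cost.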

Re: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

2017-03-12 Thread Lysiane Bouchard

Hi, The error message indicates that a StreamingContext object ends up in the fields of the closure that Spark tries to serialize. Could you show us the enclosing function and component? The workarounds proposed in the following Stack Overflow reply might help you to fix the problem:

Spark Thrift server hiveStatement.getQueryLog returns empty

2017-03-12 Thread 李斌松

Spark Thrift server hiveStatement.getQueryLog returns empty?

6 matches