How to increase the parallelism of Spark Streaming application?

2018-11-06 Thread JF Chen
I have a Spark Streaming application which reads data from Kafka and saves the transformation result to HDFS. My Kafka topic originally has 8 partitions, and I repartition the data to 100 to increase the parallelism of the Spark job. Now I am wondering if I increase the Kafka partition number

SPARK-25959 - Difference in featureImportances results on computed vs saved models

2018-11-06 Thread Suraj Nayak
Hi Spark Users, I tried to implement GBT and found that the feature importances computed when the model was fit differ from those obtained after the same model is saved to storage and loaded back. I also found that once the persisted model is loaded, saved again, and reloaded, the feature

Re: Shuffle write explosion

2018-11-06 Thread Yichen Zhou
Hi Dillon, Thank you for your reply. mapToPair uses a PairFunction to transform the data into a particular Parquet format. I have tried to replace the mapToPair() function with other action operators like count() or collect(), but it didn't work. So I guess the shuffle write explosion problem has no

Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Bartosz Konieczny
Hi Matei, Thanks for your answer, it's much clearer now. I was not aware of the time needed for the release preparation. Best regards, Bartosz. On Tue, Nov 6, 2018 at 9:05 AM Matei Zaharia wrote: > Hi Bartosz, > > This is because the vote on 2.4 has passed (you can see the vote thread on >

Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Matei Zaharia
Hi Bartosz, This is because the vote on 2.4 has passed (you can see the vote thread on the dev mailing list) and we are just working to get the release into various channels (Maven, PyPI, etc), which can take some time. Expect to see an announcement soon once that’s done. Matei > On Nov 4,