Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-06 Thread nguyen duc tuan
Try to use DataFrame instead of RDD. Here's an introduction to DataFrames: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html 2016-05-06 21:52 GMT+07:00 pratik gawande: > Thanks Shao for quick reply. I will look
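
A minimal sketch of the two styles in PySpark (the data, column names, and app name are made up for illustration): the RDD version runs its lambdas in Python workers, while the DataFrame aggregation is planned by Catalyst and executed inside the JVM.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="rdd-vs-dataframe")  # hypothetical app name
    sqlContext = SQLContext(sc)

    # RDD version: the lambda is Python code, so every record is
    # shipped out of the JVM to a Python worker and back.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    rdd_sums = pairs.reduceByKey(lambda x, y: x + y)

    # DataFrame version: the aggregation is expressed declaratively
    # and executed inside the JVM, with no per-record Python call.
    df = sqlContext.createDataFrame(pairs, ["key", "value"])
    df_sums = df.groupBy("key").sum("value")

    print(rdd_sums.collect())
    df_sums.show()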

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-06 Thread pratik gawande
Thanks Shao for the quick reply. I will look into how PySpark jobs are executed. Any suggestions or references to docs on how to tune PySpark jobs? On Thu, May 5, 2016 at 10:12 PM -0700, "Saisai Shao" wrote: Writing RDD based application

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread Saisai Shao
Writing an RDD-based application in PySpark brings in additional overhead: Spark runs on the JVM whereas your Python code runs in the Python runtime, so data has to be communicated between the JVM world and the Python world, which requires additional serialization/deserialization and IPC. Also
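
A short sketch of where that JVM/Python boundary crossing happens (column and app names here are hypothetical, not from the original job): the same doubling operation pays the round trip when written as an RDD lambda, but stays in the JVM when written as a column expression.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext(appName="jvm-python-boundary")  # hypothetical app name
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([(1,), (2,), (3,)], ["x"])

    # Crosses the boundary: each row is serialized, sent to a Python
    # worker process, transformed by the lambda, and sent back.
    doubled_rdd = df.rdd.map(lambda row: row.x * 2)

    # Stays in the JVM: the column expression is compiled and
    # evaluated without invoking Python per row.
    doubled_df = df.select((F.col("x") * 2).alias("doubled"))

    print(doubled_rdd.collect())
    doubled_df.show()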

Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread pratik gawande
Hello, I am new to Spark. For one of my jobs I am finding a significant performance difference when run in PySpark vs Scala. Could you please let me know if this is known and whether Scala is preferred over Python for writing Spark jobs? Also, the DAG visualization shows completely different DAGs for Scala and