Hi there,

We have a small Spark cluster that processes around 40 GB of
Gzip-compressed JSON data per day. I have written a couple of word-count-like
Scala jobs that essentially pull in all the data and do some joins, group-bys, and
aggregations. A job takes around 40 minutes to complete.

Now one of the data scientists on the team wants to write some jobs in
Python. To learn Spark, he rewrote one of my Scala jobs in Python. On the
API side, everything looks more or less identical. However, his jobs take
between 5 and 8 hours to complete! We can also see that the execution plan is quite
different: I’m seeing writes to the output much later than in Scala.

Is Python I/O really that slow?


Thanks
- Marius


