Wild guess, maybe, but do you decode the JSON records in Python? That could
be much slower, as the default library is quite slow.

If so, try ujson [1], a C implementation that is reportedly up to an order of
magnitude faster.
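
A minimal sketch of what I mean, assuming each input line holds exactly one
JSON object (my assumption; I have not seen your files). The helper name
parse_record and the sample records are made up for illustration. The import
falls back to the standard library, so the snippet also runs where ujson is
not installed:

```python
try:
    import ujson as json  # C implementation; assumed to be installed on every worker
except ImportError:
    import json  # standard-library fallback, same loads() call


def parse_record(line):
    """Decode one JSON-encoded line into a Python object."""
    return json.loads(line)


# Hypothetical sample input: one JSON object per line.
lines = ['{"word": "spark", "count": 3}', '{"word": "scala", "count": 1}']
records = [parse_record(line) for line in lines]
```

In a PySpark job the same function would be passed to map, e.g.
sc.textFile(path).map(parse_record), where sc and path stand in for your own
SparkContext and input location.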

HTH

[1] https://pypi.python.org/pypi/ujson

2014-10-22 16:51 GMT+02:00 Marius Soutier <mps....@gmail.com>:

> It’s an AWS cluster that is rather small at the moment, 4 worker nodes @
> 28 GB RAM and 4 cores, but fast enough for the current 40 GB a day.
> Data is on HDFS in EBS volumes. Each file is a Gzip-compressed collection of
> JSON objects, each file between 115 and 120 MB to be near the HDFS block size.
>
> One core per worker is permanently used by a job that allows SQL queries
> over Parquet files.
>
> On 22.10.2014, at 16:18, Arian Pasquali <ar...@arianpasquali.com> wrote:
>
> Interesting thread Marius,
> Btw, I'm curious about your cluster size.
> How small is it in terms of RAM and cores?
>
> Arian
>
> 2014-10-22 13:17 GMT+01:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
>
>> Total guess without knowing anything about your code: Do either of these
>> two notes from the 1.1.0 release notes
>> <http://spark.apache.org/releases/spark-release-1-1-0.html> affect
>> things at all?
>>
>>
>>    - PySpark now performs external spilling during aggregations. Old
>>    behavior can be restored by setting spark.shuffle.spill to false.
>>    - PySpark uses a new heuristic for determining the parallelism of
>>    shuffle operations. Old behavior can be restored by setting
>>    spark.default.parallelism to the number of cores in the cluster.
>>
>> Nick
>>
>> On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier <mps....@gmail.com>
>> wrote:
>>
>>> We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but
>>> not that...
>>>
>>> On 22.10.2014, at 13:02, Nicholas Chammas <nicholas.cham...@gmail.com>
>>> wrote:
>>>
>>> What version of Spark are you running? Some recent changes
>>> <https://spark.apache.org/releases/spark-release-1-1-0.html> to how
>>> PySpark works relative to Scala Spark may explain things.
>>>
>>> PySpark should not be that much slower, not by a stretch.
>>>
>>> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab <as...@live.com> wrote:
>>>
>>>> I'm no expert, but I looked into how the Python bits work a while back
>>>> (I was trying to assess what it would take to add F# support). It seems
>>>> Python hosts a JVM inside of it, and talks to "Scala Spark" in that JVM.
>>>> The Python server bit "translates" the Python calls to those in the JVM.
>>>> The Python Spark context is like an adapter to the JVM Spark context. If
>>>> you're seeing performance discrepancies, this might be the reason why. If
>>>> the code can be organised to require fewer interactions with the adapter,
>>>> that may improve things. Take this with a pinch of salt... I might be way
>>>> off on this :)
>>>>
>>>> Cheers,
>>>> Ashic.
>>>>
>>>> > From: mps....@gmail.com
>>>> > Subject: Python vs Scala performance
>>>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>>>> > To: user@spark.apache.org
>>>>
>>>> >
>>>> > Hi there,
>>>> >
>>>> > we have a small Spark cluster running and are processing around 40 GB
>>>> > of Gzip-compressed JSON data per day. I have written a couple of word
>>>> > count-like Scala jobs that essentially pull in all the data, do some joins,
>>>> > group bys and aggregations. A job takes around 40 minutes to complete.
>>>> >
>>>> > Now one of the data scientists on the team wants to write some
>>>> > jobs using Python. To learn Spark, he rewrote one of my Scala jobs in
>>>> > Python. From the API side, everything looks more or less identical. However,
>>>> > his jobs take between 5 and 8 hours to complete! We can also see that the
>>>> > execution plan is quite different: I’m seeing writes to the output much
>>>> > later than in Scala.
>>>> >
>>>> > Is Python I/O really that slow?
>>>> >
>>>> >
>>>> > Thanks
>>>> > - Marius
>>>> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>
>
>
