Thanks for the quick reply, I will check the link. Hopefully, with conversion to Python 3 (3.4), we could take advantage of asyncio and other cool new stuff ...
On Thu, Jan 29, 2015 at 7:41 PM, Reynold Xin <r...@databricks.com> wrote:
> It is something like this:
> https://issues.apache.org/jira/browse/SPARK-5097
>
> On the master branch, we have a Pandas-like API already.
>
>
> On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski <skacan...@gmail.com>
> wrote:
>
>> Hi Reynold,
>> In my project I want to use the Python API too.
>> When you mention DFs, are we talking about pandas, or is this something
>> internal to the Spark Python API?
>> If you could elaborate a bit on this or point me to alternate
>> documentation.
>> Thanks much --sasha
>>
>> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Once the data frame API is released in 1.3, you can write your thing in
>>> Python and get the same performance. It can't express everything, but
>>> for basic things like projection, filter, join, aggregate and simple
>>> numeric computation, it should work pretty well.
>>>
>>>
>>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <
>>> pastuszka.przemys...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > In my company, we've been trying to use PySpark to run ETLs on our
>>> > data. Alas, it turned out to be terribly slow compared to the Java or
>>> > Scala API (which we ended up using to meet performance criteria).
>>> >
>>> > To be more quantitative, let's consider a simple case:
>>> > I've generated a test file (848MB): seq 1 100000000 > /tmp/test
>>> >
>>> > and tried to run a simple computation on it, which includes three
>>> > steps: read -> multiply each row by 2 -> take max
>>> > Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
>>> > Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
>>> >
>>> > Here are the results of this simple benchmark:
>>> > CPython - 59s
>>> > PyPy - 26s
>>> > Scala version - 7s
>>> >
>>> > I didn't dig into what exactly contributes to the execution times of
>>> > CPython / PyPy, but it seems that serialization / deserialization
>>> > when sending data to the worker may be the issue.
>>> > I know some people have already been asking about using Jython
>>> > (
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
>>> > ,
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
>>> > ),
>>> > but it seems that no one has really done this with Spark.
>>> > It looks like the performance gain from using Jython could be huge -
>>> > you wouldn't need to spawn PythonWorkers; all the code would just be
>>> > executed inside the Spark executor JVM, using Python code compiled to
>>> > Java bytecode. Do you think that's possible to achieve? Do you see any
>>> > obvious obstacles? Of course, Jython doesn't have C extensions, but if
>>> > one doesn't need them, then it should fit here nicely.
>>> >
>>> > I'm willing to try to marry Spark with Jython and see how it goes.
>>> >
>>> > What do you think about this?
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
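
For reference, here is a rough sketch of what the data-frame version of the benchmark above might look like in Python, along the lines Reynold describes. This is only illustrative, not code from the thread: it assumes the 1.3-style DataFrame API with an existing SparkContext (sc) and SQLContext (sqlContext), and it parses the lines to integers, which differs slightly from the string-doubling in the RDD one-liners. The parsing step still runs in the Python workers; only the projection and the aggregate are evaluated as JVM-side expressions.

from pyspark.sql import Row
from pyspark.sql import functions as F

# Parse each text line into a one-column Row of longs (this step still
# runs in the Python workers).
rows = sc.textFile("/tmp/test").map(lambda x: Row(value=int(x)))
df = sqlContext.createDataFrame(rows)

# The multiply-by-2 projection and the max aggregate are column
# expressions, so they are evaluated inside the executor JVM rather than
# in Python per-row code.
result = df.select((F.col("value") * 2).alias("doubled")) \
           .agg(F.max("doubled")) \
           .collect()[0][0]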
>>
>> --
>> Aleksandar Kacanski

--
Aleksandar Kacanski