Thanks for the quick reply, I will check the link. Hopefully, with conversion to Python 3 (3.4), we could take advantage of asyncio and other cool new stuff ...
On Thu, Jan 29, 2015 at 7:41 PM, Reynold Xin <r...@databricks.com> wrote:
> It is something like this:
> https://issues.apache.org/jira/browse/SPARK-5097
>
> On the master branch, we have a Pandas-like API already.
>
>
> On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski <skacan...@gmail.com>
> wrote:
>
>> Hi Reynold,
>> In my project I want to use the Python API too.
>> When you mention DFs, are we talking about pandas, or is this something
>> internal to the Spark Python API?
>> If you could elaborate a bit on this or point me to alternate
>> documentation.
>> Thanks much --sasha
>>
>> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Once the data frame API is released in 1.3, you can write your thing in
>>> Python and get the same performance. It can't express everything, but
>>> for basic things like projection, filter, join, aggregate and simple
>>> numeric computation, it should work pretty well.
>>>
>>>
>>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <
>>> pastuszka.przemys...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > In my company, we've been trying to use PySpark to run ETLs on our
>>> > data. Alas, it turned out to be terribly slow compared to the Java or
>>> > Scala API (which we ended up using to meet performance criteria).
>>> >
>>> > To be more quantitative, let's consider a simple case:
>>> > I've generated a test file (848MB): seq 1 100000000 > /tmp/test
>>> >
>>> > and tried to run a simple computation on it, which includes three
>>> > steps: read -> multiply each row by 2 -> take max
>>> > Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
>>> > Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
>>> >
>>> > Here are the results of this simple benchmark:
>>> > CPython - 59s
>>> > PyPy - 26s
>>> > Scala version - 7s
>>> >
>>> > I didn't dig into what exactly contributes to the execution times of
>>> > CPython / PyPy, but it seems that serialization / deserialization
>>> > when sending data to the worker may be the issue.
>>> > I know some people have already been asking about using Jython
>>> > (
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
>>> > ,
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
>>> > ),
>>> > but it seems that no one has really done this with Spark.
>>> > It looks like the performance gain from using Jython could be huge -
>>> > you wouldn't need to spawn PythonWorkers; all the code would just be
>>> > executed inside the Spark executor JVM, using Python code compiled to
>>> > Java bytecode. Do you think that's possible to achieve? Do you see any
>>> > obvious obstacles? Of course, Jython doesn't have C extensions, but if
>>> > one doesn't need them, then it should fit here nicely.
>>> >
>>> > I'm willing to try to marry Spark with Jython and see how it goes.
>>> >
>>> > What do you think about this?
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
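
For reference, here is a rough sketch of what the data-frame version of the benchmark above might look like in Python, along the lines Reynold describes. This is only illustrative, not code from the thread: it assumes the 1.3-style DataFrame API with an existing SparkContext (sc) and SQLContext (sqlContext), and it parses the lines to integers, which differs slightly from the string-doubling in the RDD one-liners. The parsing step still runs in the Python workers; only the projection and the aggregate are evaluated as JVM-side expressions.

from pyspark.sql import Row
from pyspark.sql import functions as F

# Parse each text line into a one-column Row of longs (this step still
# runs in the Python workers).
rows = sc.textFile("/tmp/test").map(lambda x: Row(value=int(x)))
df = sqlContext.createDataFrame(rows)

# The multiply-by-2 projection and the max aggregate are column
# expressions, so they are evaluated inside the executor JVM rather than
# in Python per-row code.
result = df.select((F.col("value") * 2).alias("doubled")) \
           .agg(F.max("doubled")) \
           .collect()[0][0]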
>>
>> --
>> Aleksandar Kacanski

--
Aleksandar Kacanski