I don’t have anything off hand (unfortunately I didn’t save my original tests), but you can easily build a toy example. For instance, define a simple UDF (e.g. test whether a number is < 10), then create the function in Scala:
package com.example

import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = udf((x: Int) => x < 10)
}

Compile the Scala code into a jar, then launch pyspark with --jars and --driver-class-path pointing at that jar. Inside pyspark, do something like:

from py4j.java_gateway import java_import
from pyspark.sql.column import Column
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
import time

jvm = sc._gateway.jvm
java_import(jvm, "com.example")

def udf_scala(col):
    return Column(jvm.com.example.udfObj.createUDF().apply(col))

udf_python = udf(lambda x: x < 10, BooleanType())

df = spark.range(10000000)
df.cache()
df.count()

df1 = df.filter(df.id < 10)
df2 = df.filter(udf_scala(df.id))
df3 = df.filter(udf_python(df.id))

t1 = time.time()
df1.count()
t2 = time.time()
df2.count()
t3 = time.time()
df3.count()
t4 = time.time()

print("time for builtin: " + str(t2 - t1))
print("time for scala: " + str(t3 - t2))
print("time for python: " + str(t4 - t3))

The differences between the times tell you how long each variant takes (the caching is done to make sure we don’t hit issues where the range is created once and then reused).

BTW, I saw this can be very sensitive to the cluster and its configuration. I ran it on two different cluster configurations, several times each, to get some idea of the noise. Of course, the more complicated the UDF, the less the overhead affects you.

Hope this helps,
Assaf

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Sunday, September 04, 2016 11:00 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: Scala Vs Python

Hi

This one is quite interesting. Is it possible to share a few toy examples?

On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <assaf.mendel...@rsa.com> wrote:
I am not aware of any official testing but you can easily create your own.
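Since the thread stresses re-running the benchmark to average out cluster noise, the ad-hoc t1..t4 measurement above can be factored into a tiny helper. This is a plain-Python sketch (no Spark required; the helper name `time_it` is my own, not from the thread) — pass it any zero-argument action, such as a lambda wrapping a DataFrame count:

```python
import time

def time_it(fn, runs=3):
    """Run fn several times and return the best wall-clock time.

    A single measurement can be dominated by cluster, cache, or
    warm-up effects, so take the minimum over a few runs to get a
    less noisy estimate of the steady-state cost.
    """
    best = float("inf")
    for _ in range(runs):
        t0 = time.time()
        fn()
        best = min(best, time.time() - t0)
    return best

# In the benchmark above you would time the actions like:
#   t_builtin = time_it(lambda: df1.count())
#   t_scala   = time_it(lambda: df2.count())
#   t_python  = time_it(lambda: df3.count())
# Standalone demonstration with a plain computation:
elapsed = time_it(lambda: sum(range(100000)))
print("best of 3 runs: %.4fs" % elapsed)
```

Taking the minimum (rather than the mean) is a deliberate choice here: the fastest observed run is the one least polluted by external interference.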
In the tests I ran, Python UDFs were more than 10 times slower than Scala UDFs (and in some cases closer to 50 times slower). That said, it depends on how you use the UDF. For example, say you have a 1-billion-row table which you aggregate down to a 10K-row table. If you apply the Python UDF at the beginning it might hurt badly, but if you apply it to the 10K-row table the overhead might be negligible.

Furthermore, you can always write the UDF in Scala and wrap it. This is something my team did. We have data scientists working on Spark in Python. Normally they can use the existing functions to do what they need (Spark already has a pretty nice spread of functions which answer most of the common use cases). When they need a new UDF or UDAF they simply ask my team (which does the engineering), and we write them a Scala one and then wrap it to be accessible from Python.

From: ayan guha [mailto:[hidden email]]
Sent: Friday, September 02, 2016 12:21 AM
To: kant kodali
Cc: Mendelson, Assaf; user
Subject: Re: Scala Vs Python

Thanks all for your replies.

Feature parity: MLlib, RDD and DataFrame features are totally comparable. Streaming is now at par in functionality too, I believe. However, what really worries me is not having the Dataset API at all in Python. I think that's a deal breaker.

Performance: I do get this bit when RDDs are involved, but not when DataFrames are the only construct I am operating on. DataFrames are supposed to be language-agnostic in terms of performance. So why do people think Python is slower? Is it because of UDFs? Any other reason? Is there any kind of benchmarking/stats around Python UDF vs Scala UDF comparisons, like the ones out there for RDDs?

@Kant: I am not comparing ANY applications. I am comparing SPARK applications only.
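Part of the gap described above is easy to see even outside Spark. The sketch below is a Spark-free analogy (the function names are my own): it compares a predicate evaluated inline against the same predicate hidden behind a per-element function call. The measured difference is pure function-call overhead — in real Spark, a Python UDF additionally pays JVM-to-Python serialization per batch of rows, which is usually the dominant cost:

```python
import time

data = list(range(1000000))

def count_inline(rows):
    # Predicate evaluated inline, loosely analogous to a builtin
    # Column expression that never leaves the optimized engine.
    return sum(1 for x in rows if x < 10)

def count_per_row_call(rows, f):
    # Predicate behind a per-row Python function call, loosely
    # analogous to a Python UDF invoked once per row.
    return sum(1 for x in rows if f(x))

t0 = time.time()
a = count_inline(data)
t1 = time.time()

t2 = time.time()
b = count_per_row_call(data, lambda x: x < 10)
t3 = time.time()

print("inline predicate:  %.4fs" % (t1 - t0))
print("per-row call:      %.4fs" % (t3 - t2))
assert a == b == 10  # both count the values 0..9
```

This also illustrates the thread's advice about ordering: the per-row cost scales with the number of rows the UDF touches, so aggregating first and applying the UDF to the small result side-steps most of the overhead.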
I would be glad to hear your opinion on why pyspark applications will not work; if you have any benchmarks, please share them if possible.

On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
C'mon man, this is a no-brainer. Dynamically typed languages for large code bases or large-scale distributed systems make absolutely no sense. I could write a 10-page essay on why that wouldn't work so great. You might be wondering why Spark would have it then? Probably because of its ease of use for ML (that would be my best guess).

On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
I believe this would greatly depend on your use case and your familiarity with the languages.

In general, Scala has much better performance than Python, and not all interfaces are available in Python. That said, if you are planning to use DataFrames without any UDFs then the performance hit is practically nonexistent. Even if you need UDFs, it is possible to write them in Scala, wrap them for Python, and still get away without the performance hit. Python does not have an interface for UDAFs. I believe that if you have large structured data and do not generally need UDFs/UDAFs, you can certainly work in Python without losing too much.

From: ayan guha [mailto:[hidden email]]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python

Hi Users

Thought to ask (again and again) the question: while building a production application, should I use Scala or Python? I have read many if not most articles, but all seem pre-Spark 2. Has anything changed with Spark 2, either pro-Scala or pro-Python? I am thinking of performance, feature parity and future direction, not so much skillset or ease of use.
Or, if you think it is a moot point, please say so as well. Any real-life examples, production experience, anecdotes, personal taste, profanity — all are welcome :)

--
Best Regards,
Ayan Guha

________________________________
View this message in context: RE: Scala Vs Python<http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.

--
Best Regards,
Ayan Guha

________________________________
View this message in context: RE: Scala Vs Python<http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27650.html>
Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.

--
Best Regards,
Ayan Guha

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27651.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.