Spark in Scala (or Java) is much more performant if you are using RDDs: those operations basically force you to pass lambdas, which hit serialization between Java and Python types and, yes, hit the Global Interpreter Lock. But none of those things apply to DataFrames, which generate Java code.
Thanks
So, ignoring Python lambdas, is an individual's familiarity with the language the most important factor? I have also noticed that the Spark documentation has switched from Scala to Python as the first example. However, some code, for example JDBC calls, are the
As long as you don't use Python lambdas in your Spark job, there should be almost no difference between Scala and Python DataFrame code. Once you introduce Python lambdas, you will hit significant serialization penalties and have to run the actual work in Python. As long as no
I have come across occasions where teams use Python with Spark for ETL, for example processing data from S3 buckets into Snowflake with Spark. The only reason, I think, they are choosing Python as opposed to Scala is that they are more familiar with Python. Since Spark is written in Scala,
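The kind of S3-to-Snowflake ETL mentioned above might look like the sketch below. This is a configuration sketch, not a runnable pipeline: the bucket path, credentials, warehouse, and table names are all placeholders, and it assumes the Snowflake Spark connector is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-snowflake").getOrCreate()

# Read raw data from S3 (placeholder path).
events = spark.read.parquet("s3a://my-bucket/events/")

# Light DataFrame-only transformation, so the work stays in the JVM.
cleaned = events.dropDuplicates(["event_id"]).where("event_ts IS NOT NULL")

# Write to Snowflake via the connector (placeholder connection options).
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}
(cleaned.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "EVENTS")
    .mode("append")
    .save())
```

Note that because everything here goes through the DataFrame API with no Python lambdas, the Python version of this job should perform essentially the same as a Scala version.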