Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
Spark in Scala (or Java) is much more performant if you are using RDDs: those operations basically force you to pass lambdas, hit serialization between Java and Python types, and yes, hit the Global Interpreter Lock. But none of those things apply to DataFrames, which will generate Java code
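
A minimal PySpark sketch of the contrast being described here; the dataset, app name, and computation are hypothetical and only illustrate where the Python lambda runs versus where the DataFrame expression runs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
df = spark.range(1_000_000)  # toy dataset with a single "id" column

# RDD path: the lambda executes in Python worker processes, so every row
# is serialized between the JVM and Python and is subject to the GIL.
rdd_result = df.rdd.map(lambda row: row.id * 2).sum()

# DataFrame path: the same logic expressed with Column expressions, which
# Catalyst compiles down to JVM code; no Python workers are involved.
df_result = df.select(F.sum(F.col("id") * 2)).collect()[0][0]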

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
Thanks. So, ignoring Python lambdas, is it a matter of an individual's familiarity with the language that is the most important factor? Also, I have noticed that the Spark documentation's preference has switched from Scala to Python as the first example. However, some code examples, for example JDBC calls, are the

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
As long as you don't use Python lambdas in your Spark job, there should be almost no difference between the Scala and Python DataFrame code. Once you introduce Python lambdas, you will hit some significant serialization penalties as well as have to run actual work code in Python. As long as no
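
A small sketch of the two paths, assuming a hypothetical name-uppercasing task; the data and column names are made up, but the UDF-versus-built-in distinction is the one discussed above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])  # hypothetical data

# Python lambda wrapped as a UDF: rows are shipped to Python workers,
# processed there, and serialized back -- the penalty described above.
upper_udf = F.udf(lambda s: s.upper(), StringType())
slow = df.select(upper_udf(F.col("name")).alias("name_upper"))

# Built-in Column function: stays entirely on the JVM, so the generated
# plan is essentially the same as the equivalent Scala DataFrame code.
fast = df.select(F.upper(F.col("name")).alias("name_upper"))

slow.show()
fast.show()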

Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
I have come across occasions when teams use Python with Spark for ETL, for example, processing data from S3 buckets into Snowflake with Spark. The only reason, I think, they are choosing Python as opposed to Scala is that they are more familiar with Python. Since Spark is written in Scala,