Re: Scala vs Python for ETL with Spark

Mich Talebzadeh Sun, 11 Oct 2020 01:41:32 -0700

Thanks Ayan.

I am not qualified to answer your first point. However, my experience with
Spark with Scala or Spark with Python agrees with your assertion that use
cases do not come into it. Most DEV/OPS work dealing with ETL is provided
by service companies whose workforce is very familiar with Java,
IntelliJ, Maven and latterly Scala. Scala is their first choice: they
create Uber Jar files with IntelliJ and MVN on a MacBook and shift them
into sandboxes for continuous tests. I believe this will remain the trend for
some time, as considerable investment has already been made there. Then I came
across another consultancy tasked with getting raw files from S3 and
putting them into Snowflake. They wanted to use Spark with Python. So your
mileage varies.



Cheers,


Mich



On Sun, 11 Oct 2020 at 02:41, ayan guha <guha.a...@gmail.com> wrote:

> I have one observation: is "Python UDF is slow due to the deserialization
> penalty" still relevant, even now that Arrow is used for in-memory data
> management and the Spark dev community has invested so heavily in making
> pandas a first-class citizen, including UDFs?
>
> As I work with multiple clients, my experience is that org culture and
> available people are the most important drivers for this choice, regardless
> of the use case. The use case is relevant only when there is a lack of
> feature parity.
>
> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Not quite sure how meaningful this discussion is, but in case someone is
>> really faced with this query, the question still is 'what is the use case'?
>> I am just a bit confused by the one-size-fits-all deterministic
>> approach here; I thought those days were over almost 10 years ago.
>> Regards
>> Gourav
>>
>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com> wrote:
>>
>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>> Science. I wrote pipelines/frameworks for large companies and Scala was
>>> a much better choice. But for ad-hoc work interfacing directly with data
>>> science experiments, PySpark presents less friction.
>>>
>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Many thanks everyone for their valuable contribution.
>>>>
>>>> We all started with Spark a few years ago, when Scala was the talk
>>>> of the town. I agree with the note that as long as Spark stayed niche and
>>>> elite, someone with Scala knowledge attracted a premium. In
>>>> fairness, in 2014-2015 there was not much talk of Data Science input (I may
>>>> be wrong). But the world has moved on, so to speak. Python itself has been
>>>> around a long time (long being relative here). Most people knew UNIX
>>>> Shell, C, Python or Perl, or some combination of these. I recall we had a
>>>> director a few years ago who asked our Hadoop admin for the root password to
>>>> log in to the edge node. Later he became head of machine learning
>>>> somewhere else, and he loved C and Python. So Python was a gift in disguise.
>>>> I think Python appeals to those who are very familiar with the CLI and shell
>>>> programming (not GUI fans). As some members alluded to, there are more people
>>>> around with Python knowledge. Most managers choose Python as the unifying
>>>> development tool because they feel comfortable with it. Frankly, I have not
>>>> seen a manager who feels at home with Scala. So in summary, it is a bit
>>>> disappointing to abandon Scala and switch to Python just for the sake of
>>>> it.
>>>>
>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have come across occasions when teams use Python with Spark for
>>>>> ETL, for example processing data from S3 buckets into Snowflake with
>>>>> Spark.
>>>>>
>>>>> The only reason I think they are choosing Python as opposed to Scala
>>>>> is that they are more familiar with Python. The fact that Spark itself
>>>>> is written in Scala is one reason I think Scala has an edge.
>>>>>
>>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark
>>>>> with Python. I understand that for data science purposes most libraries
>>>>> like TensorFlow etc. are used from Python, but I am at a loss to
>>>>> understand the validity of using Python with Spark for ETL purposes.
>>>>>
>>>>> This is my understanding rather than established fact, so I would like
>>>>> to get some informed views on this if I can.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Mich
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn:
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>> --
> Best Regards,
> Ayan Guha
>
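
On the Python UDF question quoted above: the concern is the cost of moving
values between the engine and the Python worker, and whether exchanging data
in batches reduces it. The sketch below is a standard-library illustration of
that general idea only. It does not use Spark or Arrow, and it is not how
Spark actually transfers data. The function names apply_row_at_a_time and
apply_batched and the batch_size parameter are hypothetical, chosen for this
example, and pickle merely stands in for whatever serialization format is used.

```python
import pickle


def apply_row_at_a_time(rows, fn):
    """Hypothetical stand-in for a per-value exchange: one round trip per value."""
    round_trips = 0
    out = []
    for value in rows:
        payload = pickle.dumps(value)            # serialize a single value
        result = fn(pickle.loads(payload))       # deserialize it, call fn once
        out.append(pickle.loads(pickle.dumps(result)))  # send the result back
        round_trips += 1
    return out, round_trips


def apply_batched(rows, fn, batch_size=1000):
    """Hypothetical stand-in for a batched exchange: one round trip per batch."""
    round_trips = 0
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        payload = pickle.dumps(batch)            # serialize the whole batch once
        results = [fn(v) for v in pickle.loads(payload)]
        out.extend(pickle.loads(pickle.dumps(results)))  # send results back once
        round_trips += 1
    return out, round_trips


if __name__ == "__main__":
    data = list(range(10_000))
    row_out, row_trips = apply_row_at_a_time(data, lambda x: x + 1)
    batch_out, batch_trips = apply_batched(data, lambda x: x + 1)
    # Both strategies compute the same results; only the number of
    # serialize/deserialize round trips differs.
    print(row_out == batch_out, row_trips, batch_trips)
```

Both functions return identical results; the difference is only in how many
times the fixed per-exchange overhead is paid. Whether that difference matters
for a given ETL job is something to measure on the real workload rather than
assume.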
