Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
I have one observation: is "python udf is slow due to deserialization penulty" still relevant? Even after arrow is used as in memory data mgmt and so heavy investment from spark dev community on making pandas first class citizen including Udfs. As I work with multiple clients, my exp is org

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone is really faced with this query the question still is 'what is the use case'? I am just a bit confused with the one size fits all deterministic approach here thought that those days were over almost 10 years ago. Regards Gourav

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science. I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction. On Sat, 10 Oct 2020 at 13:03, Mich

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
Many thanks everyone for their valuable contribution. We all started with Spark a few years ago where Scala was the talk of the town. I agree with the note that as long as Spark stayed nish and elite, then someone with Scala knowledge was attracting premiums. In fairness in 2014-2015, there was

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
I would not leave it to data scientists unless they will maintain it. The key decision in cases I've seen was usually people cost/availability with ETL operations cost taken into account. Often the situation is that ETL cloud cost is small and you will not save much. Then it is just skills

Meaning of terms spark as computing engine and spark cluster

2020-10-10 Thread Santosh74
Is spark compute engine only or it's also cluster which comes with set of hardware /nodes ? What exactly is spark clusterr? Dear Experts please help -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To

Spark as computing engine vs spark cluster

2020-10-10 Thread Santosh74
Is spark compute engine only or it's also cluster which comes with set of hardware /nodes ? What exactly is spark clusterr? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail:

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
It really depends on what your data scientists talk. I don’t think it makes sense for ad hoc data science things to impose a language on them, but let them choose. For more complex AI engineering things you can though apply different standards and criteria. And then it really depends on

Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free to file a JIRA for that. On Wed, 7 Oct 2020, 22:44 Jacek Pliszka, wrote: > Hi! > > Is there any place I can find information how to use gapply with arrow? > > I've tried something very simple > > collect(gapply( > df,

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich, This is a very fair question .. I've seen many data engineering teams start out with Scala because technically it is the best choice for many given reasons and basically it is what Spark is. On the other hand, almost all use cases we see these days are data science use cases where

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
What is the use case? Unless you have unlimited funding and time to waste you would usually start with that. Regards, Gourav On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer wrote: > Spark in Scala (or java) Is much more performant if you are using RDD's, > those operations basically force you