Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
Hey,

My 2 cents on CI/CD for PySpark: you can leverage pytest plus Holden Karau's
Spark testing libraries (spark-testing-base) for CI, giving you almost the
same functionality as Scala. I say almost because in Scala you have nice,
descriptive FunSpecs.

For me the choice is based on expertise. Having worked with teams that are
99% Python, the cost of retraining - or even hiring - is too big, especially
if you have an existing project and aggressive deadlines.
Please feel free to object.
Kind Regards
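To make the CI angle concrete, here is a minimal, hedged sketch of one common pattern: keep row-level logic in plain Python functions so pytest can exercise it on every commit without a cluster (the function and tests below are illustrative, not from any real project; spark-testing-base or a local SparkSession would then cover the DataFrame layer on top).

```python
from typing import Optional

# Hypothetical row-level transformation kept free of Spark imports so it
# can be unit-tested in CI without a JVM; in the pipeline it would be
# wrapped in a PySpark UDF.
def normalise_country(code: Optional[str]) -> Optional[str]:
    """Trim and upper-case a free-form country code; empty becomes None."""
    if code is None:
        return None
    cleaned = code.strip().upper()
    return cleaned or None

# pytest collects these automatically in CI.
def test_strips_and_uppercases():
    assert normalise_country("  gb ") == "GB"

def test_handles_empty_and_none():
    assert normalise_country("") is None
    assert normalise_country(None) is None
```

The DataFrame-level tests are then the thin part: build a small local SparkSession fixture (which is what spark-testing-base helps with), apply the UDF, and compare against an expected DataFrame.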

>


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Hi Wim,


I think we are splitting the atom here, but my inference about functionality
was based on the following:

   1. Spark is written in Scala, so knowing the Scala programming language
   helps coders navigate the source code when something does not function
   as expected.
   2. Using Python on a framework written in Scala increases the probability
   of issues and bugs, because translation between these two different
   languages is difficult.
   3. Using Scala for Spark provides access to the latest features of the
   Spark framework, as they are available first in Scala and then ported to
   Python.
   4. Some functionality is not available in Python; I have seen this a few
   times in the Spark docs.

There is an interesting write-up on this, although it does not touch on
CI/CD aspects.


 Developing Apache Spark Applications: Scala vs. Python
<https://www.pluralsight.com/blog/software-development/scala-vs-python>


Regards,


Mich





Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
It's really a very big discussion, PySpark vs Scala. I have a little bit of
experience with automating CI/CD when it's a JVM-based language.
I would like to take this as an opportunity to understand the end-to-end
CI/CD flow for PySpark-based ETL pipelines.

Could someone please list the steps of pipeline automation when it comes to
PySpark-based pipelines in production?

//William
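Not an authoritative answer, but the automation usually boils down to: lint and unit-test with pytest, package the job, then hand the artifact to a scheduler. A hedged GitHub-Actions-style sketch, where every name, path and version is a placeholder rather than a reference to any real project:

```yaml
# Hypothetical CI sketch for a PySpark ETL repository (all names are placeholders).
name: pyspark-etl-ci
on: [push]
jobs:
  test-and-package:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt pytest flake8 build
      # Lint plus unit tests against a local SparkSession; no cluster needed
      - run: flake8 src/
      - run: pytest tests/
      # Package the job (a wheel, or a zip of src/) for spark-submit --py-files
      - run: python -m build
      # Deployment is environment-specific: push dist/ to an artifact store
      # and point the scheduler (Airflow, cron, etc.) at the new version.
      - run: echo "upload dist/ to the artifact store (placeholder)"
```

The same shape ports to Jenkins or GitLab CI; the PySpark-specific part is mainly that unit tests can run against a local SparkSession inside the CI container, while integration tests typically run the packaged job via spark-submit against a staging cluster.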


-- 
Regards,
William R
+919037075164


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argumentation you mention that
'functionality is sacrificed in favour of the availability of resources'.
That's where I disagree with you but agree with Sean: that is mostly not
true.

You also mentioned this in your previous posts. The only reason we
sometimes have to bail out to Scala is performance with certain UDFs.



Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Thanks for the feedback Sean.

Kind regards,

Mich



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills
you have' are a valid and important determiner of what tools you pick.
I disagree that you just have to pick the optimal tool for everything.
Sounds good until that comes in contact with the real world.
For Spark, Python vs Scala just doesn't matter a lot, especially if you're
doing DataFrame operations. By design. So I can't see there being one
answer to this.



Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
Hi Mich,

this is turning into a troll now, can you please stop this?

No one uses Scala where Python should be used, and no one uses Python where
Scala should be used - it all depends on requirements. Everyone understands
polyglot programming and how to use relevant technologies best to their
advantage.


Regards,
Gourav Sengupta


>


Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Today I had a discussion with a lead developer on a client site regarding
Scala or PySpark with Spark.

They were not doing data science, and he reluctantly agreed that PySpark
was used for ETL.

In mitigation he mentioned that in his team he is the only one who is an
expert on Scala (his words) and the rest are Python-savvy.

It shows again that at times functionality is sacrificed in favour of the
availability of resources, and it reaffirms what some members were saying
regarding choosing a technology based on TCO, favouring Python over Scala.

HTH,

Mich





On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
wrote:

> I have come across occasions when the teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> that they are more familiar with Python. The fact that Spark is written in
> Scala is itself an indication of why I think Scala has an edge.
>
> I have not done a one-to-one comparison of Spark with Scala vs Spark with
> Python. I understand that for data science purposes most libraries like
> TensorFlow are written in Python, but I am at a loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understandings, but they are not facts, so I would like to
> get some informed views on this if I can.
>
> Many thanks,
>
> Mich


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
Holy war is a bit dramatic, don't you think? The difference between Scala
and Python will always be very relevant when choosing between Spark and
PySpark, so I wouldn't call it irrelevant to the original question.

br,

molotch

>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
I'm sorry you were offended. I'm not an expert in Python and I wasn't
trying to attack you personally. It's just an opinion about what makes a
language better or worse, not the single source of truth, so you don't have
to take offence. In the end it's about context and what you're trying to
achieve under what circumstances.

I know a little about both programming and ETL. To say I know nothing is
taking it a bit far. I don't know everything worth knowing; that's for sure
and goes without saying.

It's fine to love Python, and good for you being able to write Python
programs wiping Java commercial stacks left and right. It's just my opinion
that mutable, dynamically typed languages encourage/enforce bad habits.

The larger the application and team get, the worse off you are (again, just
an opinion). Not everyone agrees (just look at Python's popularity), but
it's definitely a relevant aspect when deciding between Spark and PySpark.


br,

molotch

>>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my
experience, when performance is super important you'll end up needing to do
some of your work in the JVM, but in many situations what matters most is
what your team and company are familiar with and the ecosystem of tooling
for your domain.

Since that can change so much between people and projects I think arguing
about the one true language is likely to be unproductive.

We’re all here because we want Spark and more broadly open source data
tooling to succeed — let’s keep that in mind. There is far too much stress
in the world, and I know I’ve sometimes used word choices I regret
especially this year. Let’s all take the weekend to do something we enjoy
away from Spark :)

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
It seems that the thread has converted into a holy war that has nothing to
do with the original question. If it has, that's super disappointing.

Sent from my iPhone


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Sasha Kacanski
And you are an expert on Python! Idiomatic...
Please do everyone a favor and stop commenting on things you have no idea
about.
I built ETL systems in Python that wiped Java commercial stacks left and
right. PySpark was, is, and will be a second-class citizen in the Spark
world. That has nothing to do with Python.
And as far as Scala is concerned, good luck with it...





>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala come down to Spark, the
languages themselves, and what kind of data engineer you will get when you
try to hire for the different solutions.

With PySpark you get less functionality and increased complexity from the
py4j Java interop compared to vanilla Spark. Why would you want that? Maybe
you want the Python ML tools and have a clear use case; then go for it. If
not, avoid the increased complexity and reduced functionality of PySpark.

Python vs Scala? Idiomatic Python is a lesson in bad programming
habits/ideas; there's no other way to put it. Do you really want
programmers enjoying coding in such a language hacking away at your system?

Scala might be far from perfect, with the plethora of ways to express
yourself. But Python < 3.5 is not fit for anything except simple scripting,
IMO.

For doing exploratory data analysis in a Jupyter notebook, PySpark seems
like a fine idea. For coding an entire ETL library including state
management, the whole kitchen including the sink: Scala every day of the
week.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-15 Thread Mich Talebzadeh
Hi,

I spent a few days converting one of my Spark/Scala scripts to Python. It
was interesting, but at times it looked like trench warfare. There is a lot
of handy stuff in Scala, like case classes for defining column headers
etc., that does not seem to be available in Python (possibly my lack of
in-depth Python knowledge). Moreover, the Spark documents frequently state
the availability of features for Scala and Java but not Python.

Looking around, everything written for Spark using Python is a work-around.
I am not considering Python for data science, as my focus has been on using
Python with Spark for ETL; I published a thread on this today with two
examples of the code written in Scala and Python respectively. OK, I admit
lambda functions in Python with map are a great feature, but that is all;
the rest can be achieved better with Scala. So I buy the view that people
tend to use Python with Spark for ETL because (with great respect) they
cannot be bothered to pick up Scala (I trust I am not unkind). So that is
it. When I was converting the code I remembered that I still use a Nokia
8210 (21-year-old technology) from time to time. Old, sturdy, long battery
life and very small. Compare that one with an iPhone. That is a fair
comparison between Spark on Scala and Spark on Python :)
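On the case-class point: a rough, stdlib-only Python substitute is a dataclass whose type hints drive the column schema. This is a sketch under my own naming (`TradeRow` and the type table are illustrative), not an established PySpark idiom:

```python
from dataclasses import dataclass, fields
from datetime import date

# Hypothetical record type playing the role of a Scala case class:
# column names and types are declared once, in one place.
@dataclass
class TradeRow:
    trade_id: int
    symbol: str
    price: float
    traded_on: date

# Illustrative mapping from Python annotations to Spark SQL DDL type names.
_SPARK_TYPE = {int: "bigint", str: "string", float: "double", date: "date"}

def ddl_schema(cls) -> str:
    """Render a dataclass as a Spark DDL schema string."""
    return ", ".join(f"{f.name} {_SPARK_TYPE[f.type]}" for f in fields(cls))

print(ddl_schema(TradeRow))
# trade_id bigint, symbol string, price double, traded_on date
```

PySpark accepts such DDL strings as the schema argument of `spark.createDataFrame` and the DataFrame readers' `.schema(...)`, so the dataclass can double as both the row type and the schema definition.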

HTH




On Sun, 11 Oct 2020 at 20:46, Mich Talebzadeh 
wrote:

> Hi,
>
> With regard to your statement below
>
> ".technology choices are agnostic to use cases according to you"
>
> If I may say, I do not think that was the message implied. What was said
> was that in addition to "best technology fit" there are other factors
> "equally important" that need to be considered, when a company makes a
> decision on a given product use case.
>
> As others have stated, what technology stacks you choose may not be the
> best available technology but something that provides an adequate solution
> at a reasonable TCO. Case in point if Scala in a given use case is the best
> fit but at higher TCO (labour cost), then you may opt to use Python or
> another because you have those resources available in-house at lower cost
> and your Data Scientists are eager to invest in Python. Companies these
> days are very careful where to spend their technology dollars or just
> cancel the projects totally. From my experience, the following are
> crucial in deciding what to invest in
>
>
>- Total Cost of Ownership
>- Internal Supportability & Operability thus avoiding single point of
>failure
>- Maximum leverage, strategic as opposed to tactical (example is
>Python considered more of a strategic product or Scala)
>-  Agile and DevOps compatible
>- Cloud-ready, flexible, scale-out
>- Vendor support
>- Documentation
>- Minimal footprint
>
> I trust this answers your point.
>
>
> Mich
>
>
>
>
>
>
> On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta 
> wrote:
>
>> So Mich and rest,
>>
>> technology choices are agnostic to use cases according to you? This is
>> interesting, really interesting. Perhaps I stand corrected.
>>
>> Regards,
>> Gourav
>>
>> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> if we take Spark and its massive parallel processing and in-memory
>>> cache away, then one can argue anything can do the "ETL" job. just write
>>> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
>>> another often using JDBC connections. However, we all concur that may not
>>> be good enough with Big Data volumes. Generally speaking, there are two
>>> ways of making a process faster:
>>>
>>>
>>>1. Do more intelligent work by creating indexes, cubes etc thus
>>>reducing the processing time
>>>2. Throw hardware and memory at it using something like Spark
>>>multi-cluster with fully managed cloud service like Google Dataproc
>>>
>>>
>>> In general, one would see an order of magnitude performance gains.
>>>
>>>
>>> HTH,
>>>
>>>
>>> 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
Hi,

With regard to your statement below

".technology choices are agnostic to use cases according to you"

If I may say, I do not think that was the message implied. What was said
was that in addition to "best technology fit" there are other factors
"equally important" that need to be considered, when a company makes a
decision on a given product use case.

As others have stated, the technology stack you choose may not be the best
available technology but something that provides an adequate solution at a
reasonable TCO. Case in point: if Scala is the best fit for a given use case
but comes at a higher TCO (labour cost), then you may opt for Python or
another language because you have those resources available in-house at lower
cost and your data scientists are eager to invest in Python. Companies these
days are very careful where they spend their technology dollars, or they just
cancel projects outright. From my experience, the following are crucial in
deciding what to invest in:


   - Total Cost of Ownership
   - Internal Supportability & Operability, thus avoiding single points of
   failure
   - Maximum leverage, strategic as opposed to tactical (for example, is
   Python considered more of a strategic product than Scala?)
   - Agile and DevOps compatible
   - Cloud-ready, flexible, scale-out
   - Vendor support
   - Documentation
   - Minimal footprint

I trust this answers your point.


Mich






On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta 
wrote:

> So Mich and rest,
>
> technology choices are agnostic to use cases according to you? This is
> interesting, really interesting. Perhaps I stand corrected.
>
> Regards,
> Gourav
>
> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh 
> wrote:
>
>> if we take Spark and its massive parallel processing and in-memory
>> cache away, then one can argue anything can do the "ETL" job. just write
>> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
>> another often using JDBC connections. However, we all concur that may not
>> be good enough with Big Data volumes. Generally speaking, there are two
>> ways of making a process faster:
>>
>>
>>1. Do more intelligent work by creating indexes, cubes etc thus
>>reducing the processing time
>>2. Throw hardware and memory at it using something like Spark
>>multi-cluster with fully managed cloud service like Google Dataproc
>>
>>
>> In general, one would see an order of magnitude performance gains.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>> On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:
>>
>>> But when you have fairly large volume of data that is where spark comes
>>> in the party. And I assume the requirement of using spark is already
>>> established in the original question and the discussion is to use Python vs
>>> scala/java.
>>>
>>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
>>> wrote:
>>>
 If org has folks that can do python seriously why then spark in the
 first place. You can do workflow on your own, streaming or batch or what
 ever you want.
 I would not do anything else aside from python, but that is me.

 On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:

> I have one observation: is "python udf is slow due to deserialization
> penalty" still relevant? Even after Arrow is used as in-memory data mgmt
> and so heavy investment from spark dev community on making pandas first
> class citizen including Udfs.
>
> As I work with multiple clients, my exp is org culture and available
> people are most imp driver for this choice regardless the use case. Use
> case is relevant only when there is a feature disparity
>
> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Not quite sure how meaningful this discussion is, but in case someone
>> is really faced with this query the question still is 'what is the use
>> case'?
>> I am just a bit confused with the one size fits all deterministic
>> approach here; I thought those days were over almost 10 years ago.
>> Regards
>> Gourav
>>
>> On Sat, 10 Oct 2020, 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Gourav Sengupta
So Mich and rest,

technology choices are agnostic to use cases according to you? This is
interesting, really interesting. Perhaps I stand corrected.

Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh 
wrote:

> if we take Spark and its massive parallel processing and in-memory
> cache away, then one can argue anything can do the "ETL" job. just write
> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
> another often using JDBC connections. However, we all concur that may not
> be good enough with Big Data volumes. Generally speaking, there are two
> ways of making a process faster:
>
>
>1. Do more intelligent work by creating indexes, cubes etc thus
>reducing the processing time
>2. Throw hardware and memory at it using something like Spark
>multi-cluster with fully managed cloud service like Google Dataproc
>
>
> In general, one would see an order of magnitude performance gains.
>
>
> HTH,
>
>
> Mich
>
>
>
>
>
>
>
> On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:
>
>> But when you have fairly large volume of data that is where spark comes
>> in the party. And I assume the requirement of using spark is already
>> established in the original question and the discussion is to use Python vs
>> scala/java.
>>
>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
>> wrote:
>>
>>> If org has folks that can do python seriously why then spark in the
>>> first place. You can do workflow on your own, streaming or batch or what
>>> ever you want.
>>> I would not do anything else aside from python, but that is me.
>>>
>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>>>
 I have one observation: is "python udf is slow due to deserialization
 penalty" still relevant? Even after Arrow is used as in-memory data mgmt
 and so heavy investment from spark dev community on making pandas first
 class citizen including Udfs.

 As I work with multiple clients, my exp is org culture and available
 people are most imp driver for this choice regardless the use case. Use
 case is relevant only when there is a feature disparity

 On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> Not quite sure how meaningful this discussion is, but in case someone
> is really faced with this query the question still is 'what is the use
> case'?
> I am just a bit confused with the one size fits all deterministic
> approach here; I thought those days were over almost 10 years ago.
> Regards
> Gourav
>
> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>
>> I agree with Wim's assessment of data engineering / ETL vs Data
>> Science. I wrote pipelines/frameworks for large companies and Scala
>> was
>> a much better choice. But for ad-hoc work interfacing directly with data
>> science experiments pyspark presents less friction.
>>
>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Many thanks everyone for their valuable contribution.
>>>
>>> We all started with Spark a few years ago where Scala was the talk
>>> of the town. I agree with the note that as long as Spark stayed niche and
>>> elite, then someone with Scala knowledge was attracting premiums. In
>>> fairness in 2014-2015, there was not much talk of Data Science input (I 
>>> may
>>> be wrong). But the world has moved on so to speak. Python itself has 
>>> been
>>> around a long time (long being relative here). Most people either knew 
>>> UNIX
>>> Shell, C, Python or Perl or a combination of all these. I recall we had 
>>> a
>>> director a few years ago who asked our Hadoop admin for root password to
>>> log in to the edge node. Later he became head of machine learning
>>> somewhere else and he loved C and Python. So Python was a gift in 
>>> disguise.
>>> I think Python appeals to those who are very familiar with CLI and shell
>>> programming (Not GUI fan). As some members alluded to there are more 
>>> people
>>> around with Python knowledge. Most managers choose Python as the 
>>> unifying
>>> development tool because they feel comfortable with it. Frankly I have 
>>> not
>>> seen a manager who feels at home with Scala. So in summary it is a bit
>>> disappointing to abandon Scala and switch to Python just for the sake 
>>> of it.
>>>
>>> Disclaimer: These are opinions and not facts so to speak :)
>>>
>>> Cheers,
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
If we take Spark and its massively parallel processing and in-memory
cache away, then one can argue anything can do the "ETL" job: just write
some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
another, often using JDBC connections. However, we all concur that may not
be good enough with Big Data volumes. Generally speaking, there are two
ways of making a process faster:


   1. Do more intelligent work by creating indexes, cubes etc thus reducing
   the processing time
   2. Throw hardware and memory at it using something like Spark
   multi-cluster with fully managed cloud service like Google Dataproc


In general, one would see order-of-magnitude performance gains.
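Point 1 can be illustrated outside Spark with a toy stand-in for an index: pay a one-off build cost, then answer each lookup in O(1) instead of rescanning every row (the ticker/price data below is invented for illustration):

```python
rows = [("IBM", 140), ("MSFT", 210), ("ORCL", 95)]

# Linear scan: O(n) work per lookup.
def scan(ticker):
    for t, price in rows:
        if t == ticker:
            return price
    return None

# Build the "index" once; every subsequent lookup is O(1).
index = {t: price for t, price in rows}

print(scan("MSFT"), index["MSFT"])  # 210 210
```

The same trade-off underlies database indexes and pre-aggregated cubes: spend work up front so each query does less.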


HTH,


Mich







On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:

> But when you have fairly large volume of data that is where spark comes in
> the party. And I assume the requirement of using spark is already
> established in the original question and the discussion is to use Python vs
> scala/java.
>
> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
> wrote:
>
>> If org has folks that can do python seriously why then spark in the first
>> place. You can do workflow on your own, streaming or batch or what ever you
>> want.
>> I would not do anything else aside from python, but that is me.
>>
>> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>>
>>> I have one observation: is "python udf is slow due to deserialization
>>> penalty" still relevant? Even after Arrow is used as in-memory data mgmt
>>> and so heavy investment from spark dev community on making pandas first
>>> class citizen including Udfs.
>>>
>>> As I work with multiple clients, my exp is org culture and available
>>> people are most imp driver for this choice regardless the use case. Use
>>> case is relevant only when there is a feature disparity
>>>
>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Not quite sure how meaningful this discussion is, but in case someone
 is really faced with this query the question still is 'what is the use
 case'?
 I am just a bit confused with the one size fits all deterministic
 approach here; I thought those days were over almost 10 years ago.
 Regards
 Gourav

 On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:

> I agree with Wim's assessment of data engineering / ETL vs Data
> Science. I wrote pipelines/frameworks for large companies and Scala was
> a much better choice. But for ad-hoc work interfacing directly with data
> science experiments pyspark presents less friction.
>
> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Many thanks everyone for their valuable contribution.
>>
>> We all started with Spark a few years ago where Scala was the talk
>> of the town. I agree with the note that as long as Spark stayed niche and
>> elite, then someone with Scala knowledge was attracting premiums. In
>> fairness in 2014-2015, there was not much talk of Data Science input (I 
>> may
>> be wrong). But the world has moved on so to speak. Python itself has been
>> around a long time (long being relative here). Most people either knew 
>> UNIX
>> Shell, C, Python or Perl or a combination of all these. I recall we had a
>> director a few years ago who asked our Hadoop admin for root password to
>> log in to the edge node. Later he became head of machine learning
>> somewhere else and he loved C and Python. So Python was a gift in 
>> disguise.
>> I think Python appeals to those who are very familiar with CLI and shell
>> programming (Not GUI fan). As some members alluded to there are more 
>> people
>> around with Python knowledge. Most managers choose Python as the unifying
>> development tool because they feel comfortable with it. Frankly I have 
>> not
>> seen a manager who feels at home with Scala. So in summary it is a bit
>> disappointing to abandon Scala and switch to Python just for the sake of 
>> it.
>>
>> Disclaimer: These are opinions and not facts so to speak :)
>>
>> Cheers,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark
>>> for ETL, for example processing data from S3 buckets into Snowflake with
>>> Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala
>>> is because they are 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread ayan guha
But when you have a fairly large volume of data, that is where Spark comes
into the party. And I assume the requirement of using Spark is already
established in the original question, and the discussion is whether to use
Python vs Scala/Java.

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski  wrote:

> If org has folks that can do python seriously why then spark in the first
> place. You can do workflow on your own, streaming or batch or what ever you
> want.
> I would not do anything else aside from python, but that is me.
>
> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>
>> I have one observation: is "python udf is slow due to deserialization
>> penalty" still relevant? Even after Arrow is used as in-memory data mgmt
>> and so heavy investment from spark dev community on making pandas first
>> class citizen including Udfs.
>>
>> As I work with multiple clients, my exp is org culture and available
>> people are most imp driver for this choice regardless the use case. Use
>> case is relevant only when there is a feature disparity
>>
>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Not quite sure how meaningful this discussion is, but in case someone is
>>> really faced with this query the question still is 'what is the use case'?
>>> I am just a bit confused with the one size fits all deterministic
>>> approach here; I thought those days were over almost 10 years ago.
>>> Regards
>>> Gourav
>>>
>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>>>
 I agree with Wim's assessment of data engineering / ETL vs Data
 Science. I wrote pipelines/frameworks for large companies and Scala was
 a much better choice. But for ad-hoc work interfacing directly with data
 science experiments pyspark presents less friction.

 On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Many thanks everyone for their valuable contribution.
>
> We all started with Spark a few years ago where Scala was the talk
> of the town. I agree with the note that as long as Spark stayed niche and
> elite, then someone with Scala knowledge was attracting premiums. In
> fairness in 2014-2015, there was not much talk of Data Science input (I 
> may
> be wrong). But the world has moved on so to speak. Python itself has been
> around a long time (long being relative here). Most people either knew 
> UNIX
> Shell, C, Python or Perl or a combination of all these. I recall we had a
> director a few years ago who asked our Hadoop admin for root password to
> log in to the edge node. Later he became head of machine learning
> somewhere else and he loved C and Python. So Python was a gift in 
> disguise.
> I think Python appeals to those who are very familiar with CLI and shell
> programming (Not GUI fan). As some members alluded to there are more 
> people
> around with Python knowledge. Most managers choose Python as the unifying
> development tool because they feel comfortable with it. Frankly I have not
> seen a manager who feels at home with Scala. So in summary it is a bit
> disappointing to abandon Scala and switch to Python just for the sake of 
> it.
>
> Disclaimer: These are opinions and not facts so to speak :)
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with 
>> Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala
>> is because they are more familiar with Python. Since Spark is written in
>> Scala, itself is an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark
>> with Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to
>> get some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
Thanks Ayan.

I am not qualified to answer your first point. However, my experience with
Spark with Scala or Spark with Python agrees with your assertion that use
cases do not come into it. Most DEV/OPS work dealing with ETL is provided
by service companies whose workforce is very familiar with Java, IntelliJ,
Maven and, latterly, Scala. Scala is their first choice: they create uber
jar files with IntelliJ and Maven on a MacBook and ship them into sandboxes
for continuous tests. I believe this will remain a trend for some time, as
considerable investment has already been made there. Then I came across
another consultancy tasked with getting raw files from S3 and putting them
into Snowflake; they wanted to use Spark with Python. So your mileage
varies.


Cheers,


Mich



On Sun, 11 Oct 2020 at 02:41, ayan guha  wrote:

> I have one observation: is "python udf is slow due to deserialization
> penalty" still relevant? Even after Arrow is used as in-memory data mgmt
> and so heavy investment from spark dev community on making pandas first
> class citizen including Udfs.
>
> As I work with multiple clients, my exp is org culture and available
> people are most imp driver for this choice regardless the use case. Use
> case is relevant only when there is a feature disparity
>
> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta 
> wrote:
>
>> Not quite sure how meaningful this discussion is, but in case someone is
>> really faced with this query the question still is 'what is the use case'?
>> I am just a bit confused with the one size fits all deterministic
>> approach here; I thought those days were over almost 10 years ago.
>> Regards
>> Gourav
>>
>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>>
>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>> Science. I wrote pipelines/frameworks for large companies and Scala was
>>> a much better choice. But for ad-hoc work interfacing directly with data
>>> science experiments pyspark presents less friction.
>>>
>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
>>> wrote:
>>>
 Many thanks everyone for their valuable contribution.

 We all started with Spark a few years ago where Scala was the talk
 of the town. I agree with the note that as long as Spark stayed niche and
 elite, then someone with Scala knowledge was attracting premiums. In
 fairness in 2014-2015, there was not much talk of Data Science input (I may
 be wrong). But the world has moved on so to speak. Python itself has been
 around a long time (long being relative here). Most people either knew UNIX
 Shell, C, Python or Perl or a combination of all these. I recall we had a
 director a few years ago who asked our Hadoop admin for root password to
 log in to the edge node. Later he became head of machine learning
 somewhere else and he loved C and Python. So Python was a gift in disguise.
 I think Python appeals to those who are very familiar with CLI and shell
 programming (Not GUI fan). As some members alluded to there are more people
 around with Python knowledge. Most managers choose Python as the unifying
 development tool because they feel comfortable with it. Frankly I have not
 seen a manager who feels at home with Scala. So in summary it is a bit
 disappointing to abandon Scala and switch to Python just for the sake of 
 it.

 Disclaimer: These are opinions and not facts so to speak :)

 Cheers,


 Mich






 On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
 wrote:

> I have come across occasions when the teams use Python with Spark for
> ETL, for example processing data from S3 buckets into Snowflake with 
> Spark.
>
> The only reason I think they are choosing Python as opposed to Scala
> is because they are more familiar with Python. Since Spark is written in
> Scala, itself is an indication of why I think Scala has an edge.
>
> I have not done one to one comparison of Spark with Scala vs Spark
> with Python. I understand for data science purposes most libraries like
> TensorFlow etc. are written in Python but I am at loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understanding but they are not facts so I would like to
> get some informed views on this if I can?
>
> Many thanks,
>
> Mich
>
>
>
>

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
I have one observation: is "Python UDFs are slow due to a deserialization
penalty" still relevant? Even now that Arrow is used for in-memory data
management, and after heavy investment from the Spark dev community in
making pandas a first-class citizen, including UDFs?

As I work with multiple clients, my experience is that org culture and the
people available are the most important drivers for this choice, regardless
of the use case. The use case is relevant only when there is a feature
disparity.
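On the deserialization question above, a loose stdlib analogy (no Spark or Arrow involved, just pickle): the old row-at-a-time Python UDF path pays serialization framing per value, whereas a columnar batch, roughly what Arrow-backed pandas UDFs provide, pays it once:

```python
import pickle

values = list(range(10_000))

# Row-at-a-time: each value serialized on its own, so the fixed
# framing overhead is paid per value.
per_row_bytes = sum(len(pickle.dumps(v)) for v in values)

# Columnar batch: the whole column serialized in one go.
batched_bytes = len(pickle.dumps(values))

# The fixed overhead is paid once instead of 10,000 times.
assert batched_bytes < per_row_bytes
print(per_row_bytes, batched_bytes)
```

This is only an analogy for the shape of the cost; the real win with Arrow also comes from avoiding Python-object conversion entirely, which a pickle example cannot show.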

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta 
wrote:

> Not quite sure how meaningful this discussion is, but in case someone is
> really faced with this query the question still is 'what is the use case'?
> I am just a bit confused with the one size fits all deterministic approach
> here; I thought those days were over almost 10 years ago.
> Regards
> Gourav
>
> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>
>> I agree with Wim's assessment of data engineering / ETL vs Data Science.
>>   I wrote pipelines/frameworks for large companies and Scala was a much
>> better choice. But for ad-hoc work interfacing directly with data science
>> experiments pyspark presents less friction.
>>
>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
>> wrote:
>>
>>> Many thanks everyone for their valuable contribution.
>>>
>>> We all started with Spark a few years ago where Scala was the talk
>>> of the town. I agree with the note that as long as Spark stayed niche and
>>> elite, then someone with Scala knowledge was attracting premiums. In
>>> fairness in 2014-2015, there was not much talk of Data Science input (I may
>>> be wrong). But the world has moved on so to speak. Python itself has been
>>> around a long time (long being relative here). Most people either knew UNIX
>>> Shell, C, Python or Perl or a combination of all these. I recall we had a
>>> director a few years ago who asked our Hadoop admin for root password to
>>> log in to the edge node. Later he became head of machine learning
>>> somewhere else and he loved C and Python. So Python was a gift in disguise.
>>> I think Python appeals to those who are very familiar with CLI and shell
>>> programming (Not GUI fan). As some members alluded to there are more people
>>> around with Python knowledge. Most managers choose Python as the unifying
>>> development tool because they feel comfortable with it. Frankly I have not
>>> seen a manager who feels at home with Scala. So in summary it is a bit
>>> disappointing to abandon Scala and switch to Python just for the sake of it.
>>>
>>> Disclaimer: These are opinions and not facts so to speak :)
>>>
>>> Cheers,
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
>>> wrote:
>>>
 I have come across occasions when the teams use Python with Spark for
 ETL, for example processing data from S3 buckets into Snowflake with Spark.

 The only reason I think they are choosing Python as opposed to Scala is
 because they are more familiar with Python. Since Spark is written in
 Scala, itself is an indication of why I think Scala has an edge.

 I have not done one to one comparison of Spark with Scala vs Spark with
 Python. I understand for data science purposes most libraries like
 TensorFlow etc. are written in Python but I am at loss to understand the
 validity of using Python with Spark for ETL purposes.

 These are my understanding but they are not facts so I would like to
 get some informed views on this if I can?

 Many thanks,

 Mich







>>> --
Best Regards,
Ayan Guha


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone is
really faced with this query, the question still is 'what is the use case?'
I am just a bit confused by the one-size-fits-all deterministic approach
here; I thought those days were over almost 10 years ago.
Regards
Gourav

On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:

> I agree with Wim's assessment of data engineering / ETL vs Data Science.
>   I wrote pipelines/frameworks for large companies and Scala was a much
> better choice. But for ad-hoc work interfacing directly with data science
> experiments pyspark presents less friction.
>
> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
> wrote:
>
>> Many thanks everyone for their valuable contribution.
>>
>> We all started with Spark a few years ago where Scala was the talk of the
>> town. I agree with the note that as long as Spark stayed niche and elite,
>> then someone with Scala knowledge was attracting premiums. In fairness in
>> 2014-2015, there was not much talk of Data Science input (I may be wrong).
>> But the world has moved on so to speak. Python itself has been around
>> a long time (long being relative here). Most people either knew UNIX Shell,
>> C, Python or Perl or a combination of all these. I recall we had a director
>> a few years ago who asked our Hadoop admin for root password to log in to
>> the edge node. Later he became head of machine learning somewhere else and
>> he loved C and Python. So Python was a gift in disguise. I think Python
>> appeals to those who are very familiar with CLI and shell programming (Not
>> GUI fan). As some members alluded to there are more people around with
>> Python knowledge. Most managers choose Python as the unifying development
>> tool because they feel comfortable with it. Frankly I have not seen a
>> manager who feels at home with Scala. So in summary it is a bit
>> disappointing to abandon Scala and switch to Python just for the sake of it.
>>
>> Disclaimer: These are opinions and not facts so to speak :)
>>
>> Cheers,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> because they are more familiar with Python. Since Spark is written in
>>> Scala, itself is an indication of why I think Scala has an edge.
>>>
>>> I have not done one to one comparison of Spark with Scala vs Spark with
>>> Python. I understand for data science purposes most libraries like
>>> TensorFlow etc. are written in Python but I am at loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> These are my understanding but they are not facts so I would like to get
>>> some informed views on this if I can?
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science.
I wrote pipelines/frameworks for large companies and Scala was a much
better choice. But for ad-hoc work interfacing directly with data science
experiments, PySpark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
wrote:

> Many thanks everyone for their valuable contribution.
>
> We all started with Spark a few years ago where Scala was the talk of the
> town. I agree with the note that as long as Spark stayed niche and elite,
> then someone with Scala knowledge was attracting premiums. In fairness in
> 2014-2015, there was not much talk of Data Science input (I may be wrong).
> But the world has moved on so to speak. Python itself has been around
> a long time (long being relative here). Most people either knew UNIX Shell,
> C, Python or Perl or a combination of all these. I recall we had a director
> a few years ago who asked our Hadoop admin for root password to log in to
> the edge node. Later he became head of machine learning somewhere else and
> he loved C and Python. So Python was a gift in disguise. I think Python
> appeals to those who are very familiar with CLI and shell programming (Not
> GUI fan). As some members alluded to there are more people around with
> Python knowledge. Most managers choose Python as the unifying development
> tool because they feel comfortable with it. Frankly I have not seen a
> manager who feels at home with Scala. So in summary it is a bit
> disappointing to abandon Scala and switch to Python just for the sake of it.
>
> Disclaimer: These are opinions and not facts so to speak :)
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala is
>> because they are more familiar with Python. Since Spark is written in
>> Scala, itself is an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark with
>> Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to get
>> some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
Many thanks everyone for their valuable contribution.

We all started with Spark a few years ago, when Scala was the talk of the
town. I agree with the note that as long as Spark stayed niche and elite,
someone with Scala knowledge attracted a premium. In fairness, in
2014-2015 there was not much talk of Data Science input (I may be wrong).
But the world has moved on, so to speak. Python itself has been around
a long time (long being relative here). Most people knew UNIX shell,
C, Python or Perl, or a combination of these. I recall a director
a few years ago who asked our Hadoop admin for the root password to log in to
the edge node. Later he became head of machine learning somewhere else, and
he loved C and Python. So Python was a blessing in disguise. I think Python
appeals to those who are very familiar with the CLI and shell programming (not
GUI fans). As some members alluded to, there are more people around with
Python knowledge. Many managers choose Python as the unifying development
tool because they feel comfortable with it; frankly, I have not seen a
manager who feels at home with Scala. So, in summary, it is a bit
disappointing to abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts, so to speak :)

Cheers,


Mich






On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
wrote:

> I have come across occasions when the teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> because they are more familiar with Python. Since Spark is written in
> Scala, itself is an indication of why I think Scala has an edge.
>
> I have not done one to one comparison of Spark with Scala vs Spark with
> Python. I understand for data science purposes most libraries like
> TensorFlow etc. are written in Python but I am at loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understanding but they are not facts so I would like to get
> some informed views on this if I can?
>
> Many thanks,
>
> Mich
>
>
>
>
>
>
>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
I would not leave it to data scientists unless they will maintain it.

The key decision in the cases I've seen was usually people
cost/availability, with ETL operations cost taken into account.

Often the situation is that the ETL cloud cost is small and you will not
save much, so it comes down to skills cost/availability.
For Python skills you pay less, you can pick people with other
useful skills, and you can more easily train the people you already have
internally.

Often you have some simple ETL scripts before moving to Spark, and
those scripts are usually written in Python.

Best Regards,

Jacek


sob., 10 paź 2020 o 12:32 Jörn Franke  napisał(a):
>
> It really depends on what your data scientists talk. I don’t think it makes 
> sense for ad hoc data science things to impose a language on them, but let 
> them choose.
> For more complex AI engineering things you can though apply different 
> standards and criteria. And then it really depends on architecture aspects 
> etc.
>
> Am 09.10.2020 um 22:57 schrieb Mich Talebzadeh :
>
> 
> I have come across occasions when the teams use Python with Spark for ETL, 
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is 
> because they are more familiar with Python. Since Spark is written in Scala, 
> itself is an indication of why I think Scala has an edge.
>
> I have not done one to one comparison of Spark with Scala vs Spark with 
> Python. I understand for data science purposes most libraries like TensorFlow 
> etc. are written in Python but I am at loss to understand the validity of 
> using Python with Spark for ETL purposes.
>
> These are my understanding but they are not facts so I would like to get some 
> informed views on this if I can?
>
> Many thanks,
>
> Mich
>
>
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
It really depends on which language your data scientists speak. I don't think
it makes sense to impose a language on them for ad hoc data science work; let
them choose.
For more complex AI engineering you can, though, apply different standards
and criteria. And then it really depends on architecture aspects etc.

> Am 09.10.2020 um 22:57 schrieb Mich Talebzadeh :
> 
> 
> I have come across occasions when the teams use Python with Spark for ETL, 
> for example processing data from S3 buckets into Snowflake with Spark.
> 
> The only reason I think they are choosing Python as opposed to Scala is 
> because they are more familiar with Python. Since Spark is written in Scala, 
> itself is an indication of why I think Scala has an edge.
> 
> I have not done one to one comparison of Spark with Scala vs Spark with 
> Python. I understand for data science purposes most libraries like TensorFlow 
> etc. are written in Python but I am at loss to understand the validity of 
> using Python with Spark for ETL purposes.
> 
> These are my understanding but they are not facts so I would like to get some 
> informed views on this if I can?
> 
> Many thanks,
> 
> Mich
> 
> 
> 
> 
>  


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich,

This is a very fair question .. I've seen many data engineering teams start
out with Scala because technically it is the best choice for many given
reasons and basically it is what Spark is.

On the other hand, almost all use cases we see these days are data science
use cases where people mostly do Python. So, if you need those two worlds to
collaborate, and even hand over code, you don't want the ideological battle
of Scala vs Python. We chose Python for the sake of everybody speaking the
same language.

But it is true: if you stick to Spark DataFrames, PySpark is just a thin
layer around everything on the JVM. Even the argument about Python UDFs
doesn't hold up. If it works as a Python function (and most of the time it
does), why do Scala? If, however, performance characteristics show you
otherwise, implement those UDFs on the JVM.

The problem with Python? Good engineering practices translated into tools are
much rarer ... a build tool like Maven for Java or SBT for Scala doesn't
exist ... yet? You can look at PyBuilder for this.

So, referring to the website you mention ... in practice, because of the
many data science use cases out there, I see many Spark shops prefer Python
over Scala, because Spark gravitates to DataFrames, where the downsides of
Python do not stack up. The performance of Python as a driver program, which
is just the glue code, becomes irrelevant compared to the processing you are
doing on the JVM. We even notice that Python is much easier, and we hear
echoes that finding (good?) Scala engineers is hard(er).

So, conclusion: Python brings data engineers and data science together. If
you only do data engineering, Scala can be the better choice. It depends on
the context.

Hope this helps
-wim

On Fri, 9 Oct 2020 at 23:27, Mich Talebzadeh 
wrote:

> Thanks
>
> So ignoring Python lambdas is it a matter of individuals familiarity with
> the language that is the most important factor? Also I have noticed that
> Spark document preferences have been switched from Scala to Python as the
> first example. However, some codes for example JDBC calls are the same for
> Scala and Python.
>
> Some examples like this website
> <https://www.kdnuggets.com/2018/05/apache-spark-python-scala.html#:~:text=Scala%20is%20frequently%20over%2010,languages%20are%20faster%20than%20interpreted.>
> claim that Scala performance is an order of magnitude better than Python
> and also when it comes to concurrency Scala is a better choice. Maybe it is
> pretty old (2018)?
>
> Also (and may be my ignorance I have not researched it) does Spark offer
> REPL in the form of spark-shell with Python?
>
>
> Regards,
>
> Mich
>
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
> wrote:
>
>> As long as you don't use python lambdas in your Spark job there should be
>> almost no difference between the Scala and Python dataframe code. Once you
>> introduce python lambdas you will hit some significant serialization
>> penalties as well as have to run actual work code in python. As long as no
>> lambdas are used, everything will operate with Catalyst compiled java code
>> so there won't be a big difference between python and scala.
>>
>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> because they are more familiar with Python. Since Spark is written in
>>> Scala, itself is an indication of why I think Scala has an edge.
>>>
>>> I have not done one to one comparison of Spark with Scala vs Spark with
>>> Python. I understand for data science purposes most libraries like
>>> TensorFlow etc. are written in Python but I am at loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> These are my understanding but they are not facts so I would like to get
>>> some informed views on this if I can?
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
What is the use case?
Unless you have unlimited funding and time to waste you would usually start
with that.

Regards,
Gourav

On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer 
wrote:

> Spark in Scala (or Java) is much more performant if you are using RDDs,
> those operations basically force you to pass lambdas, hit serialization
> between java and python types and yes hit the Global Interpreter Lock. But,
> none of those things apply to Data Frames which will generate Java code
> regardless of what language you use to describe the Dataframe operations as
> long as you don't use python lambdas. A Dataframe operation without python
> lambdas should not require any remote python code execution.
>
> TLDR, If you are using Dataframes it doesn't matter if you use Scala,
> Java, Python, R, SQL, the planning and work will all happen in the JVM.
>
> As for a repl, you can run PySpark which will start up a repl. There are
> also a slew of notebooks which provide interactive python environments as
> well.
>
>
> On Fri, Oct 9, 2020 at 4:19 PM Mich Talebzadeh 
> wrote:
>
>> Thanks
>>
>> So ignoring Python lambdas is it a matter of individuals familiarity with
>> the language that is the most important factor? Also I have noticed that
>> Spark document preferences have been switched from Scala to Python as the
>> first example. However, some codes for example JDBC calls are the same for
>> Scala and Python.
>>
>> Some examples like this website
>> 
>> claim that Scala performance is an order of magnitude better than Python
>> and also when it comes to concurrency Scala is a better choice. Maybe it is
>> pretty old (2018)?
>>
>> Also (and may be my ignorance I have not researched it) does Spark offer
>> REPL in the form of spark-shell with Python?
>>
>>
>> Regards,
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
>> wrote:
>>
>>> As long as you don't use python lambdas in your Spark job there should
>>> be almost no difference between the Scala and Python dataframe code. Once
>>> you introduce python lambdas you will hit some significant serialization
>>> penalties as well as have to run actual work code in python. As long as no
>>> lambdas are used, everything will operate with Catalyst compiled java code
>>> so there won't be a big difference between python and scala.
>>>
>>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 I have come across occasions when the teams use Python with Spark for
 ETL, for example processing data from S3 buckets into Snowflake with Spark.

 The only reason I think they are choosing Python as opposed to Scala is
 because they are more familiar with Python. Since Spark is written in
 Scala, itself is an indication of why I think Scala has an edge.

 I have not done one to one comparison of Spark with Scala vs Spark with
 Python. I understand for data science purposes most libraries like
 TensorFlow etc. are written in Python but I am at loss to understand the
 validity of using Python with Spark for ETL purposes.

 These are my understanding but they are not facts so I would like to
 get some informed views on this if I can?

 Many thanks,

 Mich







>>>


Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
Spark in Scala (or Java) is much more performant if you are using RDDs,
those operations basically force you to pass lambdas, hit serialization
between java and python types and yes hit the Global Interpreter Lock. But,
none of those things apply to Data Frames which will generate Java code
regardless of what language you use to describe the Dataframe operations as
long as you don't use python lambdas. A Dataframe operation without python
lambdas should not require any remote python code execution.

TLDR, If you are using Dataframes it doesn't matter if you use Scala, Java,
Python, R, SQL, the planning and work will all happen in the JVM.

As for a repl, you can run PySpark which will start up a repl. There are
also a slew of notebooks which provide interactive python environments as
well.


On Fri, Oct 9, 2020 at 4:19 PM Mich Talebzadeh 
wrote:

> Thanks
>
> So ignoring Python lambdas is it a matter of individuals familiarity with
> the language that is the most important factor? Also I have noticed that
> Spark document preferences have been switched from Scala to Python as the
> first example. However, some codes for example JDBC calls are the same for
> Scala and Python.
>
> Some examples like this website
> 
> claim that Scala performance is an order of magnitude better than Python
> and also when it comes to concurrency Scala is a better choice. Maybe it is
> pretty old (2018)?
>
> Also (and may be my ignorance I have not researched it) does Spark offer
> REPL in the form of spark-shell with Python?
>
>
> Regards,
>
> Mich
>
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
> wrote:
>
>> As long as you don't use python lambdas in your Spark job there should be
>> almost no difference between the Scala and Python dataframe code. Once you
>> introduce python lambdas you will hit some significant serialization
>> penalties as well as have to run actual work code in python. As long as no
>> lambdas are used, everything will operate with Catalyst compiled java code
>> so there won't be a big difference between python and scala.
>>
>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> because they are more familiar with Python. Since Spark is written in
>>> Scala, itself is an indication of why I think Scala has an edge.
>>>
>>> I have not done one to one comparison of Spark with Scala vs Spark with
>>> Python. I understand for data science purposes most libraries like
>>> TensorFlow etc. are written in Python but I am at loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> These are my understanding but they are not facts so I would like to get
>>> some informed views on this if I can?
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>


Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
Thanks

So, ignoring Python lambdas, is an individual's familiarity with the
language the most important factor? I have also noticed that the Spark
documentation now shows Python before Scala as the first example, although
some code, for example JDBC calls, is the same for Scala and Python.

Some examples, like this website,
claim that Scala performance is an order of magnitude better than Python's,
and that when it comes to concurrency Scala is the better choice. Maybe that
is pretty old (2018)?

Also (and maybe it is my ignorance, as I have not researched it), does Spark
offer a REPL in the form of spark-shell for Python?


Regards,

Mich







On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
wrote:

> As long as you don't use python lambdas in your Spark job there should be
> almost no difference between the Scala and Python dataframe code. Once you
> introduce python lambdas you will hit some significant serialization
> penalties as well as have to run actual work code in python. As long as no
> lambdas are used, everything will operate with Catalyst compiled java code
> so there won't be a big difference between python and scala.
>
> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala is
>> because they are more familiar with Python. Since Spark is written in
>> Scala, itself is an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark with
>> Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to get
>> some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>


Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
As long as you don't use python lambdas in your Spark job there should be
almost no difference between the Scala and Python dataframe code. Once you
introduce python lambdas you will hit some significant serialization
penalties as well as have to run actual work code in python. As long as no
lambdas are used, everything will operate with Catalyst compiled java code
so there won't be a big difference between python and scala.

On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
wrote:

> I have come across occasions when the teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> because they are more familiar with Python. Since Spark is written in
> Scala, itself is an indication of why I think Scala has an edge.
>
> I have not done one to one comparison of Spark with Scala vs Spark with
> Python. I understand for data science purposes most libraries like
> TensorFlow etc. are written in Python but I am at loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understanding but they are not facts so I would like to get
> some informed views on this if I can?
>
> Many thanks,
>
> Mich
>
>
>
>
>
>
>


Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
I have come across occasions when the teams use Python with Spark for ETL,
for example processing data from S3 buckets into Snowflake with Spark.

The only reason I think they are choosing Python as opposed to Scala is
because they are more familiar with Python. The fact that Spark itself is
written in Scala is an indication of why I think Scala has an edge.

I have not done a one-to-one comparison of Spark with Scala vs Spark with
Python. I understand that for data science purposes most libraries like
TensorFlow are written in Python, but I am at a loss to understand the
validity of using Python with Spark for ETL purposes.

This is my understanding rather than established fact, so I would like to
get some informed views on this if I can.

Many thanks,

Mich






Re: Scala Vs Python

2016-09-06 Thread 刘虓
Hi,
I have been using Spark SQL with Python for more than one year, from version
1.5.0 to version 2.0.0.
It has worked great so far; the performance is always good, though I have not
done a benchmark yet.
I have also skimmed through the source code of the Python API; most of it
only calls the Scala API, and nothing heavy is done in Python.


2016-09-06 18:38 GMT+08:00 Leonard Cohen <3498363...@qq.com>:

> hi spark user,
>
> IMHO, I will use the language for application aligning with the language
> under which the system designed.
>
> If working on Spark, I choose Scala.
> If working on Hadoop, I choose Java.
> If working on nothing, I use Python.
> Why?
> Because it will save my life, just kidding.
>
>
> Best regards,
> Leonard
> -- Original --
> *From: * "Luciano Resende";<luckbr1...@gmail.com>;
> *Send time:* Tuesday, Sep 6, 2016 8:07 AM
> *To:* "darren"<dar...@ontrenet.com>;
> *Cc:* "Mich Talebzadeh"<mich.talebza...@gmail.com>; "Jakob Odersky"<
> ja...@odersky.com>; "ayan guha"<guha.a...@gmail.com>; "kant kodali"<
> kanth...@gmail.com>; "AssafMendelson"<assaf.mendel...@rsa.com>; "user"<
> user@spark.apache.org>;
> *Subject: * Re: Scala Vs Python
>
>
>
> On Thu, Sep 1, 2016 at 3:15 PM, darren <dar...@ontrenet.com> wrote:
>
>> This topic is a concern for us as well. In the data science world no one
>> uses native scala or java by choice. It's R and Python. And python is
>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>
>> This is why we have decoupled from spark in our project. It's really
>> unfortunate the Spark team has invested so heavily in Scala.
>>
>> As for speed, it comes from horizontal scaling and throughput. When you
>> can scale outward, individual VM performance is less of an issue. Basic HPC
>> principles.
>>
>>
> You could still try to get the best of both worlds, having your data
> scientists write their algorithms in Python and/or R and letting a
> compiler/optimizer handle the optimizations to run in a distributed
> fashion on a Spark cluster, leveraging some of the low-level APIs written in
> java/scala. Take a look at Apache SystemML http://systemml.apache.org/
> for more details.
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Scala Vs Python

2016-09-06 Thread Leonard Cohen
hi spark user,


IMHO, I will write the application in the language the system itself was 
designed in.


If working on Spark, I choose Scala.
If working on Hadoop, I choose Java.
If working on nothing, I use Python.
Why?
Because it will save my life, just kidding.




Best regards,
Leonard
-- Original --
From:  "Luciano Resende";<luckbr1...@gmail.com>;
Send time: Tuesday, Sep 6, 2016 8:07 AM
To: "darren"<dar...@ontrenet.com>; 
Cc: "Mich Talebzadeh"<mich.talebza...@gmail.com>; "Jakob 
Odersky"<ja...@odersky.com>; "ayan guha"<guha.a...@gmail.com>; "kant 
kodali"<kanth...@gmail.com>; "AssafMendelson"<assaf.mendel...@rsa.com>; 
"user"<user@spark.apache.org>; 
Subject:  Re: Scala Vs Python





On Thu, Sep 1, 2016 at 3:15 PM, darren <dar...@ontrenet.com> wrote:
This topic is a concern for us as well. In the data science world no one uses 
native scala or java by choice. It's R and Python. And python is growing. Yet 
in spark, python is 3rd in line for feature support, if at all.


This is why we have decoupled from spark in our project. It's really 
unfortunate the Spark team has invested so heavily in Scala.


As for speed, it comes from horizontal scaling and throughput. When you can 
scale outward, individual VM performance is less of an issue. Basic HPC principles.





You could still try to get the best of both worlds, having your data scientists 
write their algorithms in Python and/or R and letting a compiler/optimizer 
handle the optimizations to run in a distributed fashion on a Spark cluster, 
leveraging some of the low-level APIs written in java/scala. Take a look at 
Apache SystemML http://systemml.apache.org/ for more details.




-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Scala Vs Python

2016-09-05 Thread Luciano Resende
On Thu, Sep 1, 2016 at 3:15 PM, darren  wrote:

> This topic is a concern for us as well. In the data science world no one
> uses native scala or java by choice. It's R and Python. And python is
> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>
> This is why we have decoupled from spark in our project. It's really
> unfortunate the Spark team has invested so heavily in Scala.
>
> As for speed, it comes from horizontal scaling and throughput. When you can
> scale outward, individual VM performance is less of an issue. Basic HPC
> principles.
>
>
You could still try to get the best of both worlds, having your data
scientists write their algorithms in Python and/or R and letting a
compiler/optimizer handle the optimizations to run in a distributed
fashion on a Spark cluster, leveraging some of the low-level APIs written in
java/scala. Take a look at Apache SystemML http://systemml.apache.org/ for
more details.



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Scala Vs Python

2016-09-05 Thread Gourav Sengupta
The pertinent question is between "functional programming" and procedural
or OOP styles.

I think when you are dealing with data solutions, functional programming is
a more natural way to think and work.
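A small, Spark-free illustration of that point in plain Python (the `orders` data is made up): the functional style expresses the result as a fold over immutable inputs, which is the shape Spark's RDD and DataFrame APIs encourage, while the procedural style mutates accumulators in place.

```python
# Functional vs procedural style for the same aggregation
# (plain Python; illustrative only, no Spark involved).
from functools import reduce

orders = [{"country": "DE", "amount": 10.0},
          {"country": "DE", "amount": 5.0},
          {"country": "FR", "amount": 7.5}]

# Procedural style: mutate an accumulator in a loop.
totals_proc = {}
for o in orders:
    totals_proc[o["country"]] = totals_proc.get(o["country"], 0.0) + o["amount"]

# Functional style: the result is a fold over immutable inputs -- the
# shape that Spark's map/filter/reduceByKey and groupBy/agg encourage.
totals_func = reduce(
    lambda acc, o: {**acc, o["country"]: acc.get(o["country"], 0.0) + o["amount"]},
    orders,
    {},
)

assert totals_proc == totals_func == {"DE": 15.0, "FR": 7.5}
```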


Regards,
Gourav

On Sun, Sep 4, 2016 at 11:17 AM, AssafMendelson <assaf.mendel...@rsa.com>
wrote:

> I don’t have anything on hand (unfortunately I didn’t really save it)
> but you can easily make some toy examples.
>
> For example you might do something like defining a simple UDF (e.g. test
> if number < 10)
>
> Then create the function in scala:
>
>
>
> package com.example
>
> import org.apache.spark.sql.functions.udf
>
>
>
> object udfObj extends Serializable {
>
>   def createUDF = {
>
>     udf((x: Int) => x < 10)
>
>   }
>
> }
>
>
>
> Compile the scala and run pyspark with --jars --driver-class-path on the
> created jar.
>
> Inside pyspark do something like:
>
>
>
> from py4j.java_gateway import java_import
>
> from pyspark.sql.column import Column
>
> from pyspark.sql.functions import udf
>
> from pyspark.sql.types import BooleanType
>
> import time
>
>
>
> jvm = sc._gateway.jvm
>
> java_import(jvm, "com.example")
>
> def udf_scala(col):
>
> return Column(jvm.com.example.udfObj.createUDF().apply(col))
>
>
>
> udf_python = udf(lambda x: x<10, BooleanType())
>
>
>
> df = spark.range(1000)
>
> df.cache()
>
> df.count()
>
>
>
> df1 = df.filter(df.id < 10)
>
> df2 = df.filter(udf_scala(df.id))
>
> df3 = df.filter(udf_python(df.id))
>
>
>
> t1 = time.time()
>
> df1.count()
>
> t2 = time.time()
>
> df2.count()
>
> t3 = time.time()
>
> df3.count()
>
> t4 = time.time()
>
>
>
> print("time for builtin " + str(t2 - t1))
>
> print("time for scala " + str(t3 - t2))
>
> print("time for python " + str(t4 - t3))
>
>
>
>
>
>
>
> The differences between the times should give you how long it takes (note
> the caching is done in order to make sure we don’t have issues where the
> range is created once and then reused) .
>
> BTW, I saw this can be very touchy in terms of the cluster and its
> configuration. I ran it on two different cluster configurations and ran it
> several times to get some idea on the noise.
>
> Of course, the more complicated the UDF, the less the overhead affects you.
>
> Hope this helps.
>
> Assaf
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Sunday, September 04, 2016 11:00 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: Scala Vs Python
>
>
>
> Hi
>
>
>
> This one is quite interesting. Is it possible to share few toy examples?
>
>
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <[hidden email]> wrote:
>
> I am not aware of any official testing but you can easily create your own.
>
> In tests I ran, I saw that Python UDFs were more than 10 times slower
> than Scala UDFs (and in some cases it was closer to 50 times slower).
>
> That said, it would depend on how you use your UDF.
>
> For example, lets say you have a 1 billion row table which you do some
> aggregation on and left with a 10K rows table. If you do the python UDF in
> the beginning then it might have a hard hit but if you do it on the 10K
> rows table then the overhead might be negligible.
>
> Furthermore, you can always write the UDF in scala and wrap it.
>
> This is something my team did. We have data scientists working on spark in
> python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF they simply ask my
> team (which does the engineering) and we write them a scala one and then
> wrap it to be accessible from python.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
>
>
> Thanks All for your replies.
>
>
>
> Feature Parity:
>
>
>
> MLLib, RDD and dataframes features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>
>
>
> Perfor

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I don’t have anything on hand (unfortunately I didn’t really save it), but 
you can easily make some toy examples.
For example you might do something like defining a simple UDF (e.g. test if 
number < 10)
Then create the function in scala:

package com.example
import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
    udf((x: Int) => x < 10)
  }
}

Compile the scala and run pyspark with --jars --driver-class-path on the 
created jar.
Inside pyspark do something like:


from py4j.java_gateway import java_import

from pyspark.sql.column import Column

from pyspark.sql.functions import udf

from pyspark.sql.types import BooleanType

import time



jvm = sc._gateway.jvm

java_import(jvm, "com.example")

def udf_scala(col):

return Column(jvm.com.example.udfObj.createUDF().apply(col))



udf_python = udf(lambda x: x<10, BooleanType())



df = spark.range(1000)

df.cache()

df.count()



df1 = df.filter(df.id < 10)

df2 = df.filter(udf_scala(df.id))

df3 = df.filter(udf_python(df.id))



t1 = time.time()

df1.count()

t2 = time.time()

df2.count()

t3 = time.time()

df3.count()

t4 = time.time()



print("time for builtin " + str(t2 - t1))

print("time for scala " + str(t3 - t2))

print("time for python " + str(t4 - t3))






The differences between the times should tell you how long each approach takes 
(note the caching is done to make sure we don’t have issues where the range is 
created once and then reused).
BTW, I saw this can be very sensitive to the cluster and its configuration. I 
ran it on two different cluster configurations, and ran it several times to 
get some idea of the noise.
Of course, the more complicated the UDF, the less the overhead affects you.
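As a rough, Spark-free analogy of that fixed per-call overhead (plain Python, purely illustrative, not a Spark benchmark): a built-in whose loop runs in C makes one call in total, while applying a Python function per element pays the interpreter's call cost on every row. That per-row cost is exactly what shrinks in relative terms as the UDF body gets heavier.

```python
# Rough analogy of per-row UDF overhead (plain Python; not a Spark benchmark).
import time

data = list(range(1_000_000))

# Built-in: the summation loop runs in C -- one call in total.
t0 = time.time()
total_builtin = sum(data)
t1 = time.time()

# Per-element Python function: one interpreter call per row, analogous to
# the per-row invocation cost a Python UDF pays.
def add(acc, x):
    return acc + x

t2 = time.time()
total_per_call = 0
for x in data:
    total_per_call = add(total_per_call, x)
t3 = time.time()

assert total_builtin == total_per_call
print("builtin: %.4fs  per-call: %.4fs" % (t1 - t0, t3 - t2))
```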
Hope this helps.
Assaf









From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Sunday, September 04, 2016 11:00 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: Scala Vs Python

Hi

This one is quite interesting. Is it possible to share few toy examples?

On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson 
<assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com>> wrote:
I am not aware of any official testing but you can easily create your own.
In tests I ran, I saw that Python UDFs were more than 10 times slower than 
Scala UDFs (and in some cases it was closer to 50 times slower).
That said, it would depend on how you use your UDF.
For example, lets say you have a 1 billion row table which you do some 
aggregation on and left with a 10K rows table. If you do the python UDF in the 
beginning then it might have a hard hit but if you do it on the 10K rows table 
then the overhead might be negligible.
Furthermore, you can always write the UDF in scala and wrap it.
This is something my team did. We have data scientists working on spark in 
python. Normally, they can use the existing functions to do what they need 
(Spark already has a pretty nice spread of functions which answer most of the 
common use cases). When they need a new UDF or UDAF they simply ask my team 
(which does the engineering) and we write them a scala one and then wrap it to 
be accessible from python.


From: ayan guha [mailto:[hidden email]]
Sent: Friday, September 02, 2016 12:21 AM
To: kant kodali
Cc: Mendelson, Assaf; user
Subject: Re: Scala Vs Python

Thanks All for your replies.

Feature Parity:

MLlib, RDD and DataFrame features are totally comparable. Streaming is now at 
par in functionality too, I believe. However, what really worries me is not 
having the Dataset API at all in Python. I think that's a deal breaker.

Performance:
I do get this bit when RDDs are involved, but not when DataFrames are the only 
construct I am operating on. DataFrames are supposed to be language-agnostic in 
terms of performance. So why do people think Python is slower? Is it because of 
using UDFs? Any other reason?

Is there any kind of benchmarking/stats around a Python UDF vs Scala UDF 
comparison, like the ones out there for RDDs?
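For what it's worth, a schematic, Spark-free sketch of where that Python UDF cost comes from. This illustrates the mechanism only, not Spark's actual implementation (which batches rows and streams them between the JVM and Python workers over sockets): each value crosses the JVM/Python boundary, is evaluated in the Python VM, and the result is serialized back. A Scala UDF runs entirely inside the JVM and skips both serialization steps.

```python
# Schematic of the round trip a Python UDF pays (mechanism illustration
# only -- real Spark batches rows and ships them over sockets).
import pickle

def python_udf(x):
    return x < 10

def evaluate_python_udf(rows, udf):
    payload = pickle.dumps(rows)                 # JVM -> Python worker
    decoded = pickle.loads(payload)
    results = [udf(x) for x in decoded]          # evaluated in the Python VM
    return pickle.loads(pickle.dumps(results))   # Python worker -> JVM

print(evaluate_python_udf([3, 42, 7], python_udf))  # [True, False, True]
```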

@Kant:  I am not comparing ANY applications. I am comparing SPARK applications 
only. I would be glad to hear your opinion on why pyspark applications will not 
work, if you have any benchmarks please share if possible.





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases or 
Large Scale Distributed Systems makes absolutely no sense. I can write a 10 
page essay on why that wouldn't work so great. you might be wondering why would 
spark have it then? well probably because its ease of use for ML (that would be 
my best guess).





On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:

I believe this would greatly dep

Re: Scala Vs Python

2016-09-04 Thread Simon Edelhaus
Any thoughts about Spark and Erlang?


-- ttfn
Simon Edelhaus
California 2016

On Sun, Sep 4, 2016 at 1:00 AM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> This one is quite interesting. Is it possible to share few toy examples?
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <assaf.mendel...@rsa.com>
> wrote:
>
>> I am not aware of any official testing but you can easily create your own.
>>
>> In tests I ran, I saw that Python UDFs were more than 10 times slower
>> than Scala UDFs (and in some cases it was closer to 50 times slower).
>>
>> That said, it would depend on how you use your UDF.
>>
>> For example, lets say you have a 1 billion row table which you do some
>> aggregation on and left with a 10K rows table. If you do the python UDF in
>> the beginning then it might have a hard hit but if you do it on the 10K
>> rows table then the overhead might be negligible.
>>
>> Furthermore, you can always write the UDF in scala and wrap it.
>>
>> This is something my team did. We have data scientists working on spark
>> in python. Normally, they can use the existing functions to do what they
>> need (Spark already has a pretty nice spread of functions which answer most
>> of the common use cases). When they need a new UDF or UDAF they simply ask
>> my team (which does the engineering) and we write them a scala one and then
>> wrap it to be accessible from python.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]]
>> *Sent:* Friday, September 02, 2016 12:21 AM
>> *To:* kant kodali
>> *Cc:* Mendelson, Assaf; user
>> *Subject:* Re: Scala Vs Python
>>
>>
>>
>> Thanks All for your replies.
>>
>>
>>
>> Feature Parity:
>>
>>
>>
>> MLLib, RDD and dataframes features are totally comparable. Streaming is
>> now at par in functionality too, I believe. However, what really worries me
>> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>>
>>
>>
>> Performance:
>>
>> I do  get this bit when RDDs are involved, but not when Data frame is the
>> only construct I am operating on.  Dataframe supposed to be
>> language-agnostic in terms of performance.  So why people think python is
>> slower? is it because of using UDF? Any other reason?
>>
>>
>>
>> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
>> comparison? like the one out there  b/w RDDs.*
>>
>>
>>
>> @Kant:  I am not comparing ANY applications. I am comparing SPARK
>> applications only. I would be glad to hear your opinion on why pyspark
>> applications will not work, if you have any benchmarks please share if
>> possible.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
>>
>> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code
>> Bases or Large Scale Distributed Systems makes absolutely no sense. I can
>> write a 10 page essay on why that wouldn't work so great. you might be
>> wondering why would spark have it then? well probably because its ease of
>> use for ML (that would be my best guess).
>>
>>
>>
>>
>>
>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
>>
>> I believe this would greatly depend on your use case and your familiarity
>> with the languages.
>>
>>
>>
>> In general, scala would have a much better performance than python and
>> not all interfaces are available in python.
>>
>> That said, if you are planning to use dataframes without any UDF then the
>> performance hit is practically nonexistent.
>>
>> Even if you need UDF, it is possible to write those in scala and wrap
>> them for python and still get away without the performance hit.
>>
>> Python does not have interfaces for UDAFs.
>>
>>
>>
>> I believe that if you have large structured data and do not generally
>> need UDF/UDAF you can certainly work in python without losing too much.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]]
>> *Sent:* Thursday, September 01, 2016 5:03 AM
>> *To:* user
>> *Subject:* Scala Vs Python
>>
>>
>>

Re: Scala Vs Python

2016-09-04 Thread ayan guha
Hi

This one is quite interesting. Is it possible to share few toy examples?

On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <assaf.mendel...@rsa.com>
wrote:

> I am not aware of any official testing but you can easily create your own.
>
> In tests I ran, I saw that Python UDFs were more than 10 times slower
> than Scala UDFs (and in some cases it was closer to 50 times slower).
>
> That said, it would depend on how you use your UDF.
>
> For example, lets say you have a 1 billion row table which you do some
> aggregation on and left with a 10K rows table. If you do the python UDF in
> the beginning then it might have a hard hit but if you do it on the 10K
> rows table then the overhead might be negligible.
>
> Furthermore, you can always write the UDF in scala and wrap it.
>
> This is something my team did. We have data scientists working on spark in
> python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF they simply ask my
> team (which does the engineering) and we write them a scala one and then
> wrap it to be accessible from python.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
>
>
> Thanks All for your replies.
>
>
>
> Feature Parity:
>
>
>
> MLLib, RDD and dataframes features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>
>
>
> Performance:
>
> I do  get this bit when RDDs are involved, but not when Data frame is the
> only construct I am operating on.  Dataframe supposed to be
> language-agnostic in terms of performance.  So why people think python is
> slower? is it because of using UDF? Any other reason?
>
>
>
> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
> comparison? like the one out there  b/w RDDs.*
>
>
>
> @Kant:  I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work, if you have any benchmarks please share if
> possible.
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
>
> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases
> or Large Scale Distributed Systems makes absolutely no sense. I can write a
> 10 page essay on why that wouldn't work so great. you might be wondering
> why would spark have it then? well probably because its ease of use for ML
> (that would be my best guess).
>
>
>
>
>
> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
>
> I believe this would greatly depend on your use case and your familiarity
> with the languages.
>
>
>
> In general, scala would have a much better performance than python and not
> all interfaces are available in python.
>
> That said, if you are planning to use dataframes without any UDF then the
> performance hit is practically nonexistent.
>
> Even if you need UDF, it is possible to write those in scala and wrap them
> for python and still get away without the performance hit.
>
> Python does not have interfaces for UDAFs.
>
>
>
> I believe that if you have large structured data and do not generally need
> UDF/UDAF you can certainly work in python without losing too much.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]]
> *Sent:* Thursday, September 01, 2016 5:03 AM
> *To:* user
> *Subject:* Scala Vs Python
>
>
>
> Hi Users
>
>
>
> Thought to ask (again and again) the question: While I am building any
> production application, should I use Scala or Python?
>
>
>
> I have read many if not most articles but all seems pre-Spark 2. Anything
> changed with Spark 2? Either pro-scala way or pro-python way?
>
>
>
> I am thinking performance, feature parity and future direction, not so
> much in terms of skillset or ease of use.
>
>
>
> Or, if you think it is a moot point, please say so as well.
>
>
>
> Any real life example, production experience, anecdotes, personal taste,
> profanity all are welcome :)
>
>
>
> --

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I am not aware of any official testing but you can easily create your own.
In tests I ran, I saw that Python UDFs were more than 10 times slower than 
Scala UDFs (and in some cases it was closer to 50 times slower).
That said, it depends on how you use your UDF.
For example, let's say you have a 1-billion-row table on which you do some 
aggregation and are left with a 10K-row table. If you apply the Python UDF at 
the beginning it might hit hard, but if you apply it to the 10K-row table the 
overhead might be negligible.
Furthermore, you can always write the UDF in scala and wrap it.
This is something my team did. We have data scientists working on spark in 
python. Normally, they can use the existing functions to do what they need 
(Spark already has a pretty nice spread of functions which answer most of the 
common use cases). When they need a new UDF or UDAF they simply ask my team 
(which does the engineering) and we write them a scala one and then wrap it to 
be accessible from python.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Friday, September 02, 2016 12:21 AM
To: kant kodali
Cc: Mendelson, Assaf; user
Subject: Re: Scala Vs Python

Thanks All for your replies.

Feature Parity:

MLLib, RDD and dataframes features are totally comparable. Streaming is now at 
par in functionality too, I believe. However, what really worries me is not 
having Dataset APIs at all in Python. I think thats a deal breaker.

Performance:
I do  get this bit when RDDs are involved, but not when Data frame is the only 
construct I am operating on.  Dataframe supposed to be language-agnostic in 
terms of performance.  So why people think python is slower? is it because of 
using UDF? Any other reason?

Is there any kind of benchmarking/stats around Python UDF vs Scala UDF 
comparison? like the one out there  b/w RDDs.

@Kant:  I am not comparing ANY applications. I am comparing SPARK applications 
only. I would be glad to hear your opinion on why pyspark applications will not 
work, if you have any benchmarks please share if possible.





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali 
<kanth...@gmail.com<mailto:kanth...@gmail.com>> wrote:
c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases or 
Large Scale Distributed Systems makes absolutely no sense. I can write a 10 
page essay on why that wouldn't work so great. you might be wondering why would 
spark have it then? well probably because its ease of use for ML (that would be 
my best guess).





On Wed, Aug 31, 2016 11:45 PM, AssafMendelson 
assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com> wrote:

I believe this would greatly depend on your use case and your familiarity with 
the languages.



In general, Scala will have much better performance than Python, and not all 
interfaces are available in Python.

That said, if you are planning to use DataFrames without any UDFs, the 
performance hit is practically nonexistent.

Even if you need UDFs, it is possible to write them in Scala and wrap them for 
Python and still get away without the performance hit.

Python does not have interfaces for UDAFs.



I believe that if you have large structured data and do not generally need 
UDF/UDAF you can certainly work in python without losing too much.





From: ayan guha [mailto:[hidden email]]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python



Hi Users



Thought to ask (again and again) the question: While I am building any 
production application, should I use Scala or Python?



I have read many if not most articles but all seems pre-Spark 2. Anything 
changed with Spark 2? Either pro-scala way or pro-python way?



I am thinking performance, feature parity and future direction, not so much in 
terms of skillset or ease of use.



Or, if you think it is a moot point, please say so as well.



Any real life example, production experience, anecdotes, personal taste, 
profanity all are welcome :)



--

Best Regards,
Ayan Guha


View this message in context: RE: Scala Vs Python
<http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
Sent from the Apache Spark User List mailing list archive
<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.



--
Best Regards,
Ayan Guha




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Scala Vs Python

2016-09-02 Thread darren
I politely disagree. The JVM is one VM; Python has another. It's less about 
preference and more about where the industry's skills are going for data 
analysis, BI, etc. No one cares about JVM vs. PVM. They do care about time. So 
if the time to prototype is 10x faster (in calendar days) but the VM is slower 
in CPU cycles, the greater benefit decides what's best. The industry trend is 
clear now.


Sent from my Verizon, Samsung Galaxy smartphone
 Original message From: Sivakumaran S <siva.kuma...@me.com> 
Date: 9/2/16  4:03 AM  (GMT-05:00) To: Mich Talebzadeh 
<mich.talebza...@gmail.com> Cc: Jakob Odersky <ja...@odersky.com>, ayan guha 
<guha.a...@gmail.com>, Tal Grynbaum <tal.grynb...@gmail.com>, darren 
<dar...@ontrenet.com>, kant kodali <kanth...@gmail.com>, AssafMendelson 
<assaf.mendel...@rsa.com>, user <user@spark.apache.org> Subject: Re: Scala Vs 
Python 
Whatever benefits you may accrue from the rapid prototyping and coding in 
Python, it will be offset against the time taken to convert it to run inside 
the JVM. This of course depends on the complexity of the DAG. I guess it is a 
matter of language preference. 
Regards,
Sivakumaran S
On 02-Sep-2016, at 8:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
From an outsider point of view nobody likes change :)
However, it appears to me that Scala is a rising star and if one learns it, it 
is another iron in the fire so to speak. I believe as we progress in time Spark 
is going to move away from Python. If you look at 2014 Databricks code 
examples, they were mostly in Python. Now they are mostly in Scala for a reason.
HTH




Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 2 September 2016 at 08:23, Jakob Odersky <ja...@odersky.com> wrote:
Forgot to answer your question about feature parity of Python w.r.t. Spark's 
different components
I mostly work with Scala so I can't say for sure, but I think that all pre-2.0 
features (that's basically everything except Structured Streaming) are on par. 
Structured Streaming is a pretty new feature and Python support is currently 
not available. The API is not final, however, and I reckon that Python support 
will arrive once it gets finalized, probably in the next version.







Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
No offence taken. Glad that it was rectified.

Cheers

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 16:03, Nicholas Chammas 
wrote:

> I apologize for my harsh tone. You are right, it was unnecessary and
> discourteous.
>
> On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> You made such statement:
>>
>> "That's complete nonsense."
>>
>> That is a strong language and void of any courtesy. Only dogmatic
>> individuals make such statements, engaging the keyboard before thinking
>> about it.
>>
>> You are perfectly in your right to agree to differ. However, that does
>> not give you the right to call other peoples opinion nonsense.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 September 2016 at 15:54, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> You made a specific claim -- that Spark will move away from Python --
>>> which I responded to with clear references and data. How on earth is that a
>>> "religious argument"?
>>>
>>> I'm not saying that Python is better than Scala or anything like that.
>>> I'm just addressing your specific claim about its future in the Spark
>>> project.
>>>
>>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Right so. We are back into religious arguments. Best of luck



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 2 September 2016 at 15:35, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I believe as we progress in time Spark is going to move away from
>> Python. If you look at 2014 Databricks code examples, they were
>> mostly in Python. Now they are mostly in Scala for a reason.
>>
>
> That's complete nonsense.
>
> First off, you can find dozens and dozens of Python code examples
> here: https://github.com/apache/spark/tree/master/examples/src/main/python
>
> The Python API was added to Spark in 0.7.0
> , back in
> February of 2013, before Spark was even accepted into the Apache 
> incubator.
> Since then it's undergone major and continuous development. Though it does
> lag behind the Scala API in some areas, it's a first-class language and
> bringing it up to parity with Scala is an explicit project goal. A quick
> example off the top of my head is all the work that's going into model
> import/export for Python: SPARK-11939
> 
>
> Additionally, according to the 2015 Spark Survey
> ,
> 58% of Spark users use the Python API, more than any other language save
> for Scala (71%). (Users can select multiple languages on the survey.)
> Python users were also the 3rd-fastest growing "demographic" for Spark,
> after Windows and Spark Streaming users.
>
> Any notion that Spark is going to "move away from Python" is
> completely contradicted by the facts.

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
I apologize for my harsh tone. You are right, it was unnecessary and
discourteous.

On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh 
wrote:

> Hi,
>
> You made this statement:
>
> "That's complete nonsense."
>
> That is strong language, void of any courtesy. Only dogmatic
> individuals make such statements, engaging the keyboard before thinking
> things through.
>
> You are perfectly within your rights to agree to differ. However, that does not
> give you the right to call other people's opinions nonsense.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:54, Nicholas Chammas  > wrote:
>
>> You made a specific claim -- that Spark will move away from Python --
>> which I responded to with clear references and data. How on earth is that a
>> "religious argument"?
>>
>> I'm not saying that Python is better than Scala or anything like that.
>> I'm just addressing your specific claim about its future in the Spark
>> project.
>>
>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Right so. We are back into religious arguments. Best of luck
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 September 2016 at 15:35, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> I believe as we progress in time Spark is going to move away from
> Python. If you look at 2014 Databricks code examples, they were
> mostly in Python. Now they are mostly in Scala for a reason.
>

 That's complete nonsense.

 First off, you can find dozens and dozens of Python code examples here:
 https://github.com/apache/spark/tree/master/examples/src/main/python

 The Python API was added to Spark in 0.7.0
 , back in
 February of 2013, before Spark was even accepted into the Apache incubator.
 Since then it's undergone major and continuous development. Though it does
 lag behind the Scala API in some areas, it's a first-class language and
 bringing it up to parity with Scala is an explicit project goal. A quick
 example off the top of my head is all the work that's going into model
 import/export for Python: SPARK-11939
 

 Additionally, according to the 2015 Spark Survey
 ,
 58% of Spark users use the Python API, more than any other language save
 for Scala (71%). (Users can select multiple languages on the survey.)
 Python users were also the 3rd-fastest growing "demographic" for Spark,
 after Windows and Spark Streaming users.

 Any notion that Spark is going to "move away from Python" is completely
 contradicted by the facts.

 Nick


>>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
You made a specific claim -- that Spark will move away from Python -- which
I responded to with clear references and data. How on earth is that a
"religious argument"?

I'm not saying that Python is better than Scala or anything like that. I'm
just addressing your specific claim about its future in the Spark project.

On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh 
wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> , back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> 
>>
>> Additionally, according to the 2015 Spark Survey
>> ,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
>


Re: Scala Vs Python

2016-09-02 Thread andy petrella
looking at the examples, indeed they make nonsense :D

On Fri, 2 Sep 2016 16:48 Mich Talebzadeh,  wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> , back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> 
>>
>> Additionally, according to the 2015 Spark Survey
>> ,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
> --
andy


Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
Right so. We are back into religious arguments. Best of luck



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 15:35, Nicholas Chammas 
wrote:

> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
> wrote:
>
>> I believe as we progress in time Spark is going to move away from Python. If
>> you look at 2014 Databricks code examples, they were mostly in Python. Now
>> they are mostly in Scala for a reason.
>>
>
> That's complete nonsense.
>
> First off, you can find dozens and dozens of Python code examples here:
> https://github.com/apache/spark/tree/master/examples/src/main/python
>
> The Python API was added to Spark in 0.7.0
> , back in
> February of 2013, before Spark was even accepted into the Apache incubator.
> Since then it's undergone major and continuous development. Though it does
> lag behind the Scala API in some areas, it's a first-class language and
> bringing it up to parity with Scala is an explicit project goal. A quick
> example off the top of my head is all the work that's going into model
> import/export for Python: SPARK-11939
> 
>
> Additionally, according to the 2015 Spark Survey
> ,
> 58% of Spark users use the Python API, more than any other language save
> for Scala (71%). (Users can select multiple languages on the survey.)
> Python users were also the 3rd-fastest growing "demographic" for Spark,
> after Windows and Spark Streaming users.
>
> Any notion that Spark is going to "move away from Python" is completely
> contradicted by the facts.
>
> Nick
>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
wrote:

> I believe as we progress in time Spark is going to move away from Python. If
> you look at 2014 Databricks code examples, they were mostly in Python. Now
> they are mostly in Scala for a reason.
>

That's complete nonsense.

First off, you can find dozens and dozens of Python code examples here:
https://github.com/apache/spark/tree/master/examples/src/main/python

The Python API was added to Spark in 0.7.0
, back in February
of 2013, before Spark was even accepted into the Apache incubator. Since
then it's undergone major and continuous development. Though it does lag
behind the Scala API in some areas, it's a first-class language and
bringing it up to parity with Scala is an explicit project goal. A quick
example off the top of my head is all the work that's going into model
import/export for Python: SPARK-11939


Additionally, according to the 2015 Spark Survey
,
58% of Spark users use the Python API, more than any other language save
for Scala (71%). (Users can select multiple languages on the survey.)
Python users were also the 3rd-fastest growing "demographic" for Spark,
after Windows and Spark Streaming users.

Any notion that Spark is going to "move away from Python" is completely
contradicted by the facts.

Nick


Re: Scala Vs Python

2016-09-02 Thread Sivakumaran S
Whatever benefits you may accrue from the rapid prototyping and coding in 
Python, it will be offset against the time taken to convert it to run inside 
the JVM. This of course depends on the complexity of the DAG. I guess it is a 
matter of language preference. 

Regards,

Sivakumaran S
> On 02-Sep-2016, at 8:58 AM, Mich Talebzadeh  wrote:
> 
> From an outsider point of view nobody likes change :)
> 
> However, it appears to me that Scala is a rising star and if one learns it, 
> it is another iron in the fire so to speak. I believe as we progress in time 
> Spark is going to move away from Python. If you look at 2014 Databricks code 
> examples, they were mostly in Python. Now they are mostly in Scala for a 
> reason.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 08:23, Jakob Odersky  > wrote:
> Forgot to answer your question about feature parity of Python w.r.t. Spark's 
> different components
> I mostly work with scala so I can't say for sure but I think that all pre-2.0 
> features (that's basically everything except Structured Streaming) are on 
> par. Structured Streaming is a pretty new feature and Python support is 
> currently not available. The API is not final however and I reckon that 
> Python support will arrive once it gets finalized, probably in the next 
> version.
> 
> 



Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
From an outsider's point of view, nobody likes change :)

However, it appears to me that Scala is a rising star and if one learns it,
it is another iron in the fire so to speak. I believe as we progress in
time Spark is going to move away from Python. If you look at 2014
Databricks code examples, they were mostly in Python. Now they are mostly
in Scala for a reason.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 08:23, Jakob Odersky  wrote:

> Forgot to answer your question about feature parity of Python w.r.t.
> Spark's different components
> I mostly work with scala so I can't say for sure but I think that all
> pre-2.0 features (that's basically everything except Structured Streaming)
> are on par. Structured Streaming is a pretty new feature and Python support
> is currently not available. The API is not final however and I reckon that
> Python support will arrive once it gets finalized, probably in the next
> version.
>
>


Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t.
Spark's different components
I mostly work with scala so I can't say for sure but I think that all
pre-2.0 features (that's basically everything except Structured Streaming)
are on par. Structured Streaming is a pretty new feature and Python support
is currently not available. The API is not final however and I reckon that
Python support will arrive once it gets finalized, probably in the next
version.


Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that
functionality is implemented in Scala, so the API in that language is
"free" whereas Python support needs to be added explicitly. Nevertheless,
Python bindings are an important part of Spark and are used by many people
(this info could be outdated but Python used to be the second most popular
language after Scala). I expect Python support to only get better in the
future so I think it is fair to say that Python is a first-class citizen in
Spark.

Regarding performance, the issue is more complicated. This is mostly due to
the fact that the actual execution of actions happens in JVM-land and any
correspondance between Python and the JVM is expensive. So the question
basically boils down to "how often does python need to communicate with the
JVM"? The answer depends on the Spark APIs you're using:

1. Plain old RDDs: for every function you pass to a transformation (filter,
map, etc.) an intermediate result will be shipped to a Python interpreter,
the function applied, and finally the result shipped back to the JVM.
2. DataFrames with RDD-like transformations or User Defined Functions: same
as point 1, any functions are applied in a Python environment and hence
data needs to be transferred.
3. DataFrames with only SQL expressions: Spark query optimizer will take
care of computing and executing an internal representation of your
transformations and no data communication needs to happen between Python
and the JVM (apart from final results in case you asked for them, i.e. by
calling a collect()).

In cases 1 and 2, there will be a performance penalty compared to
equivalent Scala or Java versions. The difference in case 3 is negligible,
as all language APIs share the same backend. See this blog post from
Databricks for some more detailed information:
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html

I hope this was the kind of information you were looking for. Please note,
however, that performance in Spark is a complex topic; the scenarios
mentioned above should nevertheless give you a rule of thumb.

best,
--Jakob

On Thu, Sep 1, 2016 at 11:25 PM, ayan guha  wrote:

> Tal: I think by nature of the project itself, Python APIs are developed
> after Scala and Java, and it is a fair trade off between speed of getting
> stuff to market. And more and more this discussion is progressing, I see
> not much issue in terms of feature parity.
>
> Coming back to performance, Darren raised a good point: if I can scale
> out, individual VM performance should not matter much. But performance is
> often stated as a definitive downside of using Python over scala/java. I am
> trying to understand the truth and myth behind this claim. Any pointer
> would be great.
>
> best
> Ayan
>
> On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
> wrote:
>
>>
>> On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:
>>
>>> This topic is a concern for us as well. In the data science world no one
>>> uses native scala or java by choice. It's R and Python. And python is
>>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>>
>>> This is why we have decoupled from spark in our project. It's really
>>> unfortunate the Spark team has invested so heavily in Scala.
>>>
>>> As for speed it comes from horizontal scaling and throughout. When you
>>> can scale outward, individual VM performance is less an issue. Basic HPC
>>> principles.
>>>
>>
>> Darren,
>>
>> My guess is that data scientist who will decouple themselves from spark,
>> will eventually left with more or less nothing. (single process
>> capabilities, or purely performing HPC's) (unless, unlikely, some good
>> spark competitor will emerge.  unlikely, simply because there is no need
>> for such).
>> But putting guessing aside - the reason python is 3rd in line for feature
>> support, is not because the spark developers were busy with scala, it's
>> because the features that are missing are those that support strong typing.
>> which is not relevant to python.  in other words, even if spark was
>> rewritten in python, and was to focus on python only, you would still not
>> get those features.
>>
>>
>>
>> --
>> *Tal Grynbaum* / *CTO & co-founder*
>>
>> m# +972-54-7875797
>>
>> mobile retention done right
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


RE: Scala Vs Python

2016-09-02 Thread Santoshakhilesh
I have seen a talk by Brian Clapper in NE-SCALA 2016 - RDDs, DataFrames and 
Datasets @ Apache Spark - NE Scala 2016

At 15:00 there is a slide showing a comparison of aggregating 10 million
integer pairs using RDDs and DataFrames with different language bindings
(Scala, Python, R).

As per this slide, the DataFrame APIs outperform RDDs, and all the language
bindings perform the same with DataFrames, while the RDD with Python is way
slower than the Scala version. So I guess there is some reality to Scala
bindings being faster in some cases.

At 30:23 he presents a slide showing serialization performance: Dataset
encoders are way faster than Java and Kryo serialization.

But as always, the proof of the pudding is in the eating, so why don't you
try some samples to see for yourself?
I personally have found that my app runs a bit faster with the Scala
version than the Java one, but I am not yet able to figure out the reason.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 02 September 2016 15:25
To: Tal Grynbaum
Cc: darren; Mich Talebzadeh; Jakob Odersky; kant kodali; AssafMendelson; user
Subject: Re: Scala Vs Python

Tal: I think by nature of the project itself, Python APIs are developed after 
Scala and Java, and it is a fair trade off between speed of getting stuff to 
market. And more and more this discussion is progressing, I see not much issue 
in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out, 
individual VM performance should not matter much. But performance is often 
stated as a definitive downside of using Python over scala/java. I am trying to 
understand the truth and myth behind this claim. Any pointer would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
<tal.grynb...@gmail.com> wrote:

On Fri, Sep 2, 2016 at 1:15 AM, darren 
<dar...@ontrenet.com> wrote:
This topic is a concern for us as well. In the data science world no one uses 
native scala or java by choice. It's R and Python. And python is growing. Yet 
in spark, python is 3rd in line for feature support, if at all.

This is why we have decoupled from spark in our project. It's really 
unfortunate the Spark team has invested so heavily in Scala.

As for speed it comes from horizontal scaling and throughout. When you can 
scale outward, individual VM performance is less an issue. Basic HPC principles.

Darren,

My guess is that data scientist who will decouple themselves from spark, will 
eventually left with more or less nothing. (single process capabilities, or 
purely performing HPC's) (unless, unlikely, some good spark competitor will 
emerge.  unlikely, simply because there is no need for such).
But putting guessing aside - the reason python is 3rd in line for feature 
support, is not because the spark developers were busy with scala, it's because 
the features that are missing are those that support strong typing. which is 
not relevant to python.  in other words, even if spark was rewritten in python, 
and was to focus on python only, you would still not get those features.



--
Tal Grynbaum / CTO & co-founder

m# +972-54-7875797
mobile retention done right



--
Best Regards,
Ayan Guha


Re: Scala Vs Python

2016-09-02 Thread ayan guha
Tal: I think by the nature of the project itself, Python APIs are developed
after Scala and Java, and that is a fair trade-off for the speed of getting
stuff to market. And as this discussion progresses, I see not much issue in
terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out,
individual VM performance should not matter much. But performance is often
stated as a definitive downside of using Python over scala/java. I am
trying to understand the truth and myth behind this claim. Any pointer
would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum  wrote:

>
> On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:
>
>> This topic is a concern for us as well. In the data science world no one
>> uses native scala or java by choice. It's R and Python. And python is
>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>
>> This is why we have decoupled from spark in our project. It's really
>> unfortunate the Spark team has invested so heavily in Scala.
>>
>> As for speed it comes from horizontal scaling and throughout. When you
>> can scale outward, individual VM performance is less an issue. Basic HPC
>> principles.
>>
>
> Darren,
>
> My guess is that data scientist who will decouple themselves from spark,
> will eventually left with more or less nothing. (single process
> capabilities, or purely performing HPC's) (unless, unlikely, some good
> spark competitor will emerge.  unlikely, simply because there is no need
> for such).
> But putting guessing aside - the reason python is 3rd in line for feature
> support, is not because the spark developers were busy with scala, it's
> because the features that are missing are those that support strong typing.
> which is not relevant to python.  in other words, even if spark was
> rewritten in python, and was to focus on python only, you would still not
> get those features.
>
>
>
> --
> *Tal Grynbaum* / *CTO & co-founder*
>
> m# +972-54-7875797
>
> mobile retention done right
>



-- 
Best Regards,
Ayan Guha


Re: Scala Vs Python

2016-09-02 Thread Tal Grynbaum
On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:

> This topic is a concern for us as well. In the data science world no one
> uses native scala or java by choice. It's R and Python. And python is
> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>
> This is why we have decoupled from spark in our project. It's really
> unfortunate the Spark team has invested so heavily in Scala.
>
> As for speed it comes from horizontal scaling and throughout. When you can
> scale outward, individual VM performance is less an issue. Basic HPC
> principles.
>

Darren,

My guess is that data scientists who decouple themselves from Spark will
eventually be left with more or less nothing (single-process capabilities,
or purely performing HPC), unless, unlikely, some good Spark competitor
emerges; unlikely, simply because there is no need for one.
But guessing aside: the reason Python is 3rd in line for feature support is
not that the Spark developers were busy with Scala; it's that the missing
features are those that support strong typing, which is not relevant to
Python. In other words, even if Spark were rewritten in Python and focused
on Python only, you would still not get those features.



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: Scala Vs Python

2016-09-01 Thread ayan guha
really worries me is not having Dataset APIs at all in
>>>>> Python. I think thats a deal breaker.
>>>>>
>>>>> What is the functionality you are missing? In Spark 2.0 a DataFrame is
>>>>> just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
>>>>> core/.../o/a/s/sql/package.scala).
>>>>> Since python is dynamically typed, you wouldn't really gain anything
>>>>> by using Datasets anyway.
>>>>>
>>>>> On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> Thanks All for your replies.
>>>>>>
>>>>>> Feature Parity:
>>>>>>
>>>>>> MLLib, RDD and dataframes features are totally comparable. Streaming
>>>>>> is now at par in functionality too, I believe. However, what really 
>>>>>> worries
>>>>>> me is not having Dataset APIs at all in Python. I think thats a deal
>>>>>> breaker.
>>>>>>
>>>>>> Performance:
>>>>>> I do  get this bit when RDDs are involved, but not when Data frame is
>>>>>> the only construct I am operating on.  Dataframe supposed to be
>>>>>> language-agnostic in terms of performance.  So why people think python is
>>>>>> slower? is it because of using UDF? Any other reason?
>>>>>>
>>>>>> *Is there any kind of benchmarking/stats around Python UDF vs Scala
>>>>>> UDF comparison? like the one out there  b/w RDDs.*
>>>>>>
>>>>>> @Kant:  I am not comparing ANY applications. I am comparing SPARK
>>>>>> applications only. I would be glad to hear your opinion on why pyspark
>>>>>> applications will not work, if you have any benchmarks please share if
>>>>>> possible.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> C'mon man, this is a no-brainer. Dynamically typed languages for
>>>>>>> large code bases or large-scale distributed systems make absolutely
>>>>>>> no sense. I can write a 10-page essay on why that wouldn't work so
>>>>>>> great. You might be wondering why Spark has it then? Well, probably
>>>>>>> because of its ease of use for ML (that would be my best guess).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson
>>>>>>> assaf.mendel...@rsa.com wrote:
>>>>>>>
>>>>>>>> I believe this would greatly depend on your use case and your
>>>>>>>> familiarity with the languages.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In general, scala would have a much better performance than python
>>>>>>>> and not all interfaces are available in python.
>>>>>>>>
>>>>>>>> That said, if you are planning to use dataframes without any UDF
>>>>>>>> then the performance hit is practically nonexistent.
>>>>>>>>
>>>>>>>> Even if you need UDF, it is possible to write those in scala and
>>>>>>>> wrap them for python and still get away without the performance hit.
>>>>>>>>
>>>>>>>> Python does not have interfaces for UDAFs.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe that if you have large structured data and do not
>>>>>>>> generally need UDF/UDAF you can certainly work in python without 
>>>>>>>> losing too
>>>>>>>> much.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* ayan guha [mailto:[hidden email]
>>>>>>>> <http:///user/SendEmail.jtp?type=node=27637=0>]
>>>>>>>> *Sent:* Thursday, September 01, 2016 5:03 AM
>>>>>>>> *To:* user
>>>>>>>> *Subject:* Scala Vs Python
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Users
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thought to ask (again and again) the question: While I am building
>>>>>>>> any production application, should I use Scala or Python?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have read many if not most articles but all seems pre-Spark 2.
>>>>>>>> Anything changed with Spark 2? Either pro-scala way or pro-python way?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am thinking performance, feature parity and future direction, not
>>>>>>>> so much in terms of skillset or ease of use.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Or, if you think it is a moot point, please say so as well.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Any real life example, production experience, anecdotes, personal
>>>>>>>> taste, profanity all are welcome :)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context: RE: Scala Vs Python
>>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
>>>>>>>> Sent from the Apache Spark User List mailing list archive
>>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha



Re: Scala Vs Python

2016-09-01 Thread darren
This topic is a concern for us as well. In the data science world no one uses 
native Scala or Java by choice; it's R and Python, and Python is growing. Yet 
in Spark, Python is third in line for feature support, if at all.
This is why we have decoupled from Spark in our project. It's really 
unfortunate the Spark team has invested so heavily in Scala.
As for speed, it comes from horizontal scaling and throughput. When you can 
scale outward, individual VM performance is less of an issue. Basic HPC principles.


Sent from my Verizon, Samsung Galaxy smartphone
 Original message From: Mich Talebzadeh 
<mich.talebza...@gmail.com> Date: 9/1/16  6:01 PM  (GMT-05:00) To: Jakob 
Odersky <ja...@odersky.com> Cc: ayan guha <guha.a...@gmail.com>, kant kodali 
<kanth...@gmail.com>, AssafMendelson <assaf.mendel...@rsa.com>, user 
<user@spark.apache.org> Subject: Re: Scala Vs Python 
Hi Jacob.
My understanding of Dataset is that it is basically an RDD with some 
optimization gone into it. RDD is meant to deal with unstructured data?
Now DataFrame is the tabular format of RDD designed for tabular work, csv, SQL 
stuff etc.
When you mention DataFrame is just an alias for Dataset[Row] does that mean  
that it converts an RDD to DataSet thus producing a tabular format?
Thanks




Re: Scala Vs Python

2016-09-01 Thread Peyman Mohajerian
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Jacob.
>
> My understanding of Dataset is that it is basically an RDD with some
> optimization gone into it. RDD is meant to deal with unstructured data?
>
> Now DataFrame is the tabular format of RDD designed for tabular work, csv,
> SQL stuff etc.
>
> When you mention DataFrame is just an alias for Dataset[Row] does that
> mean  that it converts an RDD to DataSet thus producing a tabular format?
>
> Thanks

Re: Scala Vs Python

2016-09-01 Thread Mich Talebzadeh
Hi Jakob.

My understanding of a Dataset is that it is basically an RDD with some
optimization gone into it. An RDD is meant to deal with unstructured data?

Now a DataFrame is the tabular form of an RDD, designed for tabular work: CSV,
SQL and so on.

When you mention that DataFrame is just an alias for Dataset[Row], does that
mean it converts an RDD to a Dataset, thus producing a tabular format?

Thanks


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:

> > However, what really worries me is not having Dataset APIs at all in
> Python. I think thats a deal breaker.
>
> What is the functionality you are missing? In Spark 2.0 a DataFrame is
> just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
> core/.../o/a/s/sql/package.scala).
> Since python is dynamically typed, you wouldn't really gain anything by
> using Datasets anyway.

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
> However, what really worries me is not having the Dataset API at all in
Python. I think that's a deal breaker.

What is the functionality you are missing? In Spark 2.0 a DataFrame is just
an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
core/.../o/a/s/sql/package.scala).
Since Python is dynamically typed, you wouldn't really gain anything by
using Datasets anyway.
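
Jakob's point can be sketched without Spark at all. In Scala, Dataset[Person] lets the compiler check field access and types; the closest Python analogue, type annotations, is not enforced at runtime, so a hypothetical typed Dataset would add little safety. A toy illustration in plain Python (the Person class is invented for the example):

```python
from typing import NamedTuple

class Person(NamedTuple):
    name: str
    age: int

# In Scala, Dataset[Person] means ds.map(_.age + 1) is checked by the
# compiler. Python type hints are not enforced at runtime, so a "typed"
# row happily accepts the wrong types -- which is the point above: without
# compile-time checking, a Dataset[T] API would buy Python users little.
p = Person(name=123, age="not a number")  # no error at runtime
print(type(p.name).__name__, type(p.age).__name__)  # int str
```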

On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:



Re: Scala Vs Python

2016-09-01 Thread ayan guha
Thanks All for your replies.

Feature Parity:

MLlib, RDD and DataFrame features are broadly comparable. Streaming is now
at par in functionality too, I believe. However, what really worries me is
not having the Dataset API at all in Python. I think that's a deal breaker.

Performance:
I do get this bit when RDDs are involved, but not when DataFrame is the
only construct I am operating on. DataFrames are supposed to be
language-agnostic in terms of performance. So why do people think Python is
slower? Is it because of UDFs? Any other reason?

*Is there any kind of benchmarking/stats on Python UDF vs Scala UDF, like
the ones out there for RDDs?*

@Kant: I am not comparing ANY applications, only SPARK applications. I
would be glad to hear your opinion on why PySpark applications will not
work; if you have any benchmarks, please share them if possible.
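
On the performance question: the penalty usually attributed to Python UDFs is the serialization round trip between the JVM and the Python worker, with the function applied row at a time. A rough stand-in for that cost in plain Python, no Spark required (the pickle round trip is only an analogy for Spark's actual JVM-to-worker transfer):

```python
import pickle

def plus_one(x):
    return x + 1

def apply_udf_with_round_trip(rows, fn):
    # Crude stand-in for what a Python UDF costs: the batch of rows is
    # serialized out of the JVM, deserialized in a Python worker, the
    # function applied row by row, and the results serialized back.
    shipped = pickle.dumps(rows)                 # JVM -> Python worker
    local = pickle.loads(shipped)
    results = [fn(r) for r in local]             # row-at-a-time Python calls
    return pickle.loads(pickle.dumps(results))   # Python worker -> JVM

print(apply_udf_with_round_trip([1, 2, 3], plus_one))  # [2, 3, 4]
```

A built-in DataFrame expression skips all of this and stays inside the JVM's codegen, which is why UDF-free DataFrame code performs about the same in either language.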





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com> wrote:

> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases
> or Large Scale Distributed Systems makes absolutely no sense. I can write a
> 10 page essay on why that wouldn't work so great. you might be wondering
> why would spark have it then? well probably because its ease of use for ML
> (that would be my best guess).


-- 
Best Regards,
Ayan Guha


Re: Scala Vs Python

2016-09-01 Thread kant kodali

C'mon man, this is a no-brainer: dynamically typed languages for large code bases
or large-scale distributed systems make absolutely no sense. I can write a 10-page
essay on why that wouldn't work so great. You might be wondering why Spark would
have it, then? Well, probably because of its ease of use for ML (that would be my
best guess).
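
A minimal example of the failure mode described above, where a typo that a statically typed language would reject at compile time only surfaces at runtime (the class and field names are invented for the illustration):

```python
class Order:
    def __init__(self, order_id, amount):
        self.order_id = order_id
        self.amount = amount

def total(orders):
    # Typo: the attribute is 'amount'. Scala or Java would reject this at
    # compile time; Python only fails when this line runs on real data.
    return sum(o.amont for o in orders)

try:
    total([Order(1, 9.99)])
except AttributeError as e:
    print("caught at runtime only:", e)
```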






On Wed, Aug 31, 2016 11:45 PM, AssafMendelson assaf.mendel...@rsa.com
wrote:

RE: Scala Vs Python

2016-09-01 Thread AssafMendelson
I believe this would greatly depend on your use case and your familiarity with 
the languages.

In general, Scala will have much better performance than Python, and not all 
interfaces are available in Python.
That said, if you are planning to use DataFrames without any UDFs, the 
performance hit is practically nonexistent.
Even if you need a UDF, it is possible to write it in Scala and wrap it for 
Python, still avoiding the performance hit.
Python does not have interfaces for UDAFs.

I believe that if you have large structured data and do not generally need 
UDFs/UDAFs, you can certainly work in Python without losing too much.
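
A sketch of the "write the UDF in Scala, wrap it for Python" approach. `registerJavaFunction` is PySpark's hook for this in later releases; the Scala class name below is a placeholder, and the Scala side is shown only as a comment:

```python
# Scala side (compiled into a jar on the classpath -- hypothetical class):
#
#   import org.apache.spark.sql.api.java.UDF1
#   class NormalizeUpper extends UDF1[String, String] {
#     override def call(s: String): String = s.trim.toUpperCase
#   }

def normalize_upper(s: str) -> str:
    """Pure-Python reference of the same row-level logic, for comparison.
    Registering this directly as a Python UDF would pay the serialization
    round trip; the JVM version below does not."""
    return s.strip().upper()

def register_scala_udf(spark):
    """Register the JVM implementation for SQL/DataFrame use. The API call
    is real PySpark; the class name is a made-up example."""
    from pyspark.sql.types import StringType
    spark.udf.registerJavaFunction(
        "normalize_upper", "com.example.udfs.NormalizeUpper", StringType())
    # then: spark.sql("SELECT normalize_upper(name) FROM people")
```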


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Scala Vs Python

2016-08-31 Thread Santoshakhilesh
Hi,
I would prefer Scala if you are starting afresh, considering ease of use, 
features, performance and support together.
You will find numerous examples and plenty of support for Scala, which might 
not be true for other languages.
I personally developed the first version of my app in Java 1.6 for some 
unavoidable reasons, and my code is very verbose and ugly.
But now, with Java 8's lambda support, I think this is no longer a problem. 
About Python: there is no compile-time safety, and if you plan to use Spark 
2.0, the Dataset API is not available.
Given a choice I would prefer Scala any day, for the very simple reason that 
I would get all the future features and optimizations out of the box and I 
need to type less ☺.


Regards,
Santosh Akhilesh


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 01 September 2016 11:03
To: user
Subject: Scala Vs Python

Hi Users

Thought to ask (again and again): while building a production application, 
should I use Scala or Python?

I have read many if not most articles, but they all seem pre-Spark 2. Has 
anything changed with Spark 2, either in a pro-Scala or a pro-Python way?

I am thinking of performance, feature parity and future direction, not so much 
skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real-life examples, production experience, anecdotes, personal taste, 
profanity: all are welcome :)

--
Best Regards,
Ayan Guha


Scala Vs Python

2016-08-31 Thread ayan guha
Hi Users

Thought to ask (again and again): while building a production application,
should I use Scala or Python?

I have read many if not most articles, but they all seem pre-Spark 2. Has
anything changed with Spark 2, either in a pro-Scala or a pro-Python way?

I am thinking of performance, feature parity and future direction, not so
much skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real-life examples, production experience, anecdotes, personal taste,
profanity: all are welcome :)

-- 
Best Regards,
Ayan Guha


Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Jörn Franke
Python can access the JVM - this is how it interfaces with Spark. Some of the 
components do not have a wrapper for the corresponding Java API yet and are 
thus not accessible from Python.

Same for Elasticsearch: you need to write a more or less simple wrapper.
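Such a wrapper boils down to walking the Py4J gateway that PySpark already opens. A minimal sketch, assuming only that `spark.sparkContext._jvm` exposes JVM packages as attributes (a private but well-known handle); the JDK call here is just an illustration of the mechanism:

```python
# Sketch of a thin Python wrapper over JVM code - the same mechanism
# PySpark itself uses. Attribute access on the Py4J gateway resolves to
# classes on the driver's classpath, so a JVM-only client (e.g. for
# Elasticsearch) can be driven from Python this way.

def jvm_upper(spark, text):
    # _jvm is PySpark's (private) Py4J view of the JVM; any class on the
    # classpath can be reached through it.
    jvm = spark.sparkContext._jvm
    return jvm.java.lang.String(text).toUpperCase()
```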

> On 20 Apr 2016, at 09:53, "kramer2...@126.com" <kramer2...@126.com> wrote:
> 
> I am using Python and Spark.
> 
> I think one problem might be communicating Spark with third-party products.
> For example, combining Spark with Elasticsearch: you have to use Java or
> Scala; Python is not supported.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805p26806.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread kramer2...@126.com
I am using python and spark. 

I think one problem might be to communicate spark with third product. For
example, combine spark with elasticsearch. You have to use java or scala.
Python is not supported



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805p26806.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Zhang, Jingyu
GraphX does not support Python yet:
http://spark.apache.org/docs/latest/graphx-programming-guide.html

The workaround is to use GraphFrames (a third-party API):
https://issues.apache.org/jira/browse/SPARK-3789

but some features in Python are not the same as in Scala:
https://github.com/graphframes/graphframes/issues/57
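For reference, the GraphFrames workaround from Python might look like the sketch below. It assumes the graphframes package has been added to the session (e.g. via `--packages`) and an active SparkSession; the vertex and edge data are made up for illustration.

```python
# Sketch: building a GraphFrame from plain DataFrames, the Python-side
# alternative to GraphX (which has no Python API).

def build_friend_graph(spark):
    from graphframes import GraphFrame  # third-party, not bundled with Spark

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "friend"), ("b", "c", "follow")],
        ["src", "dst", "relationship"])
    # GraphFrame exposes motif finding, PageRank, etc. over these frames.
    return GraphFrame(vertices, edges)
```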

Jingyu

On 20 April 2016 at 16:52, sujeet jog <sujeet@gmail.com> wrote:

> It depends on the trade-offs you wish to make.
>
> Python being an interpreted language, execution speed will be lower, but
> since it is a very commonly used language, people can get hands-on quickly.
>
> Scala programs run on the JVM, so you will get good execution speed,
> although it's less common for people to know the language already.
>
>
> I believe the PySpark APIs will, in the long run, have everything the
> Scala Spark APIs offer.
>
>
>
>
>
>
> On Wed, Apr 20, 2016 at 12:14 PM, berkerkozan <berkerko...@gmail.com>
> wrote:
>
>> I know Scala better than Python, but my team (two other friends) knows
>> only Python. We want to use GraphX or maybe try GraphFrames.
>> What is the future of these two languages in the Spark ecosystem? Will
>> Python cover everything Scala can within a short time? What do you
>> advise?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

-- 
This message and its attachments may contain legally privileged or 
confidential information. It is intended solely for the named addressee. If 
you are not the addressee indicated in this message or responsible for 
delivery of the message to the addressee, you may not copy or deliver this 
message or its attachments to anyone. Rather, you should permanently delete 
this message and its attachments and kindly notify the sender by reply 
e-mail. Any content of this message and its attachments which does not 
relate to the official business of the sending company must be taken not to 
have been sent or endorsed by that company or any of its related entities. 
No warranty is made that the e-mail or attachments are free from computer 
virus or other defect.


Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread sujeet jog
It depends on the trade-offs you wish to make.

Python being an interpreted language, execution speed will be lower, but
since it is a very commonly used language, people can get hands-on quickly.

Scala programs run on the JVM, so you will get good execution speed,
although it's less common for people to know the language already.


I believe the PySpark APIs will, in the long run, have everything the
Scala Spark APIs offer.





On Wed, Apr 20, 2016 at 12:14 PM, berkerkozan <berkerko...@gmail.com> wrote:

> I know Scala better than Python, but my team (two other friends) knows only
> Python. We want to use GraphX or maybe try GraphFrames.
> What is the future of these two languages in the Spark ecosystem? Will
> Python cover everything Scala can within a short time? What do you
> advise?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Scala vs Python for Spark ecosystem

2016-04-20 Thread berkerkozan
I know Scala better than Python, but my team (two other friends) knows only
Python. We want to use GraphX or maybe try GraphFrames.
What is the future of these two languages in the Spark ecosystem? Will Python
cover everything Scala can within a short time? What do you advise?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python performance differences

2015-01-16 Thread philpearl
I was interested in this, as I had some Spark code in Python that was too slow
and wanted to know whether Scala would fix it for me. So I rewrote my code
in Scala.

In my particular case the Scala version was 10 times faster. But I think
that is because I did an awful lot of computation in my own code rather than
in a library like numpy. (I put a bit more detail here:
http://tttv-engineering.tumblr.com/post/108260351966/spark-python-vs-scala
in case you are interested.)

So there's one data point, if only for the obvious comparison of computations
in Scala to computations in pure Python.
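The "own code vs. numpy" distinction above is easy to demonstrate. A minimal sketch, assuming numpy is available; the function names are made up:

```python
# The same reduction done per-element in interpreted Python versus in one
# vectorized numpy call. The numpy version stays in compiled code, which
# is why numpy-heavy PySpark jobs often escape the pure-Python slowdown
# described above.
import numpy as np

def sum_of_squares_pure(xs):
    total = 0.0
    for x in xs:            # each iteration is interpreted bytecode
        total += x * x
    return total

def sum_of_squares_numpy(xs):
    a = np.asarray(xs, dtype=np.float64)
    return float(np.dot(a, a))   # a single call into optimized C
```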





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p21190.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python performance differences

2015-01-16 Thread Davies Liu
Hey Phil,

Thank you for sharing this. The result didn't surprise me much; it's normal to
prototype in Python, and once it gets stable and you really need the
performance, rewriting part of it in C, or all of it in another language, does
make sense and will not cost you much time.

Davies

On Fri, Jan 16, 2015 at 7:38 AM, philpearl p...@tanktop.tv wrote:
 I was interested in this, as I had some Spark code in Python that was too slow
 and wanted to know whether Scala would fix it for me. So I rewrote my code
 in Scala.

 In my particular case the Scala version was 10 times faster. But I think
 that is because I did an awful lot of computation in my own code rather than
 in a library like numpy. (I put a bit more detail here:
 http://tttv-engineering.tumblr.com/post/108260351966/spark-python-vs-scala
 in case you are interested.)

 So there's one data point, if only for the obvious comparison of computations
 in Scala to computations in pure Python.





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p21190.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python performance differences

2014-11-12 Thread Andrew Ash
Jeremy,

Did you complete this benchmark in a way that's shareable with those
interested here?

Andrew

On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'd also be interested in seeing such a benchmark.


 On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira ianferre...@hotmail.com
 wrote:

 This would be super useful. Thanks.

 On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:

 Hi Andrew,
 
 I'm putting together some benchmarks for PySpark vs Scala. I'm focusing
 on
 ML algorithms, as I'm particularly curious about the relative performance
 of
 MLlib in Scala vs the Python MLlib API vs pure Python implementations.
 
 Will share real results as soon as I have them, but roughly, in our
 hands,
 that 40% number is ballpark correct, at least for some basic operations
 (e.g.
 textFile, count, reduce).
 
 -- Jeremy
 
 -
 Jeremy Freeman, PhD
 Neuroscientist
 @thefreemanlab
 
 
 
 --
 View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p4261.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Scala vs Python performance differences

2014-11-12 Thread Samarth Mailinglist
I was about to ask this question.

On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash and...@andrewash.com wrote:

 Jeremy,

 Did you complete this benchmark in a way that's shareable with those
 interested here?

 Andrew

 On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'd also be interested in seeing such a benchmark.


 On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira ianferre...@hotmail.com
 wrote:

 This would be super useful. Thanks.

 On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:

 Hi Andrew,
 
 I'm putting together some benchmarks for PySpark vs Scala. I'm focusing
 on
 ML algorithms, as I'm particularly curious about the relative
 performance
 of
 MLlib in Scala vs the Python MLlib API vs pure Python implementations.
 
 Will share real results as soon as I have them, but roughly, in our
 hands,
 that 40% number is ballpark correct, at least for some basic operations
 (e.g.
 textFile, count, reduce).
 
 -- Jeremy
 
 -
 Jeremy Freeman, PhD
 Neuroscientist
 @thefreemanlab
 
 
 
 --
 View this message in context:
 
  http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p4261.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Scala vs Python performance differences

2014-04-15 Thread Ian Ferreira
This would be super useful. Thanks.

On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:

Hi Andrew,

I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on
ML algorithms, as I'm particularly curious about the relative performance
of
MLlib in Scala vs the Python MLlib API vs pure Python implementations.

Will share real results as soon as I have them, but roughly, in our hands,
that 40% number is ballpark correct, at least for some basic operations
(e.g.
textFile, count, reduce).

-- Jeremy

-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p4261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Scala vs Python performance differences

2014-04-14 Thread Andrew Ash
Hi Spark users,

I've always done all my Spark work in Scala, but occasionally people ask
about Python and its performance impact vs the same algorithm
implementation in Scala.

Has anyone done tests to measure the difference?

Anecdotally I've heard Python is a 40% slowdown but that's entirely hearsay.

Cheers,
Andrew


Re: Scala vs Python performance differences

2014-04-14 Thread Jeremy Freeman
Hi Andrew,

I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on
ML algorithms, as I'm particularly curious about the relative performance of
MLlib in Scala vs the Python MLlib API vs pure Python implementations. 

Will share real results as soon as I have them, but roughly, in our hands,
that 40% number is ballpark correct, at least for some basic operations (e.g.
textFile, count, reduce).

-- Jeremy

-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p4261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.