+1 on all the pointers. @Darren - it would probably be a good idea to explain your scenario a little more in terms of structured vs. unstructured datasets. Then people here can give you better input on how you can use DataFrames.
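For instance, the usual path from unstructured data to a DataFrame looks roughly like this (a minimal PySpark sketch; the file path, delimiter, and field names are made up for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="unstructured-to-df")
    sqlContext = SQLContext(sc)

    # Unstructured stage: raw lines sit in a plain RDD.
    raw = sc.textFile("events.log")

    # Impose structure with ordinary Python code.
    def parse(line):
        ts, level, msg = line.split("\t", 2)  # assumes tab-delimited lines
        return Row(ts=ts, level=level, msg=msg)

    # Structured stage: promote the parsed rows to a DataFrame,
    # letting Spark infer the schema from the Row fields.
    df = sqlContext.createDataFrame(raw.map(parse))
    df.filter(df.level == "ERROR").count()

The untyped data sits in the RDD; the DataFrame only enters the picture once some schema has been imposed on it.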
On Thu, Mar 3, 2016 at 9:43 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Plenty of people get their data in Parquet, Avro, or ORC files; or from a
> database; or do their initial loading of un- or semi-structured data using
> one of the various data source libraries
> <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help
> with type-/schema-inference.
>
> All of these paths help you get to a DataFrame very quickly.
>
> Nick
>
> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>
>> DataFrames are essentially structured tables with schemas. So where does
>> the untyped data sit before it becomes structured, if not in a
>> traditional RDD?
>>
>> For us, almost all the processing comes before there is structure to it.
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>> -------- Original message --------
>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>> Cc: user@spark.apache.org
>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
>> features
>>
>> > However, I believe, investing (or having some members of your group)
>> > learn and invest in Scala is worthwhile for a few reasons. One, you
>> > will get the performance gain, especially now with Tungsten (not sure
>> > how it relates to Python, but some other knowledgeable people on the
>> > list, please chime in).
>>
>> The more your workload uses DataFrames, the less of a difference there
>> will be between the languages (Scala, Java, Python, or R) in terms of
>> performance.
>>
>> One of the main benefits of Catalyst (which DataFrames enable) is that
>> it automatically optimizes DataFrame operations, letting you focus on
>> _what_ you want while Spark takes care of figuring out _how_.
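>>
>> For a rough illustration (a sketch assuming the pyspark shell, where
>> sqlContext is predefined, and a hypothetical people.json with "name"
>> and "age" fields):
>>
>>     df = sqlContext.read.json("people.json")        # schema is inferred
>>     adults = df.filter(df.age > 21).select("name")  # declare *what* you want
>>     adults.explain(True)  # prints the plans Catalyst derives: parsed,
>>                           # analyzed, optimized, and physical (the _how_)
>>
>> The optimized plan is the same no matter which of the four languages the
>> query was written in.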
>>
>> Tungsten takes things further by tightly managing memory using the type
>> information made available to it via DataFrames. This benefit comes into
>> play regardless of the language used.
>>
>> So in short, DataFrames are the "new RDD", i.e. the new base structure
>> you should be using in your Spark programs wherever possible. And with
>> DataFrames, the language you use matters much less in terms of
>> performance.
>>
>> Nick
>>
>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>
>>> Hello Joshua,
>>>
>>> Comments are inline...
>>>
>>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>>
>>> I haven't used Spark in the last year and a half. I am about to start a
>>> project with a new team, and we need to decide whether to use pyspark
>>> or Scala.
>>>
>>> Indeed, good questions, and they come up a lot in the trainings I have
>>> attended, where this inevitable question is raised. I believe it
>>> depends on your level of comfort, or your appetite for newer things.
>>>
>>> It is true that, for the most part, the Apache Spark committers have
>>> been committed to keeping the APIs at parity across all the language
>>> offerings, even though in some cases, Python in particular, they have
>>> lagged by a minor release. The extent to which they are committed to
>>> parity is a good sign. That may not hold for some experimental APIs,
>>> which lag behind, but for the most part they have been admirably
>>> consistent.
>>>
>>> With Python there's a minor performance hit, since there's an extra
>>> level of indirection in the architecture: an additional Python process
>>> that the executors launch to execute your pickled Python lambdas. Other
>>> than that, it boils down to your comfort zone. I recommend looking at
>>> Sameer's slides (Advanced Spark for DevOps Training), where he walks
>>> through the PySpark architecture.
>>>
>>> We are NOT a Java shop, so some of the build tools/procedures will
>>> require some learning overhead if we go the Scala route. What I want to
>>> know is: is the Scala version of Spark still far enough ahead of
>>> pyspark to be well worth any initial training overhead?
>>>
>>> If you are a very advanced Python shop, and you have in-house libraries
>>> written in Python that don't exist in Scala, or some ML libs that don't
>>> exist in the Scala version, and porting them would take a fair amount
>>> of work because the gap is too large, then perhaps it makes sense to
>>> stay put with Python.
>>>
>>> However, I believe investing in Scala (or having some members of your
>>> group learn it) is worthwhile for a few reasons. One, you will get the
>>> performance gain, especially now with Tungsten (not sure how it relates
>>> to Python, but some other knowledgeable people on the list, please
>>> chime in). Two, since Spark is written in Scala, it gives you an
>>> enormous advantage to be able to read the sources (which are well
>>> documented and highly readable) should you have to consult or learn the
>>> nuances of a certain API method or action not covered comprehensively
>>> in the docs. And finally, there is a long-term benefit in learning
>>> Scala for reasons other than Spark, for example, writing other scalable
>>> and distributed applications.
>>>
>>> Particularly, we will be using Spark Streaming. I know a couple of
>>> years ago that practically forced the decision to use Scala. Is this
>>> still the case?
>>>
>>> You'll notice that certain API calls are not available, at least for
>>> now, in Python:
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>
>>> Cheers,
>>> Jules
>>>
>>> --
>>> The Best Ideas Are Simple
>>> Jules S. Damji
>>> e-mail: dmat...@comcast.net
>>> e-mail: jules.da...@gmail.com

--
Best Regards,
Ayan Guha