Hot off the presses... Here's the closest we have to Python GraphX (and Cypher) support: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
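If you want to kick the tires, here's a minimal PySpark sketch with a made-up toy graph. It assumes a pyspark shell launched with the graphframes package (e.g. --packages graphframes:graphframes:0.1.0-spark1.6 -- check the blog post for the exact coordinates) and a sqlContext already in scope:

    from graphframes import GraphFrame

    # vertices need an "id" column; edges need "src" and "dst"
    vertices = sqlContext.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")],
        ["id", "name"])
    edges = sqlContext.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")],
        ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()

    # PageRank returns a new GraphFrame with a "pagerank" vertex column
    results = g.pageRank(resetProbability=0.15, maxIter=10)
    results.vertices.select("id", "pagerank").show()

    # Cypher-ish motif finding returns a plain DataFrame
    g.find("(a)-[e]->(b)").show()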
This was demo'd at Spark Summit NYC 2016. I'm migrating all of my GraphX code to this now.

Reminder that GraphX is a batch graph analytics tool, not a replacement for transactional graph tools like TitanDB/Gremlin/Neo4j. In other words, don't put GraphX on your users' request/response hot path! (This is one of the most common misuses of GraphX I see.)

Also, think of a DataFrame as a Dataset[Row]. This will help you bridge from "untyped" DataFrames to "typed" Datasets.

On Thu, Mar 3, 2016 at 7:46 AM, Joshua Sorrell <jsor...@gmail.com> wrote:

> Thank you, Jules, for your in-depth answer. And thanks, everyone else, for the additional info. This was very helpful.
>
> I think for proof of concept, we'll go with pyspark for dev speed. Then we'll reevaluate from there. Any timeline for when GraphX will have Python support?
>
> On Wed, Mar 2, 2016 at 5:45 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> We're veering off from the original question of this thread, but to clarify, my comment earlier was this:
>>
>> So in short, DataFrames are the "new RDD", i.e. the new base structure you should be using in your Spark programs wherever possible.
>>
>> RDDs are not going away, and clearly in your case DataFrames are not that helpful, so sure, continue to use RDDs. There's nothing wrong with that. No-one is saying you *must* use DataFrames, and Spark will continue to offer its RDD API.
>>
>> However, my original comment to Jules still stands: if you can, use DataFrames. In most cases they will offer you a better development experience and better performance across languages, and future Spark optimizations will mostly be enabled by the structure that DataFrames provide.
>>
>> DataFrames are the "new RDD" in the sense that they are the new foundation for much of the work that has been done in recent versions and that is coming in Spark 2.0 and beyond.
>>
>> Many people work with semi-structured data and have a relatively easy path to DataFrames, as I explained in my previous email. If, however, you're working with data that has very little structure, as in Darren's case, then yes, DataFrames are probably not going to help that much. Stick with RDDs and you'll be fine.
>>
>> On Wed, Mar 2, 2016 at 6:28 PM Darren Govoni <dar...@ontrenet.com> wrote:
>>
>>> Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point. And the RDD is transient.
>>>
>>> So my point is: DataFrames should not replace plain old RDDs, since RDDs allow for more flexibility, and SQL etc. is not even usable on our data while it is in an RDD. All those nice DataFrame APIs aren't usable until the data is structured, which is the core problem anyway.
>>>
>>> Sent from my Verizon Wireless 4G LTE smartphone
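To make that concrete, here is a minimal PySpark sketch of the workflow Darren describes: keep the raw scrape in a plain RDD, and promote it to a DataFrame only once fields have been extracted. The input path, the tab-delimited layout, and the extract helper are all hypothetical:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="scrape-to-df")
    sqlContext = SQLContext(sc)

    # Unstructured stage: raw scraped documents, one per record, no schema
    raw = sc.textFile("hdfs:///scraped/pages/*")

    def extract(doc):
        # Stand-in for real parsing; return structured fields or None
        parts = doc.split("\t", 1)
        if len(parts) == 2:
            return Row(url=parts[0], text=parts[1], length=len(parts[1]))
        return None

    # Structured stage: once fields exist, the DataFrame/SQL APIs apply
    df = sqlContext.createDataFrame(
        raw.map(extract).filter(lambda r: r is not None))
    df.registerTempTable("pages")
    sqlContext.sql("SELECT url, length FROM pages WHERE length > 1000").show()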
>>> -------- Original message --------
>>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>>> Date: 03/02/2016 5:43 PM (GMT-05:00)
>>> To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>> Cc: user@spark.apache.org
>>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of features
>>>
>>> Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22>, which help with type-/schema-inference.
>>>
>>> All of these paths help you get to a DataFrame very quickly.
>>>
>>> Nick
>>>
>>> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>>>
>>>> DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD?
>>>>
>>>> For us, almost all the processing comes before there is structure to it.
>>>>
>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>
>>>> -------- Original message --------
>>>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>>>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>>>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of features
>>>>
>>>> > However, I believe, investing (or having some members of your group) learn and invest in Scala is worthwhile for few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in).
>>>>
>>>> The more your workload uses DataFrames, the less of a difference there will be between the languages (Scala, Java, Python, or R) in terms of performance.
>>>>
>>>> One of the main benefits of Catalyst (which DataFrames enable) is that it automatically optimizes DataFrame operations, letting you focus on _what_ you want while Spark takes care of figuring out _how_.
>>>>
>>>> Tungsten takes things further by tightly managing memory using the type information made available to it via DataFrames. This benefit comes into play regardless of the language used.
>>>>
>>>> So in short, DataFrames are the "new RDD" -- i.e. the new base structure you should be using in your Spark programs wherever possible. And with DataFrames, what language you use matters much less in terms of performance.
>>>>
>>>> Nick
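To illustrate Nick's point, a toy PySpark comparison (the data is made up). Both filters compute the same count, but the first ships a pickled Python lambda to separate Python worker processes, while the second is a Column expression that Catalyst optimizes and executes inside the JVM, which is why the language you write it in matters much less:

    from pyspark.sql import Row

    people = sqlContext.createDataFrame(
        [Row(name="alice", age=25), Row(name="bob", age=41),
         Row(name="carol", age=62)])

    # RDD path: the lambda runs in Python worker processes
    n1 = people.rdd.filter(lambda r: r.age > 30).count()

    # DataFrame path: a Column expression, optimized by Catalyst in the JVM
    n2 = people.filter(people.age > 30).count()

    people.filter(people.age > 30).explain()  # inspect the optimized plan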
>>>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>>>
>>>>> Hello Joshua,
>>>>>
>>>>> Comments are inline...
>>>>>
>>>>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>>>>
>>>>> I haven't used Spark in the last year and a half. I am about to start a project with a new team, and we need to decide whether to use pyspark or Scala.
>>>>>
>>>>> Indeed, good questions, and they come up a lot in the trainings I have attended, where this inevitable question is raised. I believe it depends on your level of comfort, or your appetite for adventure into newer things.
>>>>>
>>>>> It is true, for the most part, that the Apache Spark committers have been committed to keeping the APIs at parity across all the language offerings, even though in some cases, in particular Python, they have lagged by a minor release. The extent to which they're committed to parity is a good sign. It might not be the case with some experimental APIs, where Python lags behind, but for the most part they have been admirably consistent.
>>>>>
>>>>> With Python there's a minor performance hit, since there's an extra level of indirection in the architecture and an additional Python process that the executors launch to execute your pickled Python lambdas. Other than that, it boils down to your comfort zone. I recommend looking at Sameer's slides from the Advanced Spark for DevOps training, where he walks through the PySpark and Python architecture.
>>>>>
>>>>> We are NOT a Java shop. So some of the build tools/procedures will require some learning overhead if we go the Scala route. What I want to know is: is the Scala version of Spark still far enough ahead of pyspark to be well worth any initial training overhead?
>>>>>
>>>>> If you are a very advanced Python shop, and you have in-house libraries written in Python that don't exist in Scala, or ML libs that don't exist in the Scala version and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.
>>>>>
>>>>> However, I believe investing in Scala (or having some members of your group learn it) is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in). Two, since Spark is written in Scala, it gives you an enormous advantage to be able to read the sources (which are well documented and highly readable) should you have to consult or learn the nuances of a certain API method or action not covered comprehensively in the docs. And finally, there's a long-term benefit in learning Scala for reasons other than Spark, for example writing other scalable and distributed applications.
>>>>>
>>>>> Particularly, we will be using Spark Streaming. I know a couple of years ago that practically forced the decision to use Scala. Is this still the case?
>>>>>
>>>>> You'll notice that certain API calls are not available, at least for now, in Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>>>
>>>>> Cheers,
>>>>> Jules
>>>>>
>>>>> --
>>>>> The Best Ideas Are Simple
>>>>> Jules S. Damji
>>>>> e-mail: dmat...@comcast.net
>>>>> e-mail: jules.da...@gmail.com

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com