Hot off the presses... Here's the closest we have to Python GraphX (and Cypher) support: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
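If you want to kick the tires, here's a minimal PySpark sketch with a made-up toy graph. It assumes a pyspark shell launched with the graphframes package (e.g. --packages graphframes:graphframes:0.1.0-spark1.6 -- check the blog post for the exact coordinates) and a sqlContext already in scope:

    from graphframes import GraphFrame

    # vertices need an "id" column; edges need "src" and "dst"
    vertices = sqlContext.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")],
        ["id", "name"])
    edges = sqlContext.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")],
        ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()

    # PageRank returns a new GraphFrame with a "pagerank" vertex column
    results = g.pageRank(resetProbability=0.15, maxIter=10)
    results.vertices.select("id", "pagerank").show()

    # Cypher-ish motif finding returns a plain DataFrame
    g.find("(a)-[e]->(b)").show()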
This was demo'd at Spark Summit NYC 2016. I'm migrating all of my GraphX code to this now.

Reminder that GraphX is a batch graph analytics tool, not a replacement for transactional graph tools like TitanDB/Gremlin/Neo4j. In other words, don't put GraphX on your users' request/response hot path! (This is one of the most common misuses of GraphX I see.)

Also, think of a DataFrame as a Dataset[Row]. This will help you bridge from "untyped" DataFrames to "typed" Datasets.

On Thu, Mar 3, 2016 at 7:46 AM, Joshua Sorrell <jsor...@gmail.com> wrote:

> Thank you, Jules, for your in-depth answer. And thanks, everyone else, for the additional info. This was very helpful.
>
> I think for proof of concept, we'll go with pyspark for dev speed. Then we'll reevaluate from there. Any timeline for when GraphX will have Python support?
>
> On Wed, Mar 2, 2016 at 5:45 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> We're veering off from the original question of this thread, but to clarify, my comment earlier was this:
>>
>> So in short, DataFrames are the "new RDD", i.e. the new base structure you should be using in your Spark programs wherever possible.
>>
>> RDDs are not going away, and clearly in your case DataFrames are not that helpful, so sure, continue to use RDDs. There's nothing wrong with that. No-one is saying you *must* use DataFrames, and Spark will continue to offer its RDD API.
>>
>> However, my original comment to Jules still stands: if you can, use DataFrames. In most cases they will offer you a better development experience and better performance across languages, and future Spark optimizations will mostly be enabled by the structure that DataFrames provide.
>>
>> DataFrames are the "new RDD" in the sense that they are the new foundation for much of the work that has been done in recent versions and that is coming in Spark 2.0 and beyond.
>>
>> Many people work with semi-structured data and have a relatively easy path to DataFrames, as I explained in my previous email. If, however, you're working with data that has very little structure, as in Darren's case, then yes, DataFrames are probably not going to help that much. Stick with RDDs and you'll be fine.
>>
>> On Wed, Mar 2, 2016 at 6:28 PM Darren Govoni <dar...@ontrenet.com> wrote:
>>
>>> Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point. And the RDD is transient.
>>>
>>> So my point is: DataFrames should not replace plain old RDDs, since RDDs allow for more flexibility, and SQL etc. is not even usable on our data while it is in an RDD. All those nice DataFrame APIs aren't usable until the data is structured, which is the core problem anyway.
>>>
>>> Sent from my Verizon Wireless 4G LTE smartphone
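To make that concrete, here is a minimal PySpark sketch of the workflow Darren describes: keep the raw scrape in a plain RDD, and promote it to a DataFrame only once fields have been extracted. The input path, the tab-delimited layout, and the extract helper are all hypothetical:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="scrape-to-df")
    sqlContext = SQLContext(sc)

    # Unstructured stage: raw scraped documents, one per record, no schema
    raw = sc.textFile("hdfs:///scraped/pages/*")

    def extract(doc):
        # Stand-in for real parsing; return structured fields or None
        parts = doc.split("\t", 1)
        if len(parts) == 2:
            return Row(url=parts[0], text=parts[1], length=len(parts[1]))
        return None

    # Structured stage: once fields exist, the DataFrame/SQL APIs apply
    df = sqlContext.createDataFrame(
        raw.map(extract).filter(lambda r: r is not None))
    df.registerTempTable("pages")
    sqlContext.sql("SELECT url, length FROM pages WHERE length > 1000").show()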
>>> -------- Original message --------
>>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>>> Date: 03/02/2016 5:43 PM (GMT-05:00)
>>> To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>> Cc: user@spark.apache.org
>>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of features
>>>
>>> Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22>, which help with type-/schema-inference.
>>>
>>> All of these paths help you get to a DataFrame very quickly.
>>>
>>> Nick
>>>
>>> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>>>
>>>> DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD?
>>>>
>>>> For us, almost all the processing comes before there is structure to it.
>>>>
>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>
>>>> -------- Original message --------
>>>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>>>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>>>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of features
>>>>
>>>> > However, I believe, investing (or having some members of your group) learn and invest in Scala is worthwhile for few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in).
>>>>
>>>> The more your workload uses DataFrames, the less of a difference there will be between the languages (Scala, Java, Python, or R) in terms of performance.
>>>>
>>>> One of the main benefits of Catalyst (which DataFrames enable) is that it automatically optimizes DataFrame operations, letting you focus on _what_ you want while Spark takes care of figuring out _how_.
>>>>
>>>> Tungsten takes things further by tightly managing memory using the type information made available to it via DataFrames. This benefit comes into play regardless of the language used.
>>>>
>>>> So in short, DataFrames are the "new RDD" -- i.e. the new base structure you should be using in your Spark programs wherever possible. And with DataFrames, what language you use matters much less in terms of performance.
>>>>
>>>> Nick
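To illustrate Nick's point, a toy PySpark comparison (the data is made up). Both filters compute the same count, but the first ships a pickled Python lambda to separate Python worker processes, while the second is a Column expression that Catalyst optimizes and executes inside the JVM, which is why the language you write it in matters much less:

    from pyspark.sql import Row

    people = sqlContext.createDataFrame(
        [Row(name="alice", age=25), Row(name="bob", age=41),
         Row(name="carol", age=62)])

    # RDD path: the lambda runs in Python worker processes
    n1 = people.rdd.filter(lambda r: r.age > 30).count()

    # DataFrame path: a Column expression, optimized by Catalyst in the JVM
    n2 = people.filter(people.age > 30).count()

    people.filter(people.age > 30).explain()  # inspect the optimized plan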
>>>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>>>
>>>>> Hello Joshua,
>>>>>
>>>>> Comments are inline...
>>>>>
>>>>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>>>>
>>>>> I haven't used Spark in the last year and a half. I am about to start a project with a new team, and we need to decide whether to use pyspark or Scala.
>>>>>
>>>>> Indeed, good questions, and they come up a lot in the trainings I have attended, where this inevitable question is raised. I believe it depends on your level of comfort, or your appetite for adventure into newer things.
>>>>>
>>>>> It is true, for the most part, that the Apache Spark committers have been committed to keeping the APIs at parity across all the language offerings, even though in some cases, in particular Python, they have lagged by a minor release. The extent to which they're committed to parity is a good sign. It might not be the case with some experimental APIs, where Python lags behind, but for the most part they have been admirably consistent.
>>>>>
>>>>> With Python there's a minor performance hit, since there's an extra level of indirection in the architecture and an additional Python process that the executors launch to execute your pickled Python lambdas. Other than that, it boils down to your comfort zone. I recommend looking at Sameer's slides from the Advanced Spark for DevOps training, where he walks through the PySpark and Python architecture.
>>>>>
>>>>> We are NOT a Java shop. So some of the build tools/procedures will require some learning overhead if we go the Scala route. What I want to know is: is the Scala version of Spark still far enough ahead of pyspark to be well worth any initial training overhead?
>>>>>
>>>>> If you are a very advanced Python shop, and you have in-house libraries written in Python that don't exist in Scala, or ML libs that don't exist in the Scala version and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.
>>>>>
>>>>> However, I believe investing in Scala (or having some members of your group learn it) is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in). Two, since Spark is written in Scala, it gives you an enormous advantage to be able to read the sources (which are well documented and highly readable) should you have to consult or learn the nuances of a certain API method or action not covered comprehensively in the docs. And finally, there's a long-term benefit in learning Scala for reasons other than Spark, for example writing other scalable and distributed applications.
>>>>>
>>>>> Particularly, we will be using Spark Streaming. I know a couple of years ago that practically forced the decision to use Scala. Is this still the case?
>>>>>
>>>>> You'll notice that certain API calls are not available, at least for now, in Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>>>
>>>>> Cheers,
>>>>> Jules
>>>>>
>>>>> --
>>>>> The Best Ideas Are Simple
>>>>> Jules S. Damji
>>>>> e-mail: dmat...@comcast.net
>>>>> e-mail: jules.da...@gmail.com

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com