Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-03 Thread Joshua Sorrell
Thank you, Jules, for your in-depth answer. And thanks, everyone else, for the additional info. This was very helpful. I think for the proof of concept, we'll go with pyspark for dev speed. Then we'll reevaluate from there. Any timeline for when GraphX will have Python support? On Wed, Mar 2, 2016

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to clarify, my comment earlier was this: So in short, DataFrames are the “new RDD”—i.e. the new base structure you should be using in your Spark programs wherever possible. RDDs are not going away, and clearly in your case

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point, and the RDD is transient. So my point is that DataFrames should not replace plain old RDDs, since RDDs allow for more flexibility and sql

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread ayan guha
+1 on all the pointers. @Darren - it would probably be a good idea to explain your scenario a little more in terms of structured vs. unstructured datasets. Then people here can give you better input on how you can use DF. On Thu, Mar 3, 2016 at 9:43 AM, Nicholas Chammas

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries which help with type-/schema-inference. All of

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD? For us, almost all the processing comes before there is structure to it.

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe that investing in Scala (or having some members of your group learn it) is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-01 Thread Jules Damji
Hello Joshua, comments are inline...
> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell wrote:
>
> I haven't used Spark in the last year and a half. I am about to start a project with a new team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions,

Does pyspark still lag far behind the Scala API in terms of features

2016-03-01 Thread Joshua Sorrell
I haven't used Spark in the last year and a half. I am about to start a project with a new team, and we need to decide whether to use pyspark or Scala. We are NOT a Java shop. So some of the build tools/procedures will require some learning overhead if we go the Scala route. What I want to know