+1 on all the pointers. @Darren - it would probably be a good idea to explain your scenario a little more in terms of structured vs. unstructured datasets. Then people here can give you better input on how you can use DataFrames.
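For instance, the usual path from unstructured data to a DataFrame looks roughly like this (a minimal PySpark sketch; the file path, delimiter, and field names are made up for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="unstructured-to-df")
    sqlContext = SQLContext(sc)

    # Unstructured stage: raw lines sit in a plain RDD.
    raw = sc.textFile("events.log")

    # Impose structure with ordinary Python code.
    def parse(line):
        ts, level, msg = line.split("\t", 2)  # assumes tab-delimited lines
        return Row(ts=ts, level=level, msg=msg)

    # Structured stage: promote the parsed rows to a DataFrame,
    # letting Spark infer the schema from the Row fields.
    df = sqlContext.createDataFrame(raw.map(parse))
    df.filter(df.level == "ERROR").count()

The untyped data sits in the RDD; the DataFrame only enters the picture once some schema has been imposed on it.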
On Thu, Mar 3, 2016 at 9:43 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Plenty of people get their data in Parquet, Avro, or ORC files; or from a
> database; or do their initial loading of un- or semi-structured data using
> one of the various data source libraries
> <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help
> with type-/schema-inference.
>
> All of these paths help you get to a DataFrame very quickly.
>
> Nick
>
> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>
>> DataFrames are essentially structured tables with schemas. So where does
>> the untyped data sit before it becomes structured, if not in a
>> traditional RDD?
>>
>> For us, almost all the processing comes before there is structure to it.
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>> -------- Original message --------
>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>> Cc: user@spark.apache.org
>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
>> features
>>
>> > However, I believe, investing (or having some members of your group)
>> > learn and invest in Scala is worthwhile for a few reasons. One, you
>> > will get the performance gain, especially now with Tungsten (not sure
>> > how it relates to Python, but some other knowledgeable people on the
>> > list, please chime in).
>>
>> The more your workload uses DataFrames, the less of a difference there
>> will be between the languages (Scala, Java, Python, or R) in terms of
>> performance.
>>
>> One of the main benefits of Catalyst (which DataFrames enable) is that
>> it automatically optimizes DataFrame operations, letting you focus on
>> _what_ you want while Spark takes care of figuring out _how_.
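>>
>> For a rough illustration (a sketch assuming the pyspark shell, where
>> sqlContext is predefined, and a hypothetical people.json with "name"
>> and "age" fields):
>>
>>     df = sqlContext.read.json("people.json")        # schema is inferred
>>     adults = df.filter(df.age > 21).select("name")  # declare *what* you want
>>     adults.explain(True)  # prints the plans Catalyst derives: parsed,
>>                           # analyzed, optimized, and physical (the _how_)
>>
>> The optimized plan is the same no matter which of the four languages the
>> query was written in.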
>>
>> Tungsten takes things further by tightly managing memory using the type
>> information made available to it via DataFrames. This benefit comes into
>> play regardless of the language used.
>>
>> So in short, DataFrames are the "new RDD", i.e. the new base structure
>> you should be using in your Spark programs wherever possible. And with
>> DataFrames, the language you use matters much less in terms of
>> performance.
>>
>> Nick
>>
>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>
>>> Hello Joshua,
>>>
>>> Comments are inline...
>>>
>>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>>
>>> I haven't used Spark in the last year and a half. I am about to start a
>>> project with a new team, and we need to decide whether to use pyspark
>>> or Scala.
>>>
>>> Indeed, good questions, and they come up a lot in the trainings I have
>>> attended, where this inevitable question is raised. I believe it
>>> depends on your level of comfort, or your appetite for newer things.
>>>
>>> It is true that, for the most part, the Apache Spark committers have
>>> been committed to keeping the APIs at parity across all the language
>>> offerings, even though in some cases, Python in particular, they have
>>> lagged by a minor release. The extent to which they are committed to
>>> parity is a good sign. That may not hold for some experimental APIs,
>>> which lag behind, but for the most part they have been admirably
>>> consistent.
>>>
>>> With Python there's a minor performance hit, since there's an extra
>>> level of indirection in the architecture: an additional Python process
>>> that the executors launch to execute your pickled Python lambdas. Other
>>> than that, it boils down to your comfort zone. I recommend looking at
>>> Sameer's slides (Advanced Spark for DevOps Training), where he walks
>>> through the PySpark architecture.
>>>
>>> We are NOT a Java shop, so some of the build tools/procedures will
>>> require some learning overhead if we go the Scala route. What I want to
>>> know is: is the Scala version of Spark still far enough ahead of
>>> pyspark to be well worth any initial training overhead?
>>>
>>> If you are a very advanced Python shop, and you have in-house libraries
>>> written in Python that don't exist in Scala, or some ML libs that don't
>>> exist in the Scala version, and porting them would take a fair amount
>>> of work because the gap is too large, then perhaps it makes sense to
>>> stay put with Python.
>>>
>>> However, I believe investing in Scala (or having some members of your
>>> group learn it) is worthwhile for a few reasons. One, you will get the
>>> performance gain, especially now with Tungsten (not sure how it relates
>>> to Python, but some other knowledgeable people on the list, please
>>> chime in). Two, since Spark is written in Scala, it gives you an
>>> enormous advantage to be able to read the sources (which are well
>>> documented and highly readable) should you have to consult or learn the
>>> nuances of a certain API method or action not covered comprehensively
>>> in the docs. And finally, there is a long-term benefit in learning
>>> Scala for reasons other than Spark, for example, writing other scalable
>>> and distributed applications.
>>>
>>> Particularly, we will be using Spark Streaming. I know a couple of
>>> years ago that practically forced the decision to use Scala. Is this
>>> still the case?
>>>
>>> You'll notice that certain API calls are not available, at least for
>>> now, in Python:
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>
>>> Cheers,
>>> Jules
>>>
>>> --
>>> The Best Ideas Are Simple
>>> Jules S. Damji
>>> e-mail: dmat...@comcast.net
>>> e-mail: jules.da...@gmail.com

--
Best Regards,
Ayan Guha