Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22>, which help with type/schema inference.
All of these paths help you get to a DataFrame very quickly.

Nick

On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:

> Dataframes are essentially structured tables with schemas. So where does
> the non-typed data sit before it becomes structured, if not in a
> traditional RDD?
>
> For us, almost all the processing comes before there is structure to it.
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Nicholas Chammas <nicholas.cham...@gmail.com>
> Date: 03/02/2016 5:13 PM (GMT-05:00)
> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of features
>
> However, I believe, investing (or having some members of your group)
> learn and invest in Scala is worthwhile for a few reasons. One, you will
> get the performance gain, especially now with Tungsten (not sure how it
> relates to Python, but some other knowledgeable people on the list,
> please chime in).
>
> The more your workload uses DataFrames, the less of a difference there
> will be between the languages (Scala, Java, Python, or R) in terms of
> performance.
>
> One of the main benefits of Catalyst (which DFs enable) is that it
> automatically optimizes DataFrame operations, letting you focus on _what_
> you want while Spark takes care of figuring out _how_.
>
> Tungsten takes things further by tightly managing memory using the type
> information made available to it via DataFrames. This benefit comes into
> play regardless of the language used.
>
> So in short, DataFrames are the "new RDD"--i.e. the new base structure
> you should be using in your Spark programs wherever possible. And with
> DataFrames, the language you use matters much less in terms of
> performance.
>
> Nick
>
> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>
>> Hello Joshua,
>>
>> comments are inline...
>>
>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>
>> I haven't used Spark in the last year and a half. I am about to start a
>> project with a new team, and we need to decide whether to use pyspark or
>> Scala.
>>
>> Indeed, good questions, and they do come up a lot in trainings that I
>> have attended, where this inevitable question is raised. I believe it
>> depends on your level of comfort, or your appetite for venturing into
>> newer things.
>>
>> It is true, for the most part, that the Apache Spark committers have
>> been committed to keeping the APIs at parity across all the language
>> offerings, even though in some cases, in particular Python, they have
>> lagged by a minor release. The extent to which they’re committed to
>> parity is a good sign. It may not be the case with some experimental
>> APIs, where they lag behind, but for the most part they have been
>> admirably consistent.
>>
>> With Python there’s a minor performance hit, since there’s an extra
>> level of indirection in the architecture and an additional Python
>> process that the executors launch to execute your pickled Python
>> lambdas. Other than that, it boils down to your comfort zone. I
>> recommend looking at Sameer’s slides (Advanced Spark for DevOps
>> Training), where he walks through the PySpark and Python architecture.
>>
>> We are NOT a Java shop. So some of the build tools/procedures will
>> require some learning overhead if we go the Scala route. What I want to
>> know is: is the Scala version of Spark still far enough ahead of pyspark
>> to be well worth any initial training overhead?
>>
>> If you are a very advanced Python shop, and if you have in-house
>> libraries written in Python that don’t exist in Scala, or some ML libs
>> that don’t exist in the Scala version, and the porting effort would be
>> large, then perhaps it makes sense to stay put with Python.
>>
>> However, I believe investing in Scala (or having some members of your
>> group learn and invest in it) is worthwhile for a few reasons. One, you
>> will get the performance gain, especially now with Tungsten (not sure
>> how it relates to Python, but some other knowledgeable people on the
>> list, please chime in). Two, since Spark is written in Scala, it gives
>> you an enormous advantage to be able to read the sources (which are well
>> documented and highly readable) should you have to consult or learn the
>> nuances of a certain API method or action not covered comprehensively in
>> the docs. And finally, there’s a long-term benefit in learning Scala for
>> reasons other than Spark--for example, writing other scalable and
>> distributed applications.
>>
>> Particularly, we will be using Spark Streaming. I know a couple of years
>> ago that practically forced the decision to use Scala. Is this still the
>> case?
>>
>> You’ll notice that certain API calls are not available, at least for
>> now, in Python:
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>
>> Cheers,
>> Jules
>>
>> --
>> The Best Ideas Are Simple
>> Jules S. Damji
>> e-mail: dmat...@comcast.net
>> e-mail: jules.da...@gmail.com