Evan articulated it well.
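To make the comparison concrete, here is a rough sketch of the two styles Evan describes below. The table names, columns and data are made up purely for illustration, and it assumes the 1.2-era SchemaRDD API (registerTempTable via the createSchemaRDD implicit) with a SparkContext named sc as spark-shell provides; with 1.3 the same thing goes through toDF() and DataFrames, but the comparison is the same.

  import org.apache.spark.SparkContext._   // pair-RDD functions (implicit by default from 1.3 on)
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD        // RDD[case class] -> SchemaRDD (1.2 API; use toDF() in 1.3)

  // Made-up record types and data, just to have three tables to join.
  case class Click(userId: Int, url: String)
  case class Purchase(userId: Int, amount: Double)
  case class Profile(userId: Int, country: String)

  val clicks    = sc.parallelize(Seq(Click(1, "/a"), Click(2, "/b")))
  val purchases = sc.parallelize(Seq(Purchase(1, 9.99)))
  val profiles  = sc.parallelize(Seq(Profile(1, "US"), Profile(2, "DE")))

  // Plain RDD style: key each RDD by userId and chain the joins, a.join(b).join(c).
  val chained = clicks.map(c => (c.userId, c))
    .join(purchases.map(p => (p.userId, p)))
    .join(profiles.map(p => (p.userId, p)))   // RDD[(Int, ((Click, Purchase), Profile))]

  // Spark SQL style: register the same RDDs as tables and let the optimizer
  // order the joins and push the WHERE predicate below them.
  clicks.registerTempTable("clicks")
  purchases.registerTempTable("purchases")
  profiles.registerTempTable("profiles")

  val joined = sqlContext.sql("""
    SELECT c.userId, c.url, p.amount, f.country
    FROM clicks c
    JOIN purchases p ON c.userId = p.userId
    JOIN profiles  f ON c.userId = f.userId
    WHERE f.country = 'US'
  """)

  joined.collect().foreach(println)

With the RDD version you have committed to a particular join order and to the shuffle-based join up front; in the SQL version the optimizer is free to push the country filter below the joins, reorder them, and (where it knows the table sizes) switch to a broadcast join - exactly the kinds of decisions Evan mentions.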
On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> Well, you can always join as many RDDs as you want by chaining them
> together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands
> of RDDs this way, but 10 is probably doable.
>
> That said - Spark SQL has an optimizer under the covers that can make
> clever decisions, e.g. pushing the predicates in the WHERE clause down to
> the base data (even to external data sources, if you have them), ordering
> joins, and choosing between join implementations (like using broadcast
> joins instead of the default shuffle-based hash join in RDD.join). These
> decisions can make your queries run orders of magnitude faster than they
> would if you implemented them using basic RDD transformations. The best
> part is that, at this stage, I'd expect the optimizer to keep improving -
> meaning many of your queries will get faster with each new release.
>
> I'm sure the Spark SQL devs can enumerate many other benefits - but as
> soon as you're working with multiple tables and doing fairly textbook SQL
> stuff, you likely want the engine figuring this out for you rather than
> hand-coding it yourself. That said - with Spark you can always drop back
> to plain old RDDs and use map/reduce/filter/cogroup, etc. when you need to.
>
> On Thu, Feb 12, 2015 at 8:56 AM, vha14 <vh...@msn.com> wrote:
> >
> > My team is building a batch data processing pipeline using the Spark API
> > and trying to understand whether Spark SQL can help us. Here is what we
> > have found so far:
> >
> > - SQL's declarative style may be more readable in some cases (e.g.
> > joining more than two RDDs), although some devs prefer the fluent style
> > regardless.
> > - Cogrouping of more than 4 RDDs is not supported, and it's not clear
> > whether Spark SQL supports joining an arbitrary number of RDDs.
> > - It seems that Spark SQL features such as optimization based on
> > predicate pushdown and dynamic schema inference are less applicable in a
> > batch environment.
> >
> > Your inputs/suggestions are most welcome!
> >
> > Thanks,
> > Vu Ha
> > CTO, Semantic Scholar
> > http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work
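P.S. On the cogroup limit from the original post: a single cogroup call indeed takes at most three other RDDs (four inputs in total), but, much like the chained joins above, additional RDDs can be folded in with further cogroup calls, at the cost of one extra level of nesting in the value type per call. A rough sketch, with made-up RDDs and again assuming spark-shell's sc:

  import org.apache.spark.SparkContext._  // pair-RDD functions (implicit by default from 1.3 on)

  // Five small keyed RDDs; keys and values are made up purely for illustration.
  val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
  val b = sc.parallelize(Seq(1 -> "b1"))
  val c = sc.parallelize(Seq(2 -> "c1"))
  val d = sc.parallelize(Seq(1 -> "d1"))
  val e = sc.parallelize(Seq(1 -> "e1", 2 -> "e2"))

  // One cogroup call covers at most four inputs...
  val firstFour = a.cogroup(b, c, d)
  // ...so the fifth RDD has to be folded in with another cogroup (or a join),
  // which nests the value type one level deeper.
  val allFive = firstFour.cogroup(e)

  allFive.collect().foreach(println)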