Evan articulated it well.
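To make the comparison concrete, here is a rough sketch of the two styles Evan describes below. The table names, columns and data are made up purely for illustration, and it assumes the 1.2-era SchemaRDD API (registerTempTable via the createSchemaRDD implicit) with a SparkContext named sc as spark-shell provides; with 1.3 the same thing goes through toDF() and DataFrames, but the comparison is the same.

  import org.apache.spark.SparkContext._   // pair-RDD functions (implicit by default from 1.3 on)
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD        // RDD[case class] -> SchemaRDD (1.2 API; use toDF() in 1.3)

  // Made-up record types and data, just to have three tables to join.
  case class Click(userId: Int, url: String)
  case class Purchase(userId: Int, amount: Double)
  case class Profile(userId: Int, country: String)

  val clicks    = sc.parallelize(Seq(Click(1, "/a"), Click(2, "/b")))
  val purchases = sc.parallelize(Seq(Purchase(1, 9.99)))
  val profiles  = sc.parallelize(Seq(Profile(1, "US"), Profile(2, "DE")))

  // Plain RDD style: key each RDD by userId and chain the joins, a.join(b).join(c).
  val chained = clicks.map(c => (c.userId, c))
    .join(purchases.map(p => (p.userId, p)))
    .join(profiles.map(p => (p.userId, p)))   // RDD[(Int, ((Click, Purchase), Profile))]

  // Spark SQL style: register the same RDDs as tables and let the optimizer
  // order the joins and push the WHERE predicate below them.
  clicks.registerTempTable("clicks")
  purchases.registerTempTable("purchases")
  profiles.registerTempTable("profiles")

  val joined = sqlContext.sql("""
    SELECT c.userId, c.url, p.amount, f.country
    FROM clicks c
    JOIN purchases p ON c.userId = p.userId
    JOIN profiles  f ON c.userId = f.userId
    WHERE f.country = 'US'
  """)

  joined.collect().foreach(println)

With the RDD version you have committed to a particular join order and to the shuffle-based join up front; in the SQL version the optimizer is free to push the country filter below the joins, reorder them, and (where it knows the table sizes) switch to a broadcast join - exactly the kinds of decisions Evan mentions.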
On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> Well, you can always join as many RDDs as you want by chaining them
> together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands
> of RDDs this way, but 10 is probably doable.
>
> That said - Spark SQL has an optimizer under the covers that can make
> clever decisions, e.g. pushing the predicates in the WHERE clause down to
> the base data (even to external data sources, if you have them), ordering
> joins, and choosing between join implementations (like using broadcast
> joins instead of the default shuffle-based hash join in RDD.join). These
> decisions can make your queries run orders of magnitude faster than they
> would if you implemented them using basic RDD transformations. The best
> part is that, at this stage, I'd expect the optimizer to keep improving -
> meaning many of your queries will get faster with each new release.
>
> I'm sure the Spark SQL devs can enumerate many other benefits - but as
> soon as you're working with multiple tables and doing fairly textbook SQL
> stuff, you likely want the engine figuring this out for you rather than
> hand-coding it yourself. That said - with Spark you can always drop back
> to plain old RDDs and use map/reduce/filter/cogroup, etc. when you need to.
>
> On Thu, Feb 12, 2015 at 8:56 AM, vha14 <vh...@msn.com> wrote:
> >
> > My team is building a batch data processing pipeline using the Spark API
> > and trying to understand whether Spark SQL can help us. Here is what we
> > have found so far:
> >
> > - SQL's declarative style may be more readable in some cases (e.g.
> > joining more than two RDDs), although some devs prefer the fluent style
> > regardless.
> > - Cogrouping of more than 4 RDDs is not supported, and it's not clear
> > whether Spark SQL supports joining an arbitrary number of RDDs.
> > - It seems that Spark SQL features such as optimization based on
> > predicate pushdown and dynamic schema inference are less applicable in a
> > batch environment.
> >
> > Your inputs/suggestions are most welcome!
> >
> > Thanks,
> > Vu Ha
> > CTO, Semantic Scholar
> > http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work
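P.S. On the cogroup limit from the original post: a single cogroup call indeed takes at most three other RDDs (four inputs in total), but, much like the chained joins above, additional RDDs can be folded in with further cogroup calls, at the cost of one extra level of nesting in the value type per call. A rough sketch, with made-up RDDs and again assuming spark-shell's sc:

  import org.apache.spark.SparkContext._  // pair-RDD functions (implicit by default from 1.3 on)

  // Five small keyed RDDs; keys and values are made up purely for illustration.
  val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
  val b = sc.parallelize(Seq(1 -> "b1"))
  val c = sc.parallelize(Seq(2 -> "c1"))
  val d = sc.parallelize(Seq(1 -> "d1"))
  val e = sc.parallelize(Seq(1 -> "e1", 2 -> "e2"))

  // One cogroup call covers at most four inputs...
  val firstFour = a.cogroup(b, c, d)
  // ...so the fifth RDD has to be folded in with another cogroup (or a join),
  // which nests the value type one level deeper.
  val allFive = firstFour.cogroup(e)

  allFive.collect().foreach(println)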