Hi Evan,

Index support is definitely something we would like to add, and it is
possible that adding support for your custom indexing solution would not be
too difficult.

We already push predicates into Hive table scan operators when the
predicates are over partition keys.  You can see an example of how we
collect filters and decide which can be pushed into the scan in the
HiveTableScan query planning strategy:
https://github.com/marmbrus/spark/blob/0ae86cfcba56b700d8e7bd869379f0c663b21c1e/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L56
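
For a concrete picture of what that strategy does, here is a minimal
standalone sketch of the same split, assuming a toy expression type
(Expr, Attr, Lit, Equals, and splitPredicates are hypothetical
stand-ins, not the real Catalyst classes):

  // Predicates that reference only partition keys can be handed to the
  // table scan for partition pruning; everything else must be evaluated
  // after the scan.
  sealed trait Expr { def references: Set[String] }
  case class Attr(name: String) extends Expr { def references = Set(name) }
  case class Lit(value: Any) extends Expr { def references = Set.empty[String] }
  case class Equals(left: Expr, right: Expr) extends Expr {
    def references = left.references ++ right.references
  }

  object PredicatePushdown {
    // Split filters into pushable (partition-key-only) and remaining ones.
    def splitPredicates(filters: Seq[Expr], partitionKeys: Set[String])
        : (Seq[Expr], Seq[Expr]) =
      filters.partition(_.references.subsetOf(partitionKeys))

    def main(args: Array[String]): Unit = {
      val filters = Seq(
        Equals(Attr("ds"), Lit("2014-03-24")),  // over the partition key "ds"
        Equals(Attr("userId"), Lit(42)))        // over a regular column
      val (pruning, remaining) = splitPredicates(filters, Set("ds"))
      println("pushed into scan:     " + pruning)
      println("evaluated after scan: " + remaining)
    }
  }

The real strategy does roughly the same partitioning on predicate
references before constructing the HiveTableScan operator, combining the
pruning predicates into a single conjunction.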

I'd like to know more about your indexing solution.  Is this something
publicly available?  One concern here is that the query planning code is
not considered a public API and so is likely to change quite a bit as we
improve the optimizer.  It's not currently something that we plan to expose
for external components to modify.

Michael


On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan <e...@ooyala.com> wrote:

> Hi Michael,
>
> Congrats, this is really neat!
>
> What thoughts do you have regarding adding indexing support and
> predicate pushdown to this SQL framework?  Right now we have custom
> bitmap indexing to speed up queries, so we're really curious about the
> architectural direction.
>
> -Evan
>
>
> On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
> <mich...@databricks.com> wrote:
> >>
> >> It would be great if there were any examples or use cases to look at.
> >>
> > There are examples in the Spark documentation.  Patrick posted an
> > updated copy here so people can see them before 1.0 is released:
> > http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
> >
> >> Does this feature have different use cases than Shark, or is it
> >> cleaner now that the Hive dependency is gone?
> >>
> > Depending on how you use this, there can still be a dependency on Hive
> > (by default there is not; see the above documentation for more details).
> > However, the dependency is on a stock version of Hive instead of one
> > modified by the AMPLab.  Furthermore, Spark SQL has its own optimizer,
> > instead of relying on the Hive optimizer.  Long term, this is going to
> > give us a lot more flexibility to optimize queries specifically for the
> > Spark execution engine.  We are actively porting over the best parts of
> > Shark (specifically the in-memory columnar representation).
> >
> > Shark still has some features that are missing in Spark SQL, including
> > SharkServer (and years of testing).  Once Spark SQL graduates from alpha
> > status, it'll likely become the new backend for Shark.
>
>
>
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com
>
