Valentin,

> process the AST generated by Spark and convert it to Ignite SQL...
> Does it make sense to you?
Yes, I think it is a great approach. Let's implement such a feature as the
second step of the Data Frame integration.

2017-11-29 3:23 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:

> Nikolay,
>
> A custom strategy allows us to fully process the AST generated by Spark
> and convert it to Ignite SQL, so there will be no execution on the Spark
> side at all. This is what we are trying to achieve here. Basically, one
> will be able to use the DataFrame API to execute queries directly on
> Ignite. Does it make sense to you?
>
> I would recommend taking a look at the MemSQL implementation, which does
> similar stuff: https://github.com/memsql/memsql-spark-connector
>
> Note that this approach will work only if all relations included in the
> AST are Ignite tables. Otherwise, the strategy should return null so that
> Spark falls back to its regular mode. In that case Ignite will be used as
> a regular data source, and it's probably possible to implement some
> optimizations there as well. However, I have never investigated this, and
> it seems like another, separate discussion.
>
> -Val
>
> On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <nizhikov....@gmail.com>
> wrote:
>
> > Hello, guys.
> >
> > I have implemented basic support of the Spark Data Frame API [1], [2]
> > for Ignite.
> > Spark provides an API for a custom strategy to optimize queries from
> > Spark to the underlying data source (Ignite).
> >
> > The goals of optimization (obvious, just to be on the same page):
> > minimize data transfer between Spark and Ignite, and speed up query
> > execution.
> >
> > I see 3 ways to optimize queries:
> >
> > 1. *Join Reduce*. If one makes a query that joins two or more Ignite
> > tables, we should pass all the join info to Ignite and transfer to
> > Spark only the result of the table join.
> > To implement this, we have to extend the current implementation with a
> > new RelationProvider that can generate all kinds of joins for two or
> > more tables. We should also add some tests.
> > The question is: how should the join result be partitioned?
> >
> > 2. *Order By*. If one makes a query to an Ignite table with an ORDER BY
> > clause, we can execute the sorting on the Ignite side.
> > But it seems that currently Spark has no way to be told that the
> > partitions are already sorted.
> >
> > 3. *Key Filter*. If one makes a query with `WHERE key = XXX` or
> > `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions and
> > query only the partitions that store those key values.
> > Is this kind of optimization already built into Ignite, or should I
> > implement it myself?
> >
> > Maybe there is some other way to make queries run faster?
> >
> > [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
> > [2] https://github.com/apache/ignite/pull/2742

--
Nikolay Izhikov
nizhikov....@gmail.com
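[Editor's note] Valentin's point about a custom strategy that claims the plan only when every relation is an Ignite table can be sketched roughly as below. This is a minimal, hypothetical Scala sketch against Spark's `Strategy` extension point; `IgniteSQLRelation` and `IgniteSQLExec` are illustrative names, not existing classes. Note that in Spark a strategy signals fall-back by returning an empty `Seq` (`Nil`) rather than a literal null:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

object IgniteStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    // Collect every base relation referenced anywhere in the logical plan.
    val relations = plan.collect { case r: LogicalRelation => r.relation }

    // Claim the plan only if all relations are backed by Ignite tables;
    // IgniteSQLRelation / IgniteSQLExec are hypothetical placeholders for
    // the connector's relation type and a physical node that translates
    // the whole AST into a single Ignite SQL query.
    if (relations.nonEmpty && relations.forall(_.isInstanceOf[IgniteSQLRelation]))
      IgniteSQLExec(plan) :: Nil
    else
      Nil // empty Seq => Spark falls back to its regular strategies
  }
}
```

Such a strategy would be registered via Spark's experimental hook, e.g. `spark.experimental.extraStrategies = Seq(IgniteStrategy)`, so it runs before the built-in planner strategies.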
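[Editor's note] Regarding item 3 (*Key Filter*): Ignite's public API does expose the pieces needed for this kind of partition pruning. A hedged sketch, assuming a cache and table both named `Person` with a `Long` key (illustrative names only): map each key from the `IN` list to its partition via the cache's affinity function, then restrict the query with `SqlFieldsQuery.setPartitions`:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

val ignite = Ignition.start()
val cache  = ignite.cache[Long, String]("Person")
val aff    = ignite.affinity[Long]("Person")

// WHERE id IN (1, 42, 100)
val keys  = Seq(1L, 42L, 100L)
// Partitions that can possibly hold those keys, per the affinity function.
val parts = keys.map(aff.partition).distinct

val qry = new SqlFieldsQuery("SELECT name FROM Person WHERE id IN (?, ?, ?)")
  .setArgs(keys.map(Long.box): _*)
  .setPartitions(parts: _*) // scan only the relevant partitions

val rows = cache.query(qry).getAll
```

Whether the connector should do this mapping itself or rely on Ignite's own query engine to prune partitions is exactly the open question raised in the thread.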