Re: Optimization of SQL queries from Spark Data Frame to Ignite

Valentin Kulichenko Tue, 28 Nov 2017 16:24:35 -0800

Nikolay,

Custom strategy allows to fully process the AST generated by Spark and
convert it to Ignite SQL, so there will be no execution on Spark side at
all. This is what we are trying to achieve here. Basically, one will be
able to use DataFrame API to execute queries directly on Ignite. Does it
make sense to you?


I would recommend you to take a look at MemSQL implementation which does
similar stuff: https://github.com/memsql/memsql-spark-connector

Note that this approach will work only if all relations included in AST are
Ignite tables. Otherwise, strategy should return null so that Spark falls
back to its regular mode. Ignite will be used as regular data source in
this case, and probably it's possible to implement some optimizations here
as well. However, I never investigated this and it seems like another
separate discussion.

-Val

On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <nizhikov....@gmail.com>
wrote:

> Hello, guys.
>
> I have implemented basic support of Spark Data Frame API [1], [2] for
> Ignite.
> Spark provides API for a custom strategy to optimize queries from spark to
> underlying data source(Ignite).
>
> The goal of optimization(obvious, just to be on the same page):
> Minimize data transfer between Spark and Ignite.
> Speedup query execution.
>
> I see 3 ways to optimize queries:
>
>         1. *Join Reduce* If one make some query that join two or more
> Ignite tables, we have to pass all join info to Ignite and transfer to
> Spark only result of table join.
>         To implement it we have to extend current implementation with new
> RelationProvider that can generate all kind of joins for two or more tables.
>         We should add some tests, also.
>         The question is - how join result should be partitioned?
>
>
>         2. *Order by* If one make some query to Ignite table with order by
> clause we can execute sorting on Ignite side.
>         But it seems that currently Spark doesn’t have any way to tell
> that partitions already sorted.
>
>
>         3. *Key filter* If one make query with `WHERE key = XXX` or `WHERE
> key IN (X, Y, Z)`, we can reduce number of partitions.
>         And query only partitions that store certain key values.
>         Is this kind of optimization already built in Ignite or I should
> implement it by myself?
>
> May be, there is any other way to make queries run faster?
>
> [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
> [2] https://github.com/apache/ignite/pull/2742
>

Re: Optimization of SQL queries from Spark Data Frame to Ignite

Reply via email to