Valentin,

> process the AST generated by Spark and convert it to Ignite SQL...
> Does it make sense to you?
Yes, I think it is a great approach. Let's implement such a feature as the
second step of the Data Frame integration.

2017-11-29 3:23 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:

> Nikolay,
>
> A custom strategy allows us to fully process the AST generated by Spark
> and convert it to Ignite SQL, so there will be no execution on the Spark
> side at all. This is what we are trying to achieve here. Basically, one
> will be able to use the DataFrame API to execute queries directly on
> Ignite. Does it make sense to you?
>
> I would recommend taking a look at the MemSQL implementation, which does
> similar stuff: https://github.com/memsql/memsql-spark-connector
>
> Note that this approach will work only if all relations included in the
> AST are Ignite tables. Otherwise, the strategy should return null so that
> Spark falls back to its regular mode. In that case Ignite will be used as
> a regular data source, and it's probably possible to implement some
> optimizations there as well. However, I have never investigated this, and
> it seems like another, separate discussion.
>
> -Val
>
> On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <nizhikov....@gmail.com>
> wrote:
>
> > Hello, guys.
> >
> > I have implemented basic support of the Spark Data Frame API [1], [2]
> > for Ignite.
> > Spark provides an API for a custom strategy to optimize queries from
> > Spark to the underlying data source (Ignite).
> >
> > The goals of optimization (obvious, just to be on the same page):
> > minimize data transfer between Spark and Ignite, and speed up query
> > execution.
> >
> > I see 3 ways to optimize queries:
> >
> > 1. *Join Reduce*. If one makes a query that joins two or more Ignite
> > tables, we should pass all the join info to Ignite and transfer to
> > Spark only the result of the table join.
> > To implement this, we have to extend the current implementation with a
> > new RelationProvider that can generate all kinds of joins for two or
> > more tables. We should also add some tests.
> > The question is: how should the join result be partitioned?
> >
> > 2. *Order By*. If one makes a query to an Ignite table with an ORDER BY
> > clause, we can execute the sorting on the Ignite side.
> > But it seems that currently Spark has no way to be told that the
> > partitions are already sorted.
> >
> > 3. *Key Filter*. If one makes a query with `WHERE key = XXX` or
> > `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions and
> > query only the partitions that store those key values.
> > Is this kind of optimization already built into Ignite, or should I
> > implement it myself?
> >
> > Maybe there is some other way to make queries run faster?
> >
> > [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
> > [2] https://github.com/apache/ignite/pull/2742

--
Nikolay Izhikov
nizhikov....@gmail.com
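[Editor's note] Valentin's point about a custom strategy that claims the plan only when every relation is an Ignite table can be sketched roughly as below. This is a minimal, hypothetical Scala sketch against Spark's `Strategy` extension point; `IgniteSQLRelation` and `IgniteSQLExec` are illustrative names, not existing classes. Note that in Spark a strategy signals fall-back by returning an empty `Seq` (`Nil`) rather than a literal null:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

object IgniteStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    // Collect every base relation referenced anywhere in the logical plan.
    val relations = plan.collect { case r: LogicalRelation => r.relation }

    // Claim the plan only if all relations are backed by Ignite tables;
    // IgniteSQLRelation / IgniteSQLExec are hypothetical placeholders for
    // the connector's relation type and a physical node that translates
    // the whole AST into a single Ignite SQL query.
    if (relations.nonEmpty && relations.forall(_.isInstanceOf[IgniteSQLRelation]))
      IgniteSQLExec(plan) :: Nil
    else
      Nil // empty Seq => Spark falls back to its regular strategies
  }
}
```

Such a strategy would be registered via Spark's experimental hook, e.g. `spark.experimental.extraStrategies = Seq(IgniteStrategy)`, so it runs before the built-in planner strategies.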
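[Editor's note] Regarding item 3 (*Key Filter*): Ignite's public API does expose the pieces needed for this kind of partition pruning. A hedged sketch, assuming a cache and table both named `Person` with a `Long` key (illustrative names only): map each key from the `IN` list to its partition via the cache's affinity function, then restrict the query with `SqlFieldsQuery.setPartitions`:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

val ignite = Ignition.start()
val cache  = ignite.cache[Long, String]("Person")
val aff    = ignite.affinity[Long]("Person")

// WHERE id IN (1, 42, 100)
val keys  = Seq(1L, 42L, 100L)
// Partitions that can possibly hold those keys, per the affinity function.
val parts = keys.map(aff.partition).distinct

val qry = new SqlFieldsQuery("SELECT name FROM Person WHERE id IN (?, ?, ?)")
  .setArgs(keys.map(Long.box): _*)
  .setPartitions(parts: _*) // scan only the relevant partitions

val rows = cache.query(qry).getAll
```

Whether the connector should do this mapping itself or rely on Ignite's own query engine to prune partitions is exactly the open question raised in the thread.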