Hello, guys.

I have implemented basic support for the Spark Data Frame API [1], [2] in Ignite. Spark provides an API for custom strategies that optimize queries from Spark to the underlying data source (Ignite).

The goals of the optimization (obvious, but just to be on the same page): minimize data transfer between Spark and Ignite, and speed up query execution.

I see 3 ways to optimize queries:

1. *Join Reduce* If a query joins two or more Ignite tables, we have to pass all the join information to Ignite and transfer only the result of the join to Spark. To implement this, we have to extend the current implementation with a new RelationProvider that can generate all kinds of joins for two or more tables. We should also add some tests. The question is: how should the join result be partitioned?

2. *Order by* If a query to an Ignite table has an ORDER BY clause, we can execute the sorting on the Ignite side. But it seems that Spark currently doesn't have any way to tell that partitions are already sorted.

3. *Key filter* If a query contains `WHERE key = XXX` or `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions and query only the partitions that store the given key values. Is this kind of optimization already built into Ignite, or should I implement it myself?

Maybe there is some other way to make queries run faster?

[1] https://spark.apache.org/docs/latest/sql-programming-guide.html
[2] https://github.com/apache/ignite/pull/2742
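
To make the *Key filter* idea a bit more concrete, here is a minimal, self-contained sketch of the pruning step. It is not the actual implementation: `KeyFilterPruning`, `partitionOf`, and `partitionsFor` are hypothetical names, the partition count of 1024 is only an illustrative number, and the hash-modulo mapping is a stand-in assumption for whatever key-to-partition mapping Ignite's affinity function really uses.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

public class KeyFilterPruning {
    // Hypothetical stand-in for the real key-to-partition mapping.
    // Assumption: a simple non-negative hash modulo; Ignite's affinity
    // function may map keys differently.
    static int partitionOf(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Collects the distinct partitions that can contain any of the keys
    // taken from a `WHERE key = X` or `WHERE key IN (X, Y, Z)` filter.
    static SortedSet<Integer> partitionsFor(Collection<?> keys, int partitions) {
        SortedSet<Integer> result = new TreeSet<>();
        for (Object key : keys) {
            result.add(partitionOf(key, partitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // WHERE key IN (1, 2, 1025) against 1024 partitions:
        // keys 1 and 1025 land in the same partition under this mapping.
        SortedSet<Integer> toScan = partitionsFor(Arrays.asList(1, 2, 1025), 1024);
        System.out.println("Partitions to query: " + toScan); // prints "Partitions to query: [1, 2]"
    }
}
```

With something like this, the Spark side would only need to create partitions for the returned set instead of one per Ignite partition. The same approach only works if the filter is on the key itself, since other columns say nothing about where a row is stored.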