Great! Let me know if you need any assistance and/or intermediate review. -Val
On Thu, Nov 30, 2017 at 12:05 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Valentin,
>
> > Can you please create a separate ticket for the strategy implementation
> > then?
>
> Done: https://issues.apache.org/jira/browse/IGNITE-7077
>
> > Any idea on how long it will take?
>
> I think it will take 2-4 weeks to implement such a strategy.
> I will try my best to have a ready-to-review PR before the end of the year.
>
> 30.11.2017 02:13, Valentin Kulichenko wrote:
>
>> Nikolay,
>>
>> Can you please create a separate ticket for the strategy implementation
>> then? Any idea on how long it will take?
>>
>> As for querying a partition, both SqlQuery and SqlFieldsQuery allow you
>> to specify the set of partitions to work with (see the setPartitions
>> method). I think that should be enough.
>>
>> -Val
>>
>> On Wed, Nov 29, 2017 at 3:39 AM, Vladimir Ozerov <voze...@gridgain.com>
>> wrote:
>>
>>> Hi Nikolay,
>>>
>>> No, it is not possible to get this info from the public API, nor do we
>>> plan to expose it. See IGNITE-4509 and commit *fbf0e353* to get a
>>> better understanding of how this was implemented.
>>>
>>> Vladimir.
>>>
>>> On Wed, Nov 29, 2017 at 2:01 PM, Николай Ижиков <nizhikov....@gmail.com>
>>> wrote:
>>>
>>>> Hello, Vladimir.
>>>>
>>>>> partition pruning is already implemented in Ignite, so there is no
>>>>> need to do this on your own.
>>>>
>>>> Spark works with a partitioned data set.
>>>> It is required to provide data partition information to Spark from a
>>>> custom Data Source (Ignite).
>>>>
>>>> Can I get information about pruned partitions through some public API?
>>>> Is there a plan or ticket to implement such an API?
>>>>
>>>> 2017-11-29 10:34 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>
>>>>> Nikolay,
>>>>>
>>>>> Regarding p3 - partition pruning is already implemented in Ignite, so
>>>>> there is no need to do this on your own.
>>>>>
>>>>> On Wed, Nov 29, 2017 at 3:23 AM, Valentin Kulichenko <
>>>>> valentin.kuliche...@gmail.com> wrote:
>>>>>
>>>>>> Nikolay,
>>>>>>
>>>>>> A custom strategy allows us to fully process the AST generated by
>>>>>> Spark and convert it to Ignite SQL, so there will be no execution on
>>>>>> the Spark side at all. This is what we are trying to achieve here.
>>>>>> Basically, one will be able to use the DataFrame API to execute
>>>>>> queries directly on Ignite. Does it make sense to you?
>>>>>>
>>>>>> I would recommend you take a look at the MemSQL implementation,
>>>>>> which does similar stuff:
>>>>>> https://github.com/memsql/memsql-spark-connector
>>>>>>
>>>>>> Note that this approach will work only if all relations included in
>>>>>> the AST are Ignite tables. Otherwise, the strategy should return
>>>>>> null so that Spark falls back to its regular mode. Ignite will be
>>>>>> used as a regular data source in this case, and it's probably
>>>>>> possible to implement some optimizations here as well. However, I
>>>>>> never investigated this, and it seems like another separate
>>>>>> discussion.
>>>>>>
>>>>>> -Val
>>>>>>
>>>>>> On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <
>>>>>> nizhikov....@gmail.com> wrote:
>>>>>>
>>>>>>> Hello, guys.
>>>>>>>
>>>>>>> I have implemented basic support of the Spark Data Frame API [1],
>>>>>>> [2] for Ignite.
>>>>>>> Spark provides an API for a custom strategy to optimize queries
>>>>>>> from Spark to the underlying data source (Ignite).
>>>>>>>
>>>>>>> The goal of the optimization (obvious, just to be on the same page):
>>>>>>> minimize data transfer between Spark and Ignite, and speed up query
>>>>>>> execution.
>>>>>>>
>>>>>>> I see 3 ways to optimize queries:
>>>>>>>
>>>>>>> 1. *Join Reduce* If one makes a query that joins two or more Ignite
>>>>>>> tables, we have to pass all the join info to Ignite and transfer to
>>>>>>> Spark only the result of the table join.
>>>>>>> To implement it, we have to extend the current implementation with
>>>>>>> a new RelationProvider that can generate all kinds of joins for two
>>>>>>> or more tables.
>>>>>>> We should add some tests, also.
>>>>>>> The question is - how should the join result be partitioned?
>>>>>>>
>>>>>>> 2. *Order by* If one makes a query to an Ignite table with an order
>>>>>>> by clause, we can execute the sorting on the Ignite side.
>>>>>>> But it seems that currently Spark doesn't have any way to be told
>>>>>>> that partitions are already sorted.
>>>>>>>
>>>>>>> 3. *Key filter* If one makes a query with `WHERE key = XXX` or
>>>>>>> `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions
>>>>>>> and query only the partitions that store the given key values.
>>>>>>> Is this kind of optimization already built into Ignite, or should I
>>>>>>> implement it myself?
>>>>>>>
>>>>>>> Maybe there is another way to make queries run faster?
>>>>>>>
>>>>>>> [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>
>>>> --
>>>> Nikolay Izhikov
>>>> nizhikov....@gmail.com
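
For reference, a minimal sketch of the custom strategy Val describes above (in Scala, an empty Seq rather than null is what signals fallback to Spark's regular planning). The `IgniteSqlRelation` marker trait and the `toIgniteScan` translation step are placeholders for illustration, not the API from the PR:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Hypothetical marker for relations produced by the Ignite data source;
// the real type would come from the RelationProvider in the PR.
trait IgniteSqlRelation

object IgniteStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] =
    if (isPureIgnitePlan(plan))
      Seq(toIgniteScan(plan)) // compile the whole AST into one Ignite SQL query
    else
      Nil // mixed plan: an empty result makes Spark fall back to its own strategies

  // The plan qualifies only if every leaf relation is backed by Ignite.
  private def isPureIgnitePlan(plan: LogicalPlan): Boolean =
    plan.collectLeaves().forall {
      case lr: LogicalRelation => lr.relation.isInstanceOf[IgniteSqlRelation]
      case _                   => false
    }

  // Left unimplemented in this sketch: translate the Catalyst plan into
  // Ignite SQL and wrap the result cursor in a physical scan node.
  private def toIgniteScan(plan: LogicalPlan): SparkPlan = ???
}
```

Registering it on the driver is a one-liner, e.g. `spark.experimental.extraStrategies = IgniteStrategy :: Nil`.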
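
To make the *Join Reduce* idea (point 1) concrete, here is what a user-facing query could look like once the strategy is in place. The data source format name and options below are assumptions based on the PR, not a settled API:

```scala
import org.apache.spark.sql.SparkSession

object JoinReduceExample extends App {
  val spark = SparkSession.builder().appName("join-reduce").master("local").getOrCreate()

  def igniteTable(name: String) = spark.read
    .format("ignite")                       // hypothetical data source name
    .option("config", "ignite-config.xml")  // hypothetical option
    .option("table", name)
    .load()

  val person = igniteTable("PERSON")
  val city   = igniteTable("CITY")

  // With the custom strategy in place, this whole plan should compile to a
  // single Ignite query, e.g.:
  //   SELECT p.NAME, c.NAME FROM PERSON p JOIN CITY c ON p.CITY_ID = c.ID
  // so only the joined rows travel back to Spark.
  person.join(city, person("CITY_ID") === city("ID"))
    .select(person("NAME"), city("NAME"))
    .show()
}
```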
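
And a sketch of the *Key filter* idea (point 3) expressed through the setPartitions method Val mentions: map each key to its partition with the affinity function, then restrict the query to exactly those partitions. Cache and table names are made up, and this assumes a node is already running in the JVM:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

import scala.collection.JavaConverters._

object KeyFilterExample extends App {
  // Assumes a started node and an SQL-enabled cache "personCache"
  // holding Person rows keyed by an integer.
  val ignite = Ignition.ignite()

  // WHERE _key IN (1, 2, 3): derive the owning partition of each key.
  val keys  = Seq(1, 2, 3)
  val aff   = ignite.affinity[Int]("personCache")
  val parts = keys.map(k => aff.partition(k)).distinct.sorted

  // Query only the partitions that can contain the requested keys.
  val qry = new SqlFieldsQuery("SELECT name FROM Person WHERE _key IN (?, ?, ?)")
    .setArgs(keys.map(Int.box): _*)
    .setPartitions(parts: _*)

  val rows = ignite.cache[Int, AnyRef]("personCache").query(qry).getAll.asScala
  rows.foreach(r => println(r.asScala.mkString(", ")))
}
```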