Great! Let me know if you need any assistance and/or intermediate review. -Val
On Thu, Nov 30, 2017 at 12:05 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Valentin,
>
> > Can you please create a separate ticket for the strategy implementation
> > then?
>
> Done: https://issues.apache.org/jira/browse/IGNITE-7077
>
> > Any idea on how long it will take?
>
> I think it will take 2-4 weeks to implement such a strategy.
> I will try my best to have a ready-to-review PR before the end of the year.
>
> 30.11.2017 02:13, Valentin Kulichenko wrote:
>
>> Nikolay,
>>
>> Can you please create a separate ticket for the strategy implementation
>> then? Any idea on how long it will take?
>>
>> As for querying a partition, both SqlQuery and SqlFieldsQuery allow you
>> to specify the set of partitions to work with (see the setPartitions
>> method). I think that should be enough.
>>
>> -Val
>>
>> On Wed, Nov 29, 2017 at 3:39 AM, Vladimir Ozerov <voze...@gridgain.com>
>> wrote:
>>
>>> Hi Nikolay,
>>>
>>> No, it is not possible to get this info from the public API, nor do we
>>> plan to expose it. See IGNITE-4509 and commit *fbf0e353* to get a
>>> better understanding of how this was implemented.
>>>
>>> Vladimir.
>>>
>>> On Wed, Nov 29, 2017 at 2:01 PM, Николай Ижиков <nizhikov....@gmail.com>
>>> wrote:
>>>
>>>> Hello, Vladimir.
>>>>
>>>>> partition pruning is already implemented in Ignite, so there is no
>>>>> need to do this on your own.
>>>>
>>>> Spark works with a partitioned data set.
>>>> It is required to provide data partition information to Spark from a
>>>> custom Data Source (Ignite).
>>>>
>>>> Can I get information about pruned partitions through some public API?
>>>> Is there a plan or ticket to implement such an API?
>>>>
>>>> 2017-11-29 10:34 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>
>>>>> Nikolay,
>>>>>
>>>>> Regarding p3 - partition pruning is already implemented in Ignite, so
>>>>> there is no need to do this on your own.
>>>>>
>>>>> On Wed, Nov 29, 2017 at 3:23 AM, Valentin Kulichenko <
>>>>> valentin.kuliche...@gmail.com> wrote:
>>>>>
>>>>>> Nikolay,
>>>>>>
>>>>>> A custom strategy allows us to fully process the AST generated by
>>>>>> Spark and convert it to Ignite SQL, so there will be no execution on
>>>>>> the Spark side at all. This is what we are trying to achieve here.
>>>>>> Basically, one will be able to use the DataFrame API to execute
>>>>>> queries directly on Ignite. Does it make sense to you?
>>>>>>
>>>>>> I would recommend you take a look at the MemSQL implementation,
>>>>>> which does similar stuff:
>>>>>> https://github.com/memsql/memsql-spark-connector
>>>>>>
>>>>>> Note that this approach will work only if all relations included in
>>>>>> the AST are Ignite tables. Otherwise, the strategy should return
>>>>>> null so that Spark falls back to its regular mode. Ignite will be
>>>>>> used as a regular data source in this case, and it's probably
>>>>>> possible to implement some optimizations here as well. However, I
>>>>>> never investigated this, and it seems like another separate
>>>>>> discussion.
>>>>>>
>>>>>> -Val
>>>>>>
>>>>>> On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <
>>>>>> nizhikov....@gmail.com> wrote:
>>>>>>
>>>>>>> Hello, guys.
>>>>>>>
>>>>>>> I have implemented basic support of the Spark Data Frame API [1],
>>>>>>> [2] for Ignite.
>>>>>>> Spark provides an API for a custom strategy to optimize queries
>>>>>>> from Spark to the underlying data source (Ignite).
>>>>>>>
>>>>>>> The goal of the optimization (obvious, just to be on the same page):
>>>>>>> minimize data transfer between Spark and Ignite, and speed up query
>>>>>>> execution.
>>>>>>>
>>>>>>> I see 3 ways to optimize queries:
>>>>>>>
>>>>>>> 1. *Join Reduce* If one makes a query that joins two or more Ignite
>>>>>>> tables, we have to pass all the join info to Ignite and transfer to
>>>>>>> Spark only the result of the table join.
>>>>>>> To implement it, we have to extend the current implementation with
>>>>>>> a new RelationProvider that can generate all kinds of joins for two
>>>>>>> or more tables.
>>>>>>> We should add some tests, also.
>>>>>>> The question is - how should the join result be partitioned?
>>>>>>>
>>>>>>> 2. *Order by* If one makes a query to an Ignite table with an order
>>>>>>> by clause, we can execute the sorting on the Ignite side.
>>>>>>> But it seems that currently Spark doesn't have any way to be told
>>>>>>> that partitions are already sorted.
>>>>>>>
>>>>>>> 3. *Key filter* If one makes a query with `WHERE key = XXX` or
>>>>>>> `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions
>>>>>>> and query only the partitions that store the given key values.
>>>>>>> Is this kind of optimization already built into Ignite, or should I
>>>>>>> implement it myself?
>>>>>>>
>>>>>>> Maybe there is another way to make queries run faster?
>>>>>>>
>>>>>>> [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>
>>>> --
>>>> Nikolay Izhikov
>>>> nizhikov....@gmail.com
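
For reference, a minimal sketch of the custom strategy Val describes above (in Scala, an empty Seq rather than null is what signals fallback to Spark's regular planning). The `IgniteSqlRelation` marker trait and the `toIgniteScan` translation step are placeholders for illustration, not the API from the PR:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Hypothetical marker for relations produced by the Ignite data source;
// the real type would come from the RelationProvider in the PR.
trait IgniteSqlRelation

object IgniteStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] =
    if (isPureIgnitePlan(plan))
      Seq(toIgniteScan(plan)) // compile the whole AST into one Ignite SQL query
    else
      Nil // mixed plan: an empty result makes Spark fall back to its own strategies

  // The plan qualifies only if every leaf relation is backed by Ignite.
  private def isPureIgnitePlan(plan: LogicalPlan): Boolean =
    plan.collectLeaves().forall {
      case lr: LogicalRelation => lr.relation.isInstanceOf[IgniteSqlRelation]
      case _                   => false
    }

  // Left unimplemented in this sketch: translate the Catalyst plan into
  // Ignite SQL and wrap the result cursor in a physical scan node.
  private def toIgniteScan(plan: LogicalPlan): SparkPlan = ???
}
```

Registering it on the driver is a one-liner, e.g. `spark.experimental.extraStrategies = IgniteStrategy :: Nil`.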
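
To make the *Join Reduce* idea (point 1) concrete, here is what a user-facing query could look like once the strategy is in place. The data source format name and options below are assumptions based on the PR, not a settled API:

```scala
import org.apache.spark.sql.SparkSession

object JoinReduceExample extends App {
  val spark = SparkSession.builder().appName("join-reduce").master("local").getOrCreate()

  def igniteTable(name: String) = spark.read
    .format("ignite")                       // hypothetical data source name
    .option("config", "ignite-config.xml")  // hypothetical option
    .option("table", name)
    .load()

  val person = igniteTable("PERSON")
  val city   = igniteTable("CITY")

  // With the custom strategy in place, this whole plan should compile to a
  // single Ignite query, e.g.:
  //   SELECT p.NAME, c.NAME FROM PERSON p JOIN CITY c ON p.CITY_ID = c.ID
  // so only the joined rows travel back to Spark.
  person.join(city, person("CITY_ID") === city("ID"))
    .select(person("NAME"), city("NAME"))
    .show()
}
```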
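
And a sketch of the *Key filter* idea (point 3) expressed through the setPartitions method Val mentions: map each key to its partition with the affinity function, then restrict the query to exactly those partitions. Cache and table names are made up, and this assumes a node is already running in the JVM:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

import scala.collection.JavaConverters._

object KeyFilterExample extends App {
  // Assumes a started node and an SQL-enabled cache "personCache"
  // holding Person rows keyed by an integer.
  val ignite = Ignition.ignite()

  // WHERE _key IN (1, 2, 3): derive the owning partition of each key.
  val keys  = Seq(1, 2, 3)
  val aff   = ignite.affinity[Int]("personCache")
  val parts = keys.map(k => aff.partition(k)).distinct.sorted

  // Query only the partitions that can contain the requested keys.
  val qry = new SqlFieldsQuery("SELECT name FROM Person WHERE _key IN (?, ?, ?)")
    .setArgs(keys.map(Int.box): _*)
    .setPartitions(parts: _*)

  val rows = ignite.cache[Int, AnyRef]("personCache").query(qry).getAll.asScala
  rows.foreach(r => println(r.asScala.mkString(", ")))
}
```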