Thanks, I tried the composite key and provided the index hint in the query, but the query plan and execution time stayed the same. So really I am back to where we started.
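For reference, a sketch of what this looked like (the index name cust_dt_idx is illustrative; the hint syntax is the one from the Ignite SQL performance docs):

    CREATE INDEX cust_dt_idx ON GAL2RU (customer_id, dt);

    SELECT COUNT(*) FROM (
      SELECT customer_id FROM GAL2RU USE INDEX(cust_dt_idx)
      WHERE dt > '2018-06-12'
      GROUP BY customer_id
      HAVING SUM(ru_total_app_sessions_count) > 2
        AND MAX(ru_total_web_sessions_count) < 1
    );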
Right now I suspect that I am not setting the affinity key properly: I intend the key to be customer_id, but maybe Ignite sets it to the _key (the composite of customer_id, dt). Is there a way to check it? Also, is there documentation of how exactly the data is stored, and how SQL queries are performed? Where is the data stored (off-heap?)? When is the data loaded onto the heap? How do the Ignite pages/segments get to H2? Are Ignite and H2 indexes the same thing?

Cheers,
Eugene

On Wed, Aug 22, 2018 at 10:36 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:

> 1. /* PUBLIC.AFFINITY_KEY */ means the index on the affinity column is used. The full
> index will be scanned against the date condition.
> As I wrote, you can create a composite index to speed up the index scan.
> 2. "group sorted" means the index is used for grouping. It looks like H2 has an
> optimization for this, so grouping can be applied on the fly.
> Unsorted grouping would mean that we have to fetch the full dataset and only
> then group.
>
> On Wed, Aug 22, 2018 at 5:21 PM eugene miretsky <eugene.miret...@gmail.com> wrote:
>
>> Just as a reference, below are two execution plans, with and without the
>> index, on a very similar table.
>>
>> Adding the index removes /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.
>> 1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key? We are
>> setting customer_id as the affinity key. Is there a way to verify that?
>> 2) Is it possible that the removal of /* group sorted */ indicates that
>> the result of the GROUP BY must be sorted?
>> (hence taking a long time)
>>
>> *Query*
>> Select COUNT(*) FROM (SELECT customer_id FROM GAL2RU where dt > '2018-06-12'
>> GROUP BY customer_id having SUM(ru_total_app_sessions_count) > 2
>> AND MAX(ru_total_web_sessions_count) < 1)
>>
>> *Without an index*
>>
>> SELECT
>> __Z0.CUSTOMER_ID AS __C0_0,
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>> FROM PUBLIC.GAL2RU __Z0
>> /* PUBLIC.AFFINITY_KEY */
>> WHERE __Z0.DT > '2018-06-12'
>> GROUP BY __Z0.CUSTOMER_ID
>> /* group sorted */
>>
>> SELECT
>> COUNT(*)
>> FROM (
>> SELECT
>> __C0_0 AS CUSTOMER_ID
>> FROM PUBLIC.__T0
>> GROUP BY __C0_0
>> HAVING (SUM(__C0_1) > 2)
>> AND (MAX(__C0_2) < 1)
>>
>> *With an index*
>>
>> SELECT
>> __Z0.CUSTOMER_ID AS __C0_0,
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>> FROM PUBLIC.GAL2RU __Z0
>> /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
>> WHERE __Z0.DT > '2018-06-12'
>> GROUP BY __Z0.CUSTOMER_ID
>>
>> SELECT
>> COUNT(*)
>> FROM (
>> SELECT
>> __C0_0 AS CUSTOMER_ID
>> FROM PUBLIC.__T0
>> GROUP BY __C0_0
>> HAVING (SUM(__C0_1) > 2)
>> AND (MAX(__C0_2) < 1)
>> ) _0__Z1
>>
>> On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> We are using the Ignite notebook; any idea if there is a way to provide
>>> these flags and hints directly from SQL?
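>>> (The closest I have found so far is a connection property on the JDBC thin
>>> driver rather than pure SQL, something like:
>>>
>>>     jdbc:ignite:thin://127.0.0.1?collocated=true
>>>
>>> though I am not sure the notebook exposes it, so treat this as a guess.)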
>>>
>>> From your description, it seems like the query is executed in the following order:
>>> 1) Group by customer_id.
>>> 2) For each group, perform the filtering on date (using the index) and the aggregates.
>>>
>>> My impression was that the order is:
>>> 1) On each node, filter rows by date (using the index).
>>> 2) On each node, group the remaining rows by customer_id, and then perform the aggregate.
>>>
>>> That's why we created the index on the dt field, as opposed to the customer_id field.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
>>>
>>>> Eugene,
>>>>
>>>> 1. Note that queryParallelism splits indices, and Ignite works in a similar
>>>> way as if the index data resided on several nodes. These index parts can be
>>>> looked up in parallel threads.
>>>> 2. It is not a simple query, as your data is distributed among partitions,
>>>> is not collocated, and aggregate functions are used.
>>>> The HAVING clause here is the reason: Ignite can apply it on the reduce phase
>>>> only, as HAVING requires the aggregate value from all index parts.
>>>> 3. If your data is already collocated on customer_id, then you can hit
>>>> Ignite with SqlFieldsQuery.setCollocated(true). This should force
>>>> Ignite to optimize grouping and push down aggregates to the map phase.
>>>> 4. In the query plan you attached you can see H2 uses DT_IDX:
>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>> It is not effective. With this index, H2 has to process all the data to
>>>> calculate the aggregate for a group. An index on the affinity field may be more
>>>> effective, as data can be processed group by group:
>>>> once all of a group's data is processed, its result can be passed to the reducer.
>>>> Hope H2 is smart enough to do such streaming.
>>>>
>>>> Also, you can try to use a composite index on the (customer_id, date)
>>>> columns. Most likely, a hint will be needed [2].
>>>>
>>>> See also the collocated flag [1] and index hints [2].
>>>>
>>>> [1] https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>>>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints
>>>>
>>>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Thanks Andrey,
>>>>>
>>>>> Right now we are testing with only one big node, so the reduce step
>>>>> should not take any time.
>>>>>
>>>>> 1) We already set parallelism to 32, and I can still see only one core
>>>>> working. What else could be preventing multiple cores from working on the job?
>>>>> 2) Why would the reduce phase need to look at all the data? It seems
>>>>> like a fairly simple query.
>>>>> 3) We are already collocating data by customer_id (though, as I
>>>>> mentioned, right now there is only one node).
>>>>> 4) We are already using collocation and tried using an index; any other
>>>>> advice? Is there a way to check what Ignite is actually doing? How are
>>>>> indexes used (by Ignite or H2)?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 1. Possibly there is too much data to be looked up for the query.
>>>>>> With a single node and parallelism=1, the query will always run in a single thread.
>>>>>> You can try to add more nodes or increase query parallelism to
>>>>>> utilize more CPU cores.
>>>>>>
>>>>>> 2. The index on the date field may not be effective, as the reduce phase has to
>>>>>> look at all the data for further grouping.
>>>>>> Try adding an index on customer_id, or use collocation on customer_id
>>>>>> (usually the preferable way).
>>>>>>
>>>>>> Also, it is possible the bottleneck is the reduce phase.
>>>>>> Is it possible to collocate data by the GROUP BY column (customer_id)?
>>>>>> This collocation will allow you to use the collocated flag [1], and Ignite will
>>>>>> use a more optimal plan.
>>>>>>
>>>>>> 4. The main technique is to reduce the amount of data to be looked up in
>>>>>> every phase, using data collocation and indices.
>>>>>> Ignite produces two plans for distributed queries: map and reduce. You
>>>>>> can analyse and check these queries separately to understand how much data
>>>>>> is processed in the map phase and in the reduce phase.
>>>>>> The map query processes node-local data (unless distributed joins are on), while
>>>>>> reduce fetches data from remote nodes, which may be costly.
>>>>>>
>>>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Here is the result of EXPLAIN for the aforementioned query:
>>>>>>>
>>>>>>> SELECT
>>>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>>>
>>>>>>> SELECT
>>>>>>> COUNT(*)
>>>>>>> FROM (
>>>>>>> SELECT
>>>>>>> __C0_0 AS CUSTOMER_ID
>>>>>>> FROM PUBLIC.__T0
>>>>>>> GROUP BY __C0_0
>>>>>>> HAVING (SUM(__C0_1) > 2)
>>>>>>> AND (MAX(__C0_2) < 1)
>>>>>>> ) _0__Z1
>>>>>>>
>>>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a cache called GAL3EC1; it has:
>>>>>>>>
>>>>>>>> 1. A composite primary key consisting of customer_id and date
>>>>>>>> 2. An index on the date column
>>>>>>>> 3. 300 sparse columns
>>>>>>>>
>>>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>>>
>>>>>>>> The following query takes 8 min to finish:
>>>>>>>> Select COUNT(*) FROM (SELECT customer_id FROM GAl3ec1 where dt > '2018-05-12'
>>>>>>>> GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app) > 2
>>>>>>>> AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>>>
>>>>>>>> I have a few questions:
>>>>>>>>
>>>>>>>> 1. The 'top' command shows 100% CPU utilization (i.e. only one of
>>>>>>>> the 32 CPUs is used). How can I get the query to use all 32 CPUs? I have
>>>>>>>> tried setting query parallelism to 32, but it didn't help.
>>>>>>>> 2. Adding the index on the date column seems to have slowed down
>>>>>>>> the query. The 8 min time from above was without the index; with the index
>>>>>>>> the query doesn't finish (I gave up after 30 min). A similar query on a
>>>>>>>> smaller date range showed a 10x slowdown with the index. Why?
>>>>>>>> 3. Our loads from Spark are very slow as well, and also seem to
>>>>>>>> not use the system resources properly; can that be related?
>>>>>>>> 4. What are some good tools and techniques to troubleshoot
>>>>>>>> these problems in Ignite?
>>>>>>>>
>>>>>>>> All the relevant info is attached (configs, cache stats, node stats, etc.).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Andrey V. Mashenkov
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>
>>
>
> --
> Best regards,
> Andrey V. Mashenkov