Just as a reference, below are two execution plans, with and without the
index, on a very similar table.

Adding the index removes /* PUBLIC.AFFINITY_KEY */ and /* group sorted */
from the plan.
1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key? We are
setting customer_id as the affinity key. Is there a way to verify that? (A
possible check is sketched below.)
2) Is it possible that the removal of /* group sorted */ indicates that the
result of the GROUP BY must now be sorted (hence taking a long time)?
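
A rough sketch of one way to check the configured affinity key from the Java
API (a sketch only: it assumes the table was created via CREATE TABLE, so the
cache name defaults to SQL_PUBLIC_GAL2RU, and that a node is started in this
JVM; classes are from the public org.apache.ignite API):

Ignite ignite = Ignition.ignite();
IgniteCache<?, ?> cache = ignite.cache("SQL_PUBLIC_GAL2RU");

// Read the key configuration back from the cache and print the affinity key
// field per key type; we would expect CUSTOMER_ID here.
CacheConfiguration ccfg = cache.getConfiguration(CacheConfiguration.class);
CacheKeyConfiguration[] keyCfgs = ccfg.getKeyConfiguration();

if (keyCfgs == null || keyCfgs.length == 0)
    // The mapping may instead live in the key type's binary metadata.
    System.out.println("No explicit key configuration on this cache.");
else
    for (CacheKeyConfiguration keyCfg : keyCfgs)
        System.out.println(keyCfg.getTypeName() + " -> " + keyCfg.getAffinityKeyFieldName());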

*Query*
Select COUNT (*) FROM (SELECT customer_id FROM GAL2RU where dt >
'2018-06-12' GROUP BY customer_id having SUM(ru_total_app_sessions_count) >
2 AND MAX(ru_total_web_sessions_count) < 1)

*Without an index*

SELECT
__Z0.CUSTOMER_ID AS __C0_0,
SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
FROM PUBLIC.GAL2RU __Z0
/* PUBLIC.AFFINITY_KEY */
WHERE __Z0.DT > '2018-06-12'
GROUP BY __Z0.CUSTOMER_ID
/* group sorted */

SELECT
COUNT(*)
FROM (
SELECT
__C0_0 AS CUSTOMER_ID
FROM PUBLIC.__T0
GROUP BY __C0_0
HAVING (SUM(__C0_1) > 2)
AND (MAX(__C0_2) < 1)


*With an index*

SELECT
__Z0.CUSTOMER_ID AS __C0_0,
SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
FROM PUBLIC.GAL2RU __Z0
/* PUBLIC.DT_IDX2: DT > '2018-06-12' */
WHERE __Z0.DT > '2018-06-12'
GROUP BY __Z0.CUSTOMER_ID

SELECT
COUNT(*)
FROM (
SELECT
__C0_0 AS CUSTOMER_ID
FROM PUBLIC.__T0
GROUP BY __C0_0
HAVING (SUM(__C0_1) > 2)
AND (MAX(__C0_2) < 1)
) _0__Z1


On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <eugene.miret...@gmail.com>
wrote:

> Thanks Andrey,
>
> We are using the Ignite notebook; any idea if there is a way to provide
> these flags and hints directly from SQL?
>
> From your description, it seems like the query is executed in the
> following order:
> 1) Group by customer_id
> 2) For each group, perform the date filtering (using the index) and the
> aggregates
>
> My impression was that the order is:
> 1) On each node, filter rows by date (using the index)
> 2) On each node, group the remaining rows by customer_id, and then
> perform the aggregation
>
> That's why we created the index on the dt field, as opposed to the
> customer_id field.
>
> Cheers,
> Eugene
>
>
> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
> andrey.mashen...@gmail.com> wrote:
>
>> Eugene,
>>
>> 1. Note that queryParallelism splits indices, and Ignite works much as if
>> the index data resided on several nodes. These index parts can be looked
>> up in parallel threads.
>> 2. It is not a simple query, as your data is distributed among partitions,
>> is not collocated, and aggregate functions are used.
>> The HAVING clause here is the reason: Ignite can apply it on the reduce
>> phase only, as HAVING requires the aggregate value from all index parts.
>> 3. If your data is already collocated on customer_id, then you can hint
>> Ignite with SqlFieldsQuery.setCollocated(true). This should force Ignite to
>> optimize grouping and push aggregates down to the map phase (see the sketch
>> after these points).
>> 4. In the query plan you attached you can see H2 uses DT_IDX:
>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>> It is not effective: with this index H2 has to process all the data to
>> calculate the aggregate for a group. An index on the affinity field may be
>> more effective, as data can be processed group by group; once all of a
>> group's data is processed, the result can be passed to the reducer.
>> Hopefully H2 is smart enough to do such streaming.
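>>
>> A rough sketch of how the collocated flag could be set through the Java API
>> (illustration only: the cache name SQL_PUBLIC_GAL2RU is the default name for
>> a DDL-created table, and the query text is the one from this thread):
>>
>> IgniteCache<?, ?> cache = Ignition.ignite().cache("SQL_PUBLIC_GAL2RU");
>>
>> SqlFieldsQuery qry = new SqlFieldsQuery(
>>         "SELECT COUNT(*) FROM (" +
>>         " SELECT customer_id FROM GAL2RU WHERE dt > '2018-06-12'" +
>>         " GROUP BY customer_id" +
>>         " HAVING SUM(ru_total_app_sessions_count) > 2" +
>>         " AND MAX(ru_total_web_sessions_count) < 1)")
>>         // Only valid because the data is collocated on customer_id; this
>>         // lets Ignite push grouping and aggregation down to the map phase.
>>         .setCollocated(true);
>>
>> try (FieldsQueryCursor<List<?>> cur = cache.query(qry)) {
>>     System.out.println("count = " + cur.getAll().get(0).get(0));
>> }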
>>
>> Also, you can try to use a composite index on the (customer_id, date)
>> columns. Most likely an index hint will be needed [2] (see the sketch below).
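>>
>> A possible shape for that (sketch only: the index name CUST_DT_IDX is made
>> up here, the USE INDEX hint syntax is the one described in [2], and "cache"
>> is the handle from the sketch above):
>>
>> // Create a composite index on the affinity column plus the date column.
>> cache.query(new SqlFieldsQuery(
>>     "CREATE INDEX IF NOT EXISTS CUST_DT_IDX ON GAL2RU (CUSTOMER_ID, DT)")).getAll();
>>
>> // Hint H2 to use it for the filtered, grouped select.
>> SqlFieldsQuery hinted = new SqlFieldsQuery(
>>     "SELECT CUSTOMER_ID FROM GAL2RU USE INDEX(CUST_DT_IDX)" +
>>     " WHERE DT > '2018-06-12' GROUP BY CUSTOMER_ID");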
>>
>> See also the collocated flag [1] and index hints [2]:
>>
>> [1] https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints
>>
>>
>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> Right now we are testing with only one big node, so the reduce step
>>> should not take any time.
>>>
>>> 1) We already set parallelism to 32, and I can still see only 1 core
>>> working. Is there anything else that could be preventing multiple cores
>>> from working on the job?
>>> 2) Why would the reduce phase need to look at all the data? It seems
>>> like a fairly simple query.
>>> 3) We are already collocating data by customer_id (though, as I
>>> mentioned, right now there is only one node).
>>> 4) We are already using collocation and tried using an index; any other
>>> advice? Is there a way to check what Ignite is actually doing? How are
>>> indexes used (by Ignite or H2)?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>> andrey.mashen...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> 1. Possibly there is too much data to be scanned for the query.
>>>> With a single node and parallelism=1 the query will always run in a single
>>>> thread. You can try to add more nodes or increase query parallelism to
>>>> utilize more CPU cores (a configuration sketch follows after these points).
>>>>
>>>> 2. An index on the date field may not be effective, as the reduce phase
>>>> has to look at all the data for further grouping.
>>>> Try adding an index on customer_id, or use collocation on customer_id
>>>> (usually the preferable way).
>>>>
>>>> 3. It is also possible that the bottleneck is the reduce phase.
>>>> Is it possible to collocate data by the GROUP BY column (customer_id)?
>>>> That collocation will allow you to use the collocated flag [1], and Ignite
>>>> will use a more optimal plan.
>>>>
>>>> 4. The main technique is to reduce the amount of data to be looked up on
>>>> every phase by using data collocation and indices.
>>>> Ignite produces two plans for distributed queries: map and reduce. You can
>>>> analyse and check these queries separately to understand how much data is
>>>> processed on the map phase and on the reduce phase.
>>>> The map query processes node-local data (unless distributed joins are
>>>> enabled), while the reduce query fetches data from remote nodes, which may
>>>> be costly.
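>>>>
>>>> As mentioned in point 1, a minimal sketch of the query parallelism setting
>>>> (Java-side cache configuration; the cache name and the value of 32 are
>>>> examples only):
>>>>
>>>> CacheConfiguration<?, ?> ccfg = new CacheConfiguration<>("SQL_PUBLIC_GAL3EC1");
>>>>
>>>> // Split each index of this cache into 32 segments so that a single SQL
>>>> // query can be executed by up to 32 threads on one node.
>>>> ccfg.setQueryParallelism(32);
>>>>
>>>> // The setting only takes effect when the cache is (re)created,
>>>> // e.g. ignite.getOrCreateCache(ccfg).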
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>> eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Here is the result of EXPLAIN for the aforementioned query:
>>>>>
>>>>> SELECT
>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>> SELECT
>>>>> COUNT(*)
>>>>> FROM (
>>>>> SELECT
>>>>> __C0_0 AS CUSTOMER_ID
>>>>> FROM PUBLIC.__T0
>>>>> GROUP BY __C0_0
>>>>> HAVING (SUM(__C0_1) > 2)
>>>>> AND (MAX(__C0_2) < 1)
>>>>> ) _0__Z1
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a cache called GAL3EC1; it has:
>>>>>>
>>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>>    2. An index on the date column
>>>>>>    3. 300 sparse columns
>>>>>>
>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>
>>>>>> The following query takes 8 min to finish:
>>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt > '2018-05-12'
>>>>>> GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app) > 2
>>>>>> AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>
>>>>>> I have a few questions:
>>>>>>
>>>>>>    1. The 'top' command shows 100% CPU utilization (i.e., only one of
>>>>>>    the 32 CPUs is used). How can I get the query to use all 32 CPUs? I
>>>>>>    have tried setting Query Parallelism to 32, but it didn't help.
>>>>>>    2. Adding the index on the date column seems to have slowed down the
>>>>>>    query. The 8 min time from above was without the index; with the index
>>>>>>    the query doesn't finish (I gave up after 30 min). A similar query on
>>>>>>    a smaller date range showed a 10x slowdown with the index. Why?
>>>>>>    3. Our loads from Spark are very slow as well, and also seem to not
>>>>>>    use the system resources properly; can that be related?
>>>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>>>    problems in Ignite?
>>>>>>
>>>>>>
>>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>>> etc.).
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>>
>>>
>>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>
>
