Just as a reference, below are two execution plans, with and without the index, on a very similar table.
Adding the index removes /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.

1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key? We are setting
customer_id as the affinity key. Is there a way to verify that?
2) Is it possible that the removal of /* group sorted */ indicates that the
result of the GROUP BY must be sorted (hence taking a long time)?

*Query*

SELECT COUNT(*) FROM (SELECT customer_id FROM GAL2RU WHERE dt > '2018-06-12'
GROUP BY customer_id HAVING SUM(ru_total_app_sessions_count) > 2 AND
MAX(ru_total_web_sessions_count) < 1)

*Without an index*

SELECT
    __Z0.CUSTOMER_ID AS __C0_0,
    SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
    MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
FROM PUBLIC.GAL2RU __Z0
    /* PUBLIC.AFFINITY_KEY */
WHERE __Z0.DT > '2018-06-12'
GROUP BY __Z0.CUSTOMER_ID
    /* group sorted */

SELECT COUNT(*)
FROM (
    SELECT __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2) AND (MAX(__C0_2) < 1)
) _0__Z1

*With an index*

SELECT
    __Z0.CUSTOMER_ID AS __C0_0,
    SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
    MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
FROM PUBLIC.GAL2RU __Z0
    /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
WHERE __Z0.DT > '2018-06-12'
GROUP BY __Z0.CUSTOMER_ID

SELECT COUNT(*)
FROM (
    SELECT __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2) AND (MAX(__C0_2) < 1)
) _0__Z1

On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <eugene.miret...@gmail.com>
wrote:

> Thanks Andrey,
>
> We are using the Ignite notebook. Any idea if there is a way to provide
> these flags and hints directly from SQL?
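On question 1, one possible way to sanity-check which column the AFFINITY_KEY index actually covers is to query the H2 metadata. This is only a sketch: it assumes the Ignite version in use exposes H2's INFORMATION_SCHEMA over SQL, which may not hold for every release.

```sql
-- Hedged sketch: list the indexes on the table and the columns they cover,
-- via H2's INFORMATION_SCHEMA (assuming Ignite exposes it over SQL).
-- AFFINITY_KEY is the index Ignite auto-creates on the affinity column, so if
-- collocation is configured as intended, its COLUMN_NAME should be CUSTOMER_ID.
SELECT INDEX_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.INDEXES
WHERE TABLE_NAME = 'GAL2RU';
```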
>
> From your description, it seems like the query is executed in the
> following order:
> 1) Group by customer_id
> 2) For each group, perform the filtering on date using the index, and
> compute the aggregates
>
> My impression was that the order is:
> 1) On each node, filter the rows by date (using the index)
> 2) On each node, group the remaining rows by customer_id, and then
> perform the aggregate
>
> That's why we created the index on the dt field, as opposed to the
> customer_id field.
>
> Cheers,
> Eugene
>
>
> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
> andrey.mashen...@gmail.com> wrote:
>
>> Eugene,
>>
>> 1. Note that queryParallelism splits indices, and Ignite works in a
>> similar way as if the index data resided on several nodes. These index
>> parts can be looked up in parallel threads.
>> 2. It is not a simple query, as your data is distributed among
>> partitions, is not collocated, and aggregate functions are used.
>> The HAVING clause is the reason here: Ignite can apply it on the reduce
>> phase only, as HAVING requires the aggregate values from all index parts.
>> 3. If your data is already collocated on customer_id, then you can hint
>> Ignite with SqlFieldsQuery.setCollocated(true). This should force Ignite
>> to optimize the grouping and push down the aggregates to the map phase.
>> 4. In the query plan you attached, you can see that H2 uses DT_IDX:
>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>> This is not effective. With this index, H2 has to process all the data
>> to calculate the aggregate for each group. An index on the affinity
>> field may be more effective, as the data can then be processed group by
>> group: once all of a group's data is processed, its result can be passed
>> to the reducer. Hopefully, H2 is smart enough to do such streaming.
>>
>> Also, you can try to use a composite index on the (customer_id, date)
>> columns. Most likely, a hint will be needed [2].
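Andrey's composite-index suggestion might look roughly like the following. This is a sketch, not a tested recipe: the index name cust_dt_idx is made up, and the USE INDEX hint syntax is taken from the index-hints documentation referenced as [2] in this thread.

```sql
-- Sketch of the suggested composite index (name cust_dt_idx is arbitrary):
CREATE INDEX cust_dt_idx ON GAL2RU (customer_id, dt);

-- Nudge H2 toward it with an index hint [2]:
SELECT COUNT(*) FROM (
    SELECT customer_id
    FROM GAL2RU USE INDEX (cust_dt_idx)
    WHERE dt > '2018-06-12'
    GROUP BY customer_id
    HAVING SUM(ru_total_app_sessions_count) > 2
       AND MAX(ru_total_web_sessions_count) < 1
);
```

The idea is that an index leading with the affinity/grouping column lets the data be read group by group, so each group's aggregate can be finished and emitted before the next group starts, instead of scanning the whole date range first.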
>>
>> See also the collocated flag [1] and index hints [2].
>>
>> [1] https://ignite.apache.org/releases/latest/javadoc/org/apache
>> /ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-and
>> -debugging#index-hints
>>
>>
>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> Right now we are testing with only one big node, so the reduce step
>>> should not take any time.
>>>
>>> 1) We already set parallelism to 32, and I can still see only 1 core
>>> working. Could anything else be preventing multiple cores from working
>>> on the job?
>>> 2) Why would the reduce phase need to look at all the data? It seems
>>> like a fairly simple query.
>>> 3) We are already collocating data by customer_id (though, as I
>>> mentioned, right now there is only one node).
>>> 4) We are already using collocation and have tried using an index; any
>>> other advice? Is there a way to check what Ignite is actually doing?
>>> How are indexes used (by Ignite or by H2)?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>> andrey.mashen...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> 1. Possibly there is too much data to be looked up for the query.
>>>> With a single node and parallelism=1, the query will always run in a
>>>> single thread. You can try to add more nodes or increase the query
>>>> parallelism to utilize more CPU cores.
>>>>
>>>> 2. The index on the date field may not be effective, as the reduce
>>>> phase has to look at all the data for further grouping.
>>>> Try adding an index on customer_id, or use collocation on customer_id
>>>> (usually the more preferable way).
>>>>
>>>> Also, it is possible that the bottleneck is the reduce phase.
>>>> Is it possible to collocate the data by the GROUP BY column
>>>> (customer_id)? This collocation will allow you to use the collocated
>>>> flag [1], and Ignite will use a more optimal plan.
>>>>
>>>> 4.
The main technique is to reduce the amount of data to be looked
>>>> up in every phase, using data collocation and indices.
>>>> Ignite provides 2 plans for distributed queries: map and reduce. You
>>>> can analyse and check these queries separately to understand how much
>>>> data is processed on the map phase and on the reduce phase.
>>>> The map query processes node-local data (unless distributed joins are
>>>> on), while the reduce query fetches data from remote nodes, which may
>>>> be costly.
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>> eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Here is the result of EXPLAIN for the aforementioned query:
>>>>>
>>>>> SELECT
>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>
>>>>> SELECT
>>>>> COUNT(*)
>>>>> FROM (
>>>>> SELECT
>>>>> __C0_0 AS CUSTOMER_ID
>>>>> FROM PUBLIC.__T0
>>>>> GROUP BY __C0_0
>>>>> HAVING (SUM(__C0_1) > 2)
>>>>> AND (MAX(__C0_2) < 1)
>>>>> ) _0__Z1
>>>>>
>>>>>
>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a cache called GAL3EC1. It has:
>>>>>>
>>>>>> 1. A composite primary key consisting of customer_id and date
>>>>>> 2. An index on the date column
>>>>>> 3. 300 sparse columns
>>>>>>
>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>
>>>>>> The following query takes 8 min to finish:
>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM GAL3EC1 WHERE dt >
>>>>>> '2018-05-12' GROUP BY customer_id HAVING
>>>>>> SUM(ec1_bknt_total_product_views_app) > 2 AND
>>>>>> MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>
>>>>>> I have a few questions:
>>>>>>
>>>>>> 1. The 'top' command shows 100% CPU utilization (i.e. only one of
>>>>>> the 32 CPUs is used). How can I get the query to use all 32 CPUs?
I have tried
>>>>>> setting the query parallelism to 32, but it didn't help.
>>>>>> 2. Adding the index on the date column seems to have slowed down the
>>>>>> query. The 8 min time from above was without the index; with the
>>>>>> index, the query doesn't finish (I gave up after 30 min). A similar
>>>>>> query on a smaller date range showed a 10x slowdown with the index.
>>>>>> Why?
>>>>>> 3. Our loads from Spark are very slow as well, and also seem to not
>>>>>> use the system resources properly. Can that be related?
>>>>>> 4. What are some good tools and techniques to troubleshoot these
>>>>>> problems in Ignite?
>>>>>>
>>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>>> etc.).
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>
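For reference, on the question of passing the collocated flag without leaving SQL tooling: with the Java API it goes on the query object, and the JDBC thin driver is supposed to accept it as a connection property. The following is an untested sketch against the Ignite 2.x API; the exact JDBC URL parameter syntax may vary by version, so treat the URL as an assumption to verify against the driver docs.

```java
// Sketch: enabling the collocated-grouping optimization via the Java API.
// Only safe if the data really is collocated by customer_id.
SqlFieldsQuery qry = new SqlFieldsQuery(
        "SELECT customer_id FROM GAL2RU WHERE dt > ? GROUP BY customer_id")
    .setArgs("2018-06-12")
    .setCollocated(true);

// With the JDBC thin driver, the same flag can be set as a connection
// property (parameter syntax is version-dependent; check the driver docs):
// jdbc:ignite:thin://127.0.0.1?collocated=true
```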