Thanks, I tried the composite key and provided the index hint in the query, but the query plan and execution time stayed the same. So really I am back to where we started.
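For reference, a sketch of what this looked like (the index name cust_dt_idx is illustrative; the hint syntax is the one from the Ignite SQL performance docs):

    CREATE INDEX cust_dt_idx ON GAL2RU (customer_id, dt);

    SELECT COUNT(*) FROM (
      SELECT customer_id FROM GAL2RU USE INDEX(cust_dt_idx)
      WHERE dt > '2018-06-12'
      GROUP BY customer_id
      HAVING SUM(ru_total_app_sessions_count) > 2
        AND MAX(ru_total_web_sessions_count) < 1
    );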
Right now I suspect that I am not setting the affinity key properly: I intend the key to be customer_id, but maybe Ignite sets it to the _key (the composite of customer_id, dt). Is there a way to check it? Also, is there documentation of how exactly the data is stored, and how SQL queries are performed? Where is the data stored (off-heap?)? When is the data loaded onto the heap? How do the Ignite pages/segments get to H2? Are Ignite and H2 indexes the same thing?

Cheers,
Eugene

On Wed, Aug 22, 2018 at 10:36 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:

> 1. /* PUBLIC.AFFINITY_KEY */ means the index on the affinity column is used. The full
> index will be scanned against the date condition.
> As I wrote, you can create a composite index to speed up the index scan.
> 2. "group sorted" means the index is used for grouping. It looks like H2 has an
> optimization for this, so grouping can be applied on the fly.
> Unsorted grouping would mean that we have to fetch the full dataset and only
> then group.
>
> On Wed, Aug 22, 2018 at 5:21 PM eugene miretsky <eugene.miret...@gmail.com> wrote:
>
>> Just as a reference, below are two execution plans, with and without the
>> index, on a very similar table.
>>
>> Adding the index removes /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.
>> 1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key? We are
>> setting customer_id as the affinity key. Is there a way to verify that?
>> 2) Is it possible that the removal of /* group sorted */ indicates that
>> the result of the GROUP BY must be sorted?
>> (hence taking a long time)
>>
>> *Query*
>> Select COUNT(*) FROM (SELECT customer_id FROM GAL2RU where dt > '2018-06-12'
>> GROUP BY customer_id having SUM(ru_total_app_sessions_count) > 2
>> AND MAX(ru_total_web_sessions_count) < 1)
>>
>> *Without an index*
>>
>> SELECT
>> __Z0.CUSTOMER_ID AS __C0_0,
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>> FROM PUBLIC.GAL2RU __Z0
>> /* PUBLIC.AFFINITY_KEY */
>> WHERE __Z0.DT > '2018-06-12'
>> GROUP BY __Z0.CUSTOMER_ID
>> /* group sorted */
>>
>> SELECT
>> COUNT(*)
>> FROM (
>> SELECT
>> __C0_0 AS CUSTOMER_ID
>> FROM PUBLIC.__T0
>> GROUP BY __C0_0
>> HAVING (SUM(__C0_1) > 2)
>> AND (MAX(__C0_2) < 1)
>>
>> *With an index*
>>
>> SELECT
>> __Z0.CUSTOMER_ID AS __C0_0,
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>> FROM PUBLIC.GAL2RU __Z0
>> /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
>> WHERE __Z0.DT > '2018-06-12'
>> GROUP BY __Z0.CUSTOMER_ID
>>
>> SELECT
>> COUNT(*)
>> FROM (
>> SELECT
>> __C0_0 AS CUSTOMER_ID
>> FROM PUBLIC.__T0
>> GROUP BY __C0_0
>> HAVING (SUM(__C0_1) > 2)
>> AND (MAX(__C0_2) < 1)
>> ) _0__Z1
>>
>> On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> We are using the Ignite notebook; any idea if there is a way to provide
>>> these flags and hints directly from SQL?
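>>> (The closest I have found so far is a connection property on the JDBC thin
>>> driver rather than pure SQL, something like:
>>>
>>>     jdbc:ignite:thin://127.0.0.1?collocated=true
>>>
>>> though I am not sure the notebook exposes it, so treat this as a guess.)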
>>>
>>> From your description, it seems like the query is executed in the following order:
>>> 1) Group by customer_id.
>>> 2) For each group, perform the filtering on date (using the index) and the aggregates.
>>>
>>> My impression was that the order is:
>>> 1) On each node, filter rows by date (using the index).
>>> 2) On each node, group the remaining rows by customer_id, and then perform the aggregate.
>>>
>>> That's why we created the index on the dt field, as opposed to the customer_id field.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
>>>
>>>> Eugene,
>>>>
>>>> 1. Note that queryParallelism splits indices, and Ignite works in a similar
>>>> way as if the index data resided on several nodes. These index parts can be
>>>> looked up in parallel threads.
>>>> 2. It is not a simple query, as your data is distributed among partitions,
>>>> is not collocated, and aggregate functions are used.
>>>> The HAVING clause here is the reason: Ignite can apply it on the reduce phase
>>>> only, as HAVING requires the aggregate value from all index parts.
>>>> 3. If your data is already collocated on customer_id, then you can hit
>>>> Ignite with SqlFieldsQuery.setCollocated(true). This should force
>>>> Ignite to optimize grouping and push down aggregates to the map phase.
>>>> 4. In the query plan you attached you can see H2 uses DT_IDX:
>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>> It is not effective. With this index, H2 has to process all the data to
>>>> calculate the aggregate for a group. An index on the affinity field may be more
>>>> effective, as data can be processed group by group:
>>>> once all of a group's data is processed, its result can be passed to the reducer.
>>>> Hope H2 is smart enough to do such streaming.
>>>>
>>>> Also, you can try to use a composite index on the (customer_id, date)
>>>> columns. Most likely, a hint will be needed [2].
>>>>
>>>> See also the collocated flag [1] and index hints [2].
>>>>
>>>> [1] https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>>>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints
>>>>
>>>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Thanks Andrey,
>>>>>
>>>>> Right now we are testing with only one big node, so the reduce step
>>>>> should not take any time.
>>>>>
>>>>> 1) We already set parallelism to 32, and I can still see only one core
>>>>> working. What else could be preventing multiple cores from working on the job?
>>>>> 2) Why would the reduce phase need to look at all the data? It seems
>>>>> like a fairly simple query.
>>>>> 3) We are already collocating data by customer_id (though, as I
>>>>> mentioned, right now there is only one node).
>>>>> 4) We are already using collocation and tried using an index; any other
>>>>> advice? Is there a way to check what Ignite is actually doing? How are
>>>>> indexes used (by Ignite or H2)?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 1. Possibly there is too much data to be looked up for the query.
>>>>>> With a single node and parallelism=1, the query will always run in a single thread.
>>>>>> You can try to add more nodes or increase query parallelism to
>>>>>> utilize more CPU cores.
>>>>>>
>>>>>> 2. The index on the date field may not be effective, as the reduce phase has to
>>>>>> look at all the data for further grouping.
>>>>>> Try adding an index on customer_id, or use collocation on customer_id
>>>>>> (usually the preferable way).
>>>>>>
>>>>>> Also, it is possible the bottleneck is the reduce phase.
>>>>>> Is it possible to collocate data by the GROUP BY column (customer_id)?
>>>>>> This collocation will allow you to use the collocated flag [1], and Ignite will
>>>>>> use a more optimal plan.
>>>>>>
>>>>>> 4. The main technique is to reduce the amount of data to be looked up in
>>>>>> every phase, using data collocation and indices.
>>>>>> Ignite produces two plans for distributed queries: map and reduce. You
>>>>>> can analyse and check these queries separately to understand how much data
>>>>>> is processed in the map phase and in the reduce phase.
>>>>>> The map query processes node-local data (unless distributed joins are on), while
>>>>>> reduce fetches data from remote nodes, which may be costly.
>>>>>>
>>>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Here is the result of EXPLAIN for the aforementioned query:
>>>>>>>
>>>>>>> SELECT
>>>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>>>
>>>>>>> SELECT
>>>>>>> COUNT(*)
>>>>>>> FROM (
>>>>>>> SELECT
>>>>>>> __C0_0 AS CUSTOMER_ID
>>>>>>> FROM PUBLIC.__T0
>>>>>>> GROUP BY __C0_0
>>>>>>> HAVING (SUM(__C0_1) > 2)
>>>>>>> AND (MAX(__C0_2) < 1)
>>>>>>> ) _0__Z1
>>>>>>>
>>>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a cache called GAL3EC1; it has:
>>>>>>>>
>>>>>>>> 1. A composite primary key consisting of customer_id and date
>>>>>>>> 2. An index on the date column
>>>>>>>> 3. 300 sparse columns
>>>>>>>>
>>>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>>>
>>>>>>>> The following query takes 8 min to finish:
>>>>>>>> Select COUNT(*) FROM (SELECT customer_id FROM GAl3ec1 where dt > '2018-05-12'
>>>>>>>> GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app) > 2
>>>>>>>> AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>>>
>>>>>>>> I have a few questions:
>>>>>>>>
>>>>>>>> 1. The 'top' command shows 100% CPU utilization (i.e. only one of
>>>>>>>> the 32 CPUs is used). How can I get the query to use all 32 CPUs? I have
>>>>>>>> tried setting query parallelism to 32, but it didn't help.
>>>>>>>> 2. Adding the index on the date column seems to have slowed down
>>>>>>>> the query. The 8 min time from above was without the index; with the index
>>>>>>>> the query doesn't finish (I gave up after 30 min). A similar query on a
>>>>>>>> smaller date range showed a 10x slowdown with the index. Why?
>>>>>>>> 3. Our loads from Spark are very slow as well, and also seem to
>>>>>>>> not use the system resources properly; can that be related?
>>>>>>>> 4. What are some good tools and techniques to troubleshoot
>>>>>>>> these problems in Ignite?
>>>>>>>>
>>>>>>>> All the relevant info is attached (configs, cache stats, node stats, etc.).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Andrey V. Mashenkov
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>
>>
>
> --
> Best regards,
> Andrey V. Mashenkov