Hello! I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.
Regards, -- Ilya Kasnacheev вт, 18 сент. 2018 г. в 17:47, eugene miretsky <eugene.miret...@gmail.com>: > Hi Ilya, > > The different query result was my mistake - one of the categoy_ids was > duplicate, so in the query that used join, it counted rows for that > category twice. My apologies. > > However, we are still having an issue with query time, and the index not > being applied to category_id. Would appreciate if you could take a look. > > Cheers, > Eugene > > On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> > wrote: > >> Hello! >> >> Why don't you diff the results of those two queries, tell us what the >> difference is? >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <eugene.miret...@gmail.com >> >: >> >>> Hello, >>> >>> Just wanted to see if anybody had time to look into this. >>> >>> Cheers, >>> Eugene >>> >>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky < >>> eugene.miret...@gmail.com> wrote: >>> >>>> Thanks! >>>> >>>> Tried joining with an inlined table instead of IN as per the second >>>> suggestion, and it didn't quite work. >>>> >>>> Query1: >>>> >>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 use >>>> Index( ) where category_id in (9005, 175930, 175930, >>>> 175940,175945,101450, >>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>> SUM(product_clicks_app) > 1 ) >>>> - exec time = 17s >>>> - *Result: 3105868* >>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or >>>> customer_id index >>>> - Using an index on category_id increases the query time 33s >>>> >>>> Query2: >>>> >>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga use >>>> index (PUBLIC."_key_PK") inner join table(category_id int = (9005, >>>> 175930, >>>> 175930, 175940,175945,101450, 6453)) cats on cats.category_id = >>>> ga.category_id group by customer_id having SUM(product_views_app) > 2 >>>> OR >>>> SUM(product_clicks_app) > 1 ) >>>> - exec time = 38s >>>> - *Result: 3113921* >>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or >>>> customer_id index or category_id index >>>> - Using an index on category_id doesnt change the run time >>>> >>>> Query plans are attached. >>>> >>>> 3 questions: >>>> >>>> 1. Why is the result differnt for the 2 queries - this is quite >>>> concerning. >>>> 2. Why is the 2nd query taking longer >>>> 3. Why category_id index doesn't work in case of query 2. >>>> >>>> >>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev < >>>> ilya.kasnach...@gmail.com> wrote: >>>> >>>>> Hello! >>>>> >>>>> I don't think that we're able to use index with IN () clauses. Please >>>>> convert it into OR clauses. >>>>> >>>>> Please see >>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations >>>>> >>>>> Regards, >>>>> -- >>>>> Ilya Kasnacheev >>>>> >>>>> >>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov < >>>>> andrey.mashen...@gmail.com>: >>>>> >>>>>> Hi >>>>>> >>>>>> Actually, first query uses index on affinity key which looks more >>>>>> efficient than index on category_id column. >>>>>> The first query can process groups one by one and stream partial >>>>>> results from map phase to reduce phase as it use sorted index lookup, >>>>>> while second query should process full dataset on map phase before >>>>>> pass it for reducing. >>>>>> >>>>>> Try to use composite index (customer_id, category_id). >>>>>> >>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to >>>>>> build more efficient plan when group by on collocated column is used. >>>>>> >>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky < >>>>>> eugene.miret...@gmail.com> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> Schema: >>>>>>> >>>>>>> - >>>>>>> >>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID >>>>>>> >>>>>>> PUBLIC.GATABLE2.DT >>>>>>> >>>>>>> PUBLIC.GATABLE2.CATEGORY_ID >>>>>>> >>>>>>> PUBLIC.GATABLE2.VERTICAL_ID >>>>>>> >>>>>>> PUBLIC.GATABLE2.SERVICE >>>>>>> >>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>>>>>> >>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>>>>>> >>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>>>>>> >>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>>>>>> >>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>>>>>> >>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>>>>>> - pkey = customer_id,dt >>>>>>> - affinityKey = customer >>>>>>> >>>>>>> Query: >>>>>>> >>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2 where >>>>>>> category_id in (175925, 101450, 9005, 175930, 175930, >>>>>>> 175940,175945,101450, >>>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>>>>> SUM(product_clicks_app) > 1 ) >>>>>>> >>>>>>> The table has 600M rows. >>>>>>> At first, the query took 1m, when we added an index on category_id >>>>>>> the query started taking 3m. >>>>>>> >>>>>>> The SQL execution plan for both queries is attached. >>>>>>> >>>>>>> We are using a single x1.16xlarge insntace with query parallelism >>>>>>> set to 32 >>>>>>> >>>>>>> Cheers, >>>>>>> Eugene >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Best regards, >>>>>> Andrey V. Mashenkov >>>>>> >>>>>