Thanks! Could you please clarfiy "*In case of a composite index, it will apply the columns one by one"? *
Igntie (or rather H2?) needs to load the data into heap in order to do the groupBy & aggregations. We were hoping that only data that matches the category filter will be loaded. *What does one by one mean when: (assuming and index *(customer_id, category_id)*) * 1. *The fiilter is on both customer and category. What data will be loaded into Heap?* 2. *The fitler is only on **category, and the customer is just used for groupBy. Will Ignite* 1. * load one customer with all the rows, and apply the category filter in heap* 2. *load one customer, but load only the rows that pass the category fitler in heap* 3. *load all the events that pass the category filter, and then group them by customer. * *From out benchmarking so far it seems like 1 is happening. * On Thu, Oct 11, 2018 at 1:28 PM Stanislav Lukyanov <stanlukya...@gmail.com> wrote: > Hi, > > > > It is a rather lengthy thread and I can’t dive into details right now, > > but AFAICS the issue now is making affinity key index to work with a > secondary index. > > The important things to understand is > > 1. Ignite will only use one index per table > 2. In case of a composite index, it will apply the columns one by one > 3. The affinity key index should always go first as the first step is > splitting the query by affinity key values > > > > So, to use index over the affinity key (customer_id) and a secondary index > (category_id) one needs to create an index > > like (customer_id, category_id), in that order, with no columns in between. > > Note that index (customer_id, dt, category_id) can’t be used instead of it. > > On the other hand, (customer_id, category_id, dt) can - the last part of > the index will be left unused. > > > > Thanks, > > Stan > > > > *From: *eugene miretsky <eugene.miret...@gmail.com> > *Sent: *9 октября 2018 г. 19:40 > *To: *user@ignite.apache.org > *Subject: *Re: Query 3x slower with index > > > > Hi Ilya, > > > > I have tried it, and got the same performance as forcing using category > index in my initial benchmark - query is 3x slowers and uses only one > thread. > > > > From my experiments so far it seems like Ignite can either (a) use > affinity key and run queries in parallel, (b) use index but run the query > on only one thread. > > > > Has anybody been able to run OLAP like queries in while using an index? > > > > Cheers, > > Eugene > > > > On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev < > ilya.kasnach...@gmail.com> wrote: > > Hello! > > > > I guess that using AFFINITY_KEY as index have something to do with the > fact that GROUP BY really wants to work per-partition. > > > > I have the following query for you: > > > > 1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select > customer_id from (Select customer_id, product_views_app, product_clict_app > from GA_DATA ga join table(category_id int = ( 117930, 175930, > 175940,175945,101450)) cats on cats.category_id = ga.category_id) data > group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clict_app) > 1); > PLAN SELECT > DATA__Z2.CUSTOMER_ID AS __C0_0, > SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1, > SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2 > FROM ( > SELECT > GA__Z0.CUSTOMER_ID, > GA__Z0.PRODUCT_VIEWS_APP, > GA__Z0.PRODUCT_CLICT_APP > FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, > 101450)) CATS__Z1 > INNER JOIN PUBLIC.GA_DATA GA__Z0 > ON 1=1 > WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID > ) DATA__Z2 > /* SELECT > GA__Z0.CUSTOMER_ID, > GA__Z0.PRODUCT_VIEWS_APP, > GA__Z0.PRODUCT_CLICT_APP > FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, > 101450)) CATS__Z1 > /++ function ++/ > INNER JOIN PUBLIC.GA_DATA GA__Z0 > /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/ > ON 1=1 > WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID > */ > GROUP BY DATA__Z2.CUSTOMER_ID > > PLAN SELECT > COUNT(*) > FROM ( > SELECT > __C0_0 AS CUSTOMER_ID > FROM PUBLIC.__T0 > GROUP BY __C0_0 > HAVING (SUM(__C0_1) > 2) > OR (SUM(__C0_2) > 1) > ) _18__Z3 > /* SELECT > __C0_0 AS CUSTOMER_ID > FROM PUBLIC.__T0 > /++ PUBLIC."merge_scan" ++/ > GROUP BY __C0_0 > HAVING (SUM(__C0_1) > 2) > OR (SUM(__C0_2) > 1) > */ > > > > However, I'm not sure it is "optimal" or not since I have no idea if it > will perform better or worse on real data. That's why I need a subset of > data which will make query execution speed readily visible. Unfortunately, > I can't deduce that from query plan alone. > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > пн, 24 сент. 2018 г. в 16:14, eugene miretsky <eugene.miret...@gmail.com>: > > An easy way to reproduce would be to > > > > 1. Create table > > CREATE TABLE GA_DATA ( > > customer_id bigint, > > dt timestamp, > > category_id int, > > product_views_app int, > > product_clict_app int, > > product_clict_web int, > > product_clict_web int, > > PRIMARY KEY (customer_id, dt, category_id) > > ) WITH "template=ga_template, backups=0, affinityKey=customer_id"; > > > > 2. Create indexes > > · CREATE INDEX ga_customer_id ON GA_Data (customer_id) > > · CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id) > > · CREATE INDEX ga_category_and_customer_id ON GA_Data > (category_id, customer_id) > > · CREATE INDEX ga_category_id ON GA_Data (category_id) > > 3. Run Explain on the following queries while trying forcing using > different indexes > > · Select count(*) FROM( > > Select customer_id from GA_DATA use index (ga_category_id) > > where category_id in (117930, 175930, 175940,175945,101450) > > group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > > > > · Select count(*) FROM( > > Select customer_id from GA_DATA ga use index (ga_pKey) > > join table(category_id int = ( 117930, 175930, 175940,175945,101450)) > cats on cats.category_id = ga.category_id > > group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 > > ) > > > > The execution plans will be similar to what I have posted earler. In > particular, only on of (a) affinty key index, (b) category_id index will be > used. > > > > On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> > wrote: > > Hello! > > > > Can you share a reproducer project which loads (or generates) data for > caches and then queries them? I could try and debug it if I had the > reproducer. > > > > Regards. > > -- > > Ilya Kasnacheev > > > > > > чт, 20 сент. 2018 г. в 21:05, eugene miretsky <eugene.miret...@gmail.com>: > > Thanks Ilya, > > > > Tried it, no luck. It performs the same as when using category_id index > alone (slow). > > Any combindation I try either uses AFFINITY_KEY or category index. When > it uses category index it runs slowers. > > > > Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query > parallelism settings ) when category_id is used, the jobs runs on one > thread most of the time (first few seconds it looks like more threads are > doing work). > > > > Please help on this. It seems like a very simple use case (using affinity > key and another index), either I am doing something extremly silly, or I > stumbled on a bug in Ignite that's effecting a lot of people. > > > > Cheers, > > Eugene > > > > On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> > wrote: > > Hello! > > > > > 2) ga_customer_and_category_id: on customer_id and category_id > > > > Have you tried to do an index on category_id first, customer_id second? > Note that Ignite will use only one index when joining two tables and that > in your case it should start with category_id. > > > > You can also try adding affinity key to this index in various places, see > if it helps further. > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > ср, 19 сент. 2018 г. в 21:27, eugene miretsky <eugene.miret...@gmail.com>: > > Hi Ilya, > > > > I created 4 indexs on the table: > > 1) ga_pKey: on customer_id, dt, category_id (that's our primary key > columns) > > 2) ga_customer_and_category_id: on customer_id and category_id > > 2) ga_customer_id: on customer_id > > 4) ga_category_id: on category_id > > > > > > For the first query (category in ()), the execution plan when using the > first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */ > > When using #4 (alone or in combination with any of the other 3) > > 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /* > PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, > 101450) */ > > 2. The query runs slower. > > For the second query (join on an inlined table) the behaviour is very > similar. Using the first 3 indexes results in the same plan - using /* > PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID > */. > > When using #4 (alone or in combination with any of the other 3) > > 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = > CATS__Z1.CATEGORY_ID */ are used > > 2. The query is much slower. > > > > Theoretically the query seems pretty simple > > 1. Use affinity key to make sure the query runs in parallel and > there are no shuffles > > 2. Filter rows that match category_id using the category_id index > > 3. Used customer_id index for the group_by (not sure if this step > makes sense) > > But I cannot get it to work. > > > > Cheers, > > Eugene > > > > > > > > > > On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev < > ilya.kasnach...@gmail.com> wrote: > > Hello! > > > > I can see you try to use _key_PK as index. If your primary key is > composite, it won't work properly for you. I recommend creating an explicit > (category_id, customer_id) index. > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > вт, 18 сент. 2018 г. в 17:47, eugene miretsky <eugene.miret...@gmail.com>: > > Hi Ilya, > > > > The different query result was my mistake - one of the categoy_ids was > duplicate, so in the query that used join, it counted rows for that > category twice. My apologies. > > > > However, we are still having an issue with query time, and the index not > being applied to category_id. Would appreciate if you could take a look. > > > > Cheers, > > Eugene > > > > On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> > wrote: > > Hello! > > > > Why don't you diff the results of those two queries, tell us what the > difference is? > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > пн, 17 сент. 2018 г. в 16:08, eugene miretsky <eugene.miret...@gmail.com>: > > Hello, > > > > Just wanted to see if anybody had time to look into this. > > > > Cheers, > > Eugene > > > > On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <eugene.miret...@gmail.com> > wrote: > > Thanks! > > > > Tried joining with an inlined table instead of IN as per the second > suggestion, and it didn't quite work. > > > > Query1: > > · Select COUNT(*) FROM( Select customer_id from GATABLE3 use > Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, > 6453) group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > > · exec time = 17s > > · *Result: 3105868* > > · Same exec time if using AFFINITY_KEY index or "_key_PK_hash or > customer_id index > > · Using an index on category_id increases the query time 33s > > Query2: > > · Select COUNT(*) FROM( Select customer_id from GATABLE3 ga use > index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, > 175930, 175940,175945,101450, 6453)) cats on cats.category_id = > ga.category_id group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > > · exec time = 38s > > · *Result: 3113921* > > · Same exec time if using AFFINITY_KEY index or "_key_PK_hash or > customer_id index or category_id index > > · Using an index on category_id doesnt change the run time > > Query plans are attached. > > > > 3 questions: > > 1. Why is the result differnt for the 2 queries - this is quite > concerning. > > 2. Why is the 2nd query taking longer > > 3. Why category_id index doesn't work in case of query 2. > > > > On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> > wrote: > > Hello! > > > > I don't think that we're able to use index with IN () clauses. Please > convert it into OR clauses. > > > > Please see > https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <andrey.mashen...@gmail.com > >: > > Hi > > > > Actually, first query uses index on affinity key which looks more > efficient than index on category_id column. > > The first query can process groups one by one and stream partial results > from map phase to reduce phase as it use sorted index lookup, > > while second query should process full dataset on map phase before pass it > for reducing. > > > > Try to use composite index (customer_id, category_id). > > > > Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build > more efficient plan when group by on collocated column is used. > > > > On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <eugene.miret...@gmail.com> > wrote: > > Hello, > > > > Schema: > > · > > PUBLIC.GATABLE2.CUSTOMER_ID > > PUBLIC.GATABLE2.DT > > PUBLIC.GATABLE2.CATEGORY_ID > > PUBLIC.GATABLE2.VERTICAL_ID > > PUBLIC.GATABLE2.SERVICE > > PUBLIC.GATABLE2.PRODUCT_VIEWS_APP > > PUBLIC.GATABLE2.PRODUCT_CLICKS_APP > > PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB > > PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB > > PUBLIC.GATABLE2.PDP_SESSIONS_APP > > PUBLIC.GATABLE2.PDP_SESSIONS_WEB > > · pkey = customer_id,dt > > · affinityKey = customer > > Query: > > · select COUNT(*) FROM( Select customer_id from GATABLE2 where > category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, > 6453) group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > > The table has 600M rows. > > At first, the query took 1m, when we added an index on category_id the > query started taking 3m. > > > > The SQL execution plan for both queries is attached. > > > > We are using a single x1.16xlarge insntace with query parallelism set to > 32 > > > > Cheers, > > Eugene > > > > > > > -- > > Best regards, > Andrey V. Mashenkov > > >