Re: Index usage on Joins
Thanks for your suggestion - I tried it but overall the query was slower using this method. The best approach I have found is to put the items from the group index directly on the table to avoid having to do the join. -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Re: Index usage on Joins
Hello! This sounds like a good application point of *enforceJoinOrder=true*. Consider: 3: jdbc:ignite:thin://localhost> !connect jdbc:ignite:thin://localhost?enforceJoinOrder=true 4: jdbc:ignite:thin://localhost> 4: jdbc:ignite:thin://localhost> EXPLAIN SELECT f.Date_key, . . . . . . . . . . . . . . . .> loc.Location_name, . . . . . . . . . . . . . . . .> SUM(f.Revenue) . . . . . . . . . . . . . . . .> . . . . . . . . . . . . . . . .> FROM DimensionProduct pr, DimensionLocation loc, FactTableRevenue f . . . . . . . . . . . . . . . .> . . . . . . . . . . . . . . . .> WHERE pr._key = f.Product_Key AND loc._key = f.Location_Key . . . . . . . . . . . . . . . .> AND f.Date_Key = 20200604 . . . . . . . . . . . . . . . .> AND pr.Product_Name = 'Product 1' . . . . . . . . . . . . . . . .> AND loc.Location_Name IN ('London', 'Paris') . . . . . . . . . . . . . . . .> . . . . . . . . . . . . . . . .> GROUP BY f.Date_Key, loc.Location_name; PLAN SELECT F__Z2.DATE_KEY AS __C0_0, LOC__Z1.LOCATION_NAME AS __C0_1, SUM(F__Z2.REVENUE) AS __C0_2 FROM PUBLIC.DIMENSIONPRODUCT PR__Z0 /* PUBLIC.IX_PRODUCT_PRODUCT_NAME: PRODUCT_NAME = 'Product 1' */ /* WHERE PR__Z0.PRODUCT_NAME = 'Product 1' */ INNER JOIN PUBLIC.DIMENSIONLOCATION LOC__Z1 /* PUBLIC.IX_LOCATION_LOCATION_NAME: LOCATION_NAME IN('London', 'Paris') */ ON 1=1 /* WHERE LOC__Z1.LOCATION_NAME IN('London', 'Paris') */ INNER JOIN PUBLIC.FACTTABLEREVENUE F__Z2 /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 AND PRODUCT_KEY = PR__Z0._KEY AND LOCATION_KEY = LOC__Z1._KEY */ ON 1=1 WHERE (LOC__Z1.LOCATION_NAME IN('London', 'Paris')) AND ((PR__Z0.PRODUCT_NAME = 'Product 1') AND ((F__Z2.DATE_KEY = 20200604) AND ((PR__Z0._KEY = F__Z2.PRODUCT_KEY) AND (LOC__Z1._KEY = F__Z2.LOCATION_KEY GROUP BY F__Z2.DATE_KEY, LOC__Z1.LOCATION_NAME PLAN SELECT __C0_0 AS DATE_KEY, __C0_1 AS LOCATION_NAME, CAST(CAST(SUM(__C0_2) AS DOUBLE) AS DOUBLE) AS __C0_2 FROM PUBLIC.__T0 /* PUBLIC."merge_scan" */ GROUP BY __C0_0, __C0_1 2 rows selected (0,011 seconds) Is this what you wanted? First we filter pr and loc by varchar, then join this small result to facts using secondary index. Regards, -- Ilya Kasnacheev чт, 4 июн. 2020 г. в 16:58, njcstreet : > Hi, > > I am evaluating Ignite for a data warehouse style system which would have a > central very large "fact" table with potentially billions of records, and > several "dimensions" that describe the data. The fact table would be > partitioned as it is large, and the dimensions would be replicated across > all nodes. I am using the latest version 2.8. > > My question is about index usage and joins. I need to join between the fact > table (which has the numerical transaction values), and the dimensions > which > describe the data (such as product / location). However it seems that > indexes on the fact table won't be used when joining. I understand that you > can only use one index per table in a query, so I was hoping to use a group > index for the query against the fact table, since there are a few > attributes > that users will always filter on. Here is an example schema (heavily > simplified and with little data, but enough to demonstrate that the Explain > plan is not using the index on the Fact table) > > CREATE TABLE IF NOT EXISTS FactTableRevenue ( > > id int PRIMARY KEY, > > date_key int, > product_key int, > location_key int, > revenue float > > ) WITH "template=partitioned,backups=0"; > > > > CREATE TABLE IF NOT EXISTS DimensionProduct ( > > id int PRIMARY KEY, > product_name varchar > > ) WITH "TEMPLATE=REPLICATED"; > > > > CREATE TABLE IF NOT EXISTS DimensionLocation ( > > id int PRIMARY KEY, > location_name varchar > > )WITH "TEMPLATE=REPLICATED"; > > > > CREATE INDEX ix_product_product_name ON DimensionProduct(product_name); > CREATE INDEX ix_location_location_name ON DimensionLocation(location_name); > CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key, > product_key, location_key); > > > INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1'); > INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2'); > INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3'); > > INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London'); > INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris'); > INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York'); > > INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, > revenue) VALUES > (1, 20200604, 1, 1, 500); > > INSERT INTO FactTableRevenue (id
Index usage on Joins
Hi, I am evaluating Ignite for a data warehouse style system which would have a central very large "fact" table with potentially billions of records, and several "dimensions" that describe the data. The fact table would be partitioned as it is large, and the dimensions would be replicated across all nodes. I am using the latest version 2.8. My question is about index usage and joins. I need to join between the fact table (which has the numerical transaction values), and the dimensions which describe the data (such as product / location). However it seems that indexes on the fact table won't be used when joining. I understand that you can only use one index per table in a query, so I was hoping to use a group index for the query against the fact table, since there are a few attributes that users will always filter on. Here is an example schema (heavily simplified and with little data, but enough to demonstrate that the Explain plan is not using the index on the Fact table) CREATE TABLE IF NOT EXISTS FactTableRevenue ( id int PRIMARY KEY, date_key int, product_key int, location_key int, revenue float ) WITH "template=partitioned,backups=0"; CREATE TABLE IF NOT EXISTS DimensionProduct ( id int PRIMARY KEY, product_name varchar ) WITH "TEMPLATE=REPLICATED"; CREATE TABLE IF NOT EXISTS DimensionLocation ( id int PRIMARY KEY, location_name varchar )WITH "TEMPLATE=REPLICATED"; CREATE INDEX ix_product_product_name ON DimensionProduct(product_name); CREATE INDEX ix_location_location_name ON DimensionLocation(location_name); CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key, product_key, location_key); INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1'); INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2'); INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3'); INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London'); INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris'); INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York'); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (1, 20200604, 1, 1, 500); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (2, 20200604, 1, 2, 700); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (3, 20200604, 1, 3, 90); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (4, 20200604, 2, 1, 267); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (5, 20200604, 2, 3, 755); /* 1st example query with no joins but filling in the dimension keys up front. This utilities the full group index which is great, but not how we want to query the system as you would have to first run some queries to determine the keys */ /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: LOCATION_KEY IN(1, 2) AND DATE_KEY = 20200604 AND PRODUCT_KEY = 1 */ EXPLAIN SELECT f.Date_key, loc.Location_name, SUM(f.Revenue) FROM FactTableRevenue f INNER JOIN DimensionProduct pr ON pr._key = f.Product_Key INNER JOIN DimensionLocation loc ON loc._key = f.Location_Key WHERE f.Date_Key = 20200604 AND f.Product_key = 1 -- Filter specifically on key AND f.Location_Key IN (1, 2) -- Filter specifically on key GROUP BY f.Date_Key, loc.Location_name /* Second example - this is how we actually want to query by filtering on specific dimension attributes. Here, only the date_key is used in the index because there is no join on it. But product_key and location_key are not included in the index usage because they are in joins */ -- key which is directly on Fact table /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 */ EXPLAIN SELECT f.Date_key, loc.Location_name, SUM(f.Revenue) FROM FactTableRevenue f INNER JOIN DimensionProduct pr ON pr._key = f.Product_Key INNER JOIN DimensionLocation loc ON loc._key = f.Location_Key WHERE f.Date_Key = 20200604 AND pr.Product_Name = 'Product 1' AND loc.Location_Name IN ('London', 'Paris') GROUP BY f.Date_Key, loc.Location_name Now I know that I could put product_name and location_name directly on the fact table, however these are not the only attributes that the user may want to filter on. Also it will result in a much higher data usage because of having to store strings across billions of rows rather than integers. Thanks very much for your help! -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/