Hi, I am evaluating Ignite for a data warehouse style system which would have a central very large "fact" table with potentially billions of records, and several "dimensions" that describe the data. The fact table would be partitioned as it is large, and the dimensions would be replicated across all nodes. I am using the latest version 2.8.
My question is about index usage and joins. I need to join between the fact table (which has the numerical transaction values), and the dimensions which describe the data (such as product / location). However it seems that indexes on the fact table won't be used when joining. I understand that you can only use one index per table in a query, so I was hoping to use a group index for the query against the fact table, since there are a few attributes that users will always filter on. Here is an example schema (heavily simplified and with little data, but enough to demonstrate that the Explain plan is not using the index on the Fact table) CREATE TABLE IF NOT EXISTS FactTableRevenue ( id int PRIMARY KEY, date_key int, product_key int, location_key int, revenue float ) WITH "template=partitioned,backups=0"; CREATE TABLE IF NOT EXISTS DimensionProduct ( id int PRIMARY KEY, product_name varchar ) WITH "TEMPLATE=REPLICATED"; CREATE TABLE IF NOT EXISTS DimensionLocation ( id int PRIMARY KEY, location_name varchar )WITH "TEMPLATE=REPLICATED"; CREATE INDEX ix_product_product_name ON DimensionProduct(product_name); CREATE INDEX ix_location_location_name ON DimensionLocation(location_name); CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key, product_key, location_key); INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1'); INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2'); INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3'); INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London'); INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris'); INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York'); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (1, 20200604, 1, 1, 500); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (2, 20200604, 1, 2, 700); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (3, 20200604, 1, 3, 90); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (4, 20200604, 2, 1, 267); INSERT INTO FactTableRevenue (id, date_key, product_key, location_key, revenue) VALUES (5, 20200604, 2, 3, 755); /* 1st example query with no joins but filling in the dimension keys up front. This utilities the full group index which is great, but not how we want to query the system as you would have to first run some queries to determine the keys */ /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: LOCATION_KEY IN(1, 2) AND DATE_KEY = 20200604 AND PRODUCT_KEY = 1 */ EXPLAIN SELECT f.Date_key, loc.Location_name, SUM(f.Revenue) FROM FactTableRevenue f INNER JOIN DimensionProduct pr ON pr._key = f.Product_Key INNER JOIN DimensionLocation loc ON loc._key = f.Location_Key WHERE f.Date_Key = 20200604 AND f.Product_key = 1 -- Filter specifically on key AND f.Location_Key IN (1, 2) -- Filter specifically on key GROUP BY f.Date_Key, loc.Location_name /* Second example - this is how we actually want to query by filtering on specific dimension attributes. Here, only the date_key is used in the index because there is no join on it. But product_key and location_key are not included in the index usage because they are in joins */ -- key which is directly on Fact table /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 */ EXPLAIN SELECT f.Date_key, loc.Location_name, SUM(f.Revenue) FROM FactTableRevenue f INNER JOIN DimensionProduct pr ON pr._key = f.Product_Key INNER JOIN DimensionLocation loc ON loc._key = f.Location_Key WHERE f.Date_Key = 20200604 AND pr.Product_Name = 'Product 1' AND loc.Location_Name IN ('London', 'Paris') GROUP BY f.Date_Key, loc.Location_name Now I know that I could put product_name and location_name directly on the fact table, however these are not the only attributes that the user may want to filter on. Also it will result in a much higher data usage because of having to store strings across billions of rows rather than integers. Thanks very much for your help! -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/