Hi,

I am evaluating Ignite for a data warehouse style system which would have a
central very large "fact" table with potentially billions of records, and
several "dimensions" that describe the data. The fact table would be
partitioned as it is large, and the dimensions would be replicated across
all nodes. I am using the latest version 2.8.

My question is about index usage and joins. I need to join between the fact
table (which has the numerical transaction values), and the dimensions which
describe the data (such as product / location). However it seems that
indexes on the fact table won't be used when joining. I understand that you
can only use one index per table in a query, so I was hoping to use a group
index for the query against the fact table, since there are a few attributes
that users will always filter on. Here is an example schema (heavily
simplified and with little data, but enough to demonstrate that the Explain
plan is not using the index on the Fact table)

CREATE TABLE IF NOT EXISTS FactTableRevenue (

id int PRIMARY KEY,

date_key int,
product_key int,
location_key int,
revenue float

) WITH "template=partitioned,backups=0";

 

CREATE TABLE IF NOT EXISTS DimensionProduct (

id int PRIMARY KEY,
product_name varchar

) WITH "TEMPLATE=REPLICATED";

 

CREATE TABLE IF NOT EXISTS DimensionLocation (

id int PRIMARY KEY,
location_name varchar

)WITH "TEMPLATE=REPLICATED";

 

CREATE INDEX ix_product_product_name ON DimensionProduct(product_name);
CREATE INDEX ix_location_location_name ON DimensionLocation(location_name);
CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key,
product_key, location_key);

 
INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1');
INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2');
INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3');

INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London');
INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris');
INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York');

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(1, 20200604, 1, 1, 500);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(2, 20200604, 1, 2, 700);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(3, 20200604, 1, 3, 90);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(4, 20200604, 2, 1, 267);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(5, 20200604, 2, 3, 755);

 

/* 1st example query with no joins but filling in the dimension keys up
front. This utilities the full group index which is great, but not how we
want to query the system as you would have to first run some queries to
determine the keys */

/* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: LOCATION_KEY IN(1, 2)

        AND DATE_KEY = 20200604

        AND PRODUCT_KEY = 1

     */

EXPLAIN SELECT f.Date_key,

loc.Location_name,
SUM(f.Revenue)
FROM FactTableRevenue f

INNER JOIN DimensionProduct pr
ON pr._key = f.Product_Key

INNER JOIN DimensionLocation loc
ON loc._key = f.Location_Key


WHERE f.Date_Key = 20200604
AND f.Product_key = 1 -- Filter specifically on key
AND f.Location_Key IN (1, 2) -- Filter specifically on key
GROUP BY f.Date_Key, loc.Location_name

 

/*  Second example - this is how we actually want to query by filtering on
specific dimension attributes. Here, only the date_key is used in the index
because there is no join on it. But product_key and location_key are not
included in the index usage because they are in joins */

-- key which is directly on Fact table

/* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 */

EXPLAIN SELECT f.Date_key,
loc.Location_name,
SUM(f.Revenue)
FROM FactTableRevenue f

INNER JOIN DimensionProduct pr
ON pr._key = f.Product_Key

INNER JOIN DimensionLocation loc
ON loc._key = f.Location_Key

WHERE f.Date_Key = 20200604
AND pr.Product_Name = 'Product 1'
AND loc.Location_Name IN ('London', 'Paris')

GROUP BY f.Date_Key, loc.Location_name


Now I know that I could put product_name and location_name directly on the
fact table, however these are not the only attributes that the user may want
to filter on. Also it will result in a much higher data usage because of
having to store strings across billions of rows rather than integers.

Thanks very much for your help!




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Reply via email to