Re: Index usage on Joins

2020-06-09 Thread njcstreet
Thanks for your suggestion - I tried it but overall the query was slower
using this method. The best approach I have found is to put the items from
the group index directly on the table to avoid having to do the join.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Index usage on Joins

2020-06-08 Thread Ilya Kasnacheev
Hello!

This sounds like a good application point of *enforceJoinOrder=true*.
Consider:
3: jdbc:ignite:thin://localhost> !connect
jdbc:ignite:thin://localhost?enforceJoinOrder=true
4: jdbc:ignite:thin://localhost>
4: jdbc:ignite:thin://localhost> EXPLAIN SELECT f.Date_key,
. . . . . . . . . . . . . . . .> loc.Location_name,
. . . . . . . . . . . . . . . .> SUM(f.Revenue)
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> FROM DimensionProduct pr,
DimensionLocation loc, FactTableRevenue f
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> WHERE pr._key = f.Product_Key AND loc._key
= f.Location_Key
. . . . . . . . . . . . . . . .> AND f.Date_Key = 20200604
. . . . . . . . . . . . . . . .> AND pr.Product_Name = 'Product 1'
. . . . . . . . . . . . . . . .> AND loc.Location_Name IN ('London',
'Paris')
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> GROUP BY f.Date_Key, loc.Location_name;
PLAN  SELECT
F__Z2.DATE_KEY AS __C0_0,
LOC__Z1.LOCATION_NAME AS __C0_1,
SUM(F__Z2.REVENUE) AS __C0_2
FROM PUBLIC.DIMENSIONPRODUCT PR__Z0
/* PUBLIC.IX_PRODUCT_PRODUCT_NAME: PRODUCT_NAME = 'Product 1' */
/* WHERE PR__Z0.PRODUCT_NAME = 'Product 1'
*/
INNER JOIN PUBLIC.DIMENSIONLOCATION LOC__Z1
/* PUBLIC.IX_LOCATION_LOCATION_NAME: LOCATION_NAME IN('London',
'Paris') */
ON 1=1
/* WHERE LOC__Z1.LOCATION_NAME IN('London', 'Paris')
*/
INNER JOIN PUBLIC.FACTTABLEREVENUE F__Z2
/* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604
AND PRODUCT_KEY = PR__Z0._KEY
AND LOCATION_KEY = LOC__Z1._KEY
 */
ON 1=1
WHERE (LOC__Z1.LOCATION_NAME IN('London', 'Paris'))
AND ((PR__Z0.PRODUCT_NAME = 'Product 1')
AND ((F__Z2.DATE_KEY = 20200604)
AND ((PR__Z0._KEY = F__Z2.PRODUCT_KEY)
AND (LOC__Z1._KEY = F__Z2.LOCATION_KEY
GROUP BY F__Z2.DATE_KEY, LOC__Z1.LOCATION_NAME

PLAN  SELECT
__C0_0 AS DATE_KEY,
__C0_1 AS LOCATION_NAME,
CAST(CAST(SUM(__C0_2) AS DOUBLE) AS DOUBLE) AS __C0_2
FROM PUBLIC.__T0
/* PUBLIC."merge_scan" */
GROUP BY __C0_0, __C0_1

2 rows selected (0,011 seconds)

Is this what you wanted? First we filter pr and loc by varchar, then join
this small result to facts using secondary index.

Regards,
-- 
Ilya Kasnacheev


чт, 4 июн. 2020 г. в 16:58, njcstreet :

> Hi,
>
> I am evaluating Ignite for a data warehouse style system which would have a
> central very large "fact" table with potentially billions of records, and
> several "dimensions" that describe the data. The fact table would be
> partitioned as it is large, and the dimensions would be replicated across
> all nodes. I am using the latest version 2.8.
>
> My question is about index usage and joins. I need to join between the fact
> table (which has the numerical transaction values), and the dimensions
> which
> describe the data (such as product / location). However it seems that
> indexes on the fact table won't be used when joining. I understand that you
> can only use one index per table in a query, so I was hoping to use a group
> index for the query against the fact table, since there are a few
> attributes
> that users will always filter on. Here is an example schema (heavily
> simplified and with little data, but enough to demonstrate that the Explain
> plan is not using the index on the Fact table)
>
> CREATE TABLE IF NOT EXISTS FactTableRevenue (
>
> id int PRIMARY KEY,
>
> date_key int,
> product_key int,
> location_key int,
> revenue float
>
> ) WITH "template=partitioned,backups=0";
>
>
>
> CREATE TABLE IF NOT EXISTS DimensionProduct (
>
> id int PRIMARY KEY,
> product_name varchar
>
> ) WITH "TEMPLATE=REPLICATED";
>
>
>
> CREATE TABLE IF NOT EXISTS DimensionLocation (
>
> id int PRIMARY KEY,
> location_name varchar
>
> )WITH "TEMPLATE=REPLICATED";
>
>
>
> CREATE INDEX ix_product_product_name ON DimensionProduct(product_name);
> CREATE INDEX ix_location_location_name ON DimensionLocation(location_name);
> CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key,
> product_key, location_key);
>
>
> INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1');
> INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2');
> INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3');
>
> INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London');
> INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris');
> INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York');
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (1, 20200604, 1, 1, 500);
>
> INSERT INTO FactTableRevenue (id

Index usage on Joins

2020-06-04 Thread njcstreet
Hi,

I am evaluating Ignite for a data warehouse style system which would have a
central very large "fact" table with potentially billions of records, and
several "dimensions" that describe the data. The fact table would be
partitioned as it is large, and the dimensions would be replicated across
all nodes. I am using the latest version 2.8.

My question is about index usage and joins. I need to join between the fact
table (which has the numerical transaction values), and the dimensions which
describe the data (such as product / location). However it seems that
indexes on the fact table won't be used when joining. I understand that you
can only use one index per table in a query, so I was hoping to use a group
index for the query against the fact table, since there are a few attributes
that users will always filter on. Here is an example schema (heavily
simplified and with little data, but enough to demonstrate that the Explain
plan is not using the index on the Fact table)

CREATE TABLE IF NOT EXISTS FactTableRevenue (

id int PRIMARY KEY,

date_key int,
product_key int,
location_key int,
revenue float

) WITH "template=partitioned,backups=0";

 

CREATE TABLE IF NOT EXISTS DimensionProduct (

id int PRIMARY KEY,
product_name varchar

) WITH "TEMPLATE=REPLICATED";

 

CREATE TABLE IF NOT EXISTS DimensionLocation (

id int PRIMARY KEY,
location_name varchar

)WITH "TEMPLATE=REPLICATED";

 

CREATE INDEX ix_product_product_name ON DimensionProduct(product_name);
CREATE INDEX ix_location_location_name ON DimensionLocation(location_name);
CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key,
product_key, location_key);

 
INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1');
INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2');
INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3');

INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London');
INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris');
INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York');

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(1, 20200604, 1, 1, 500);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(2, 20200604, 1, 2, 700);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(3, 20200604, 1, 3, 90);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(4, 20200604, 2, 1, 267);

INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
revenue) VALUES
(5, 20200604, 2, 3, 755);

 

/* 1st example query with no joins but filling in the dimension keys up
front. This utilities the full group index which is great, but not how we
want to query the system as you would have to first run some queries to
determine the keys */

/* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: LOCATION_KEY IN(1, 2)

AND DATE_KEY = 20200604

AND PRODUCT_KEY = 1

 */

EXPLAIN SELECT f.Date_key,

loc.Location_name,
SUM(f.Revenue)
FROM FactTableRevenue f

INNER JOIN DimensionProduct pr
ON pr._key = f.Product_Key

INNER JOIN DimensionLocation loc
ON loc._key = f.Location_Key


WHERE f.Date_Key = 20200604
AND f.Product_key = 1 -- Filter specifically on key
AND f.Location_Key IN (1, 2) -- Filter specifically on key
GROUP BY f.Date_Key, loc.Location_name

 

/*  Second example - this is how we actually want to query by filtering on
specific dimension attributes. Here, only the date_key is used in the index
because there is no join on it. But product_key and location_key are not
included in the index usage because they are in joins */

-- key which is directly on Fact table

/* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 */

EXPLAIN SELECT f.Date_key,
loc.Location_name,
SUM(f.Revenue)
FROM FactTableRevenue f

INNER JOIN DimensionProduct pr
ON pr._key = f.Product_Key

INNER JOIN DimensionLocation loc
ON loc._key = f.Location_Key

WHERE f.Date_Key = 20200604
AND pr.Product_Name = 'Product 1'
AND loc.Location_Name IN ('London', 'Paris')

GROUP BY f.Date_Key, loc.Location_name


Now I know that I could put product_name and location_name directly on the
fact table, however these are not the only attributes that the user may want
to filter on. Also it will result in a much higher data usage because of
having to store strings across billions of rows rather than integers.

Thanks very much for your help!




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/