Good question - thanks for asking.
The short answer is that measures that span tables work, but they aren’t quite
as convenient as measures in a single table.
When designing Measures in SQL I knew that measures sometimes span tables -
star schemas are a case in point - but decided to make measures belong to one
table for simplicity. This allowed me to put them into the same namespace as
columns, so I can write ’select e.avg_sal from emp as e’. (Measures are not
columns, but SQL semantic analyzers already know how to look up a column within
the scope of a table.)
My goal was to allow dimensional expressions, such as revenue-this-year minus
revenue-last-year, so we need to know what are the dimensions that affect the
value of a measure. By associating a measure with a table we can say that the
(non-measure) columns of that table are its dimensions.
One case of 'measures spanning tables’ is joining a table with a measure to a
regular table. For example, suppose the orders table has an avg_shipping_cost
measure, and we write this query:
select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
from orders as o
join orderItems as i on o.id <http://o.id/> = i.orderId
group by o.zipcode
If I’d written avg(o.shipping_cost), orders with many items would be weighted
more heavily in the total than orders with few items. But measures are ‘locked’
to the grain of their table.
This is a crucial property of measures. It allows me to create a wide join view
over a star schema - say of orders, order-items, customers and products - and
the measures in that view will be well-behaved. Users get the benefits of the
join without having to remember the right join conditions.
By the way, if I really want a weighted average shipping, I can unpack convert
the shipping-cost measure into a value, join it to the order-items, and then
re-package it as a measure again:
select o.zipcode, avg(o.avg_shipping_cost) as measure
weighted_avg_shipping_cost,
count(distinct i.color)
from orders as o
join orderItems as i on o.id <http://o.id/> = i.orderId
group by o.zipcode
Now, another case of ‘measures spanning tables’ is common dimensions (also
known as conformed dimensions). Let’s suppose I have an orders table and an
inventory table, both with measures, and both with a date column (orderDate and
inventoryDate) and a reference to a product. I can write a query that joins
these tables:
select o.sum_revenue, v.total_on_hand
from orders as o
join inventory as v
on o.orderDate = v.inventoryDate
and o.productId = v.productId
where o.color = ‘Red'
This is a many-to-many join — like joining two cubes in an OLAP system — but
the measures’ grain-locking ensures prevents double-counting.
Notice that I have had to manually put the common dimensions into a join
condition. There is nothing in the definition of the orders.sum_revenue and
inventory.total_on_hand measures that connects their date dimensions. I have to
remember to make that connection.
Where to store those definitions? I don’t know. I don’t think it would be a
good idea to build a view on this many-to-many join (too many columns, too much
complexity).
One idea is to store these common dimensions in the semantic layer. Perhaps a
'semantic layer' should have entities like ‘product’, ‘date’, ‘customer’ and
record which columns on the ‘cube’ tables connect to each entity. And then the
semantic layer can help people to write SQL. In my opinion, measures are not
really the semantic layer — they are already there, as part of the tables, but
their existence means that the semantic layer doesn’t have to worry about grain
and double-counting, and so can generate simpler SQL.
Julian
> On Sep 30, 2024, at 3:00 PM, Barak Alon <[email protected]> wrote:
>
> Hey there -
>
> I've found my way here via Measures in SQL
> <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long time
> admirer of Calcite, but I'm not very familiar with internals - apologies if
> I stumble.
>
> I work on Airbnb's Minerva
> <https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70>
> project,
> an internal semantic layer for star schemas. Measures in SQL is a
> wonderfully elegant idea, and I've started exploring if we can introduce
> them to our system.
>
> However, our users often write queries that span multiple tables/stars,
> requiring a drill across operation. Our system handles this by exposing a
> virtual table that pretends as if our entire semantic layer is a single,
> wide fact table. Incompatible measure/dimension combinations result in
> runtime errors.
>
> This need feels pretty common and a logical extension to some of Measures'
> strengths - making SQL more powerful and expressive for "top down" queries.
>
> I'm curious - are there plans to support measures that span multiple tables
> in Calcite?
>
> - Barak