Re: [Discussion] Measures that span multiple tables

Julian Hyde Sat, 05 Oct 2024 14:00:21 -0700

There are a lot of people who think that the semantic layer should
just have a simple query language - pick a few dimensions and
measures, choose a sort order, and you're done - but I think the
semantic layer should have a real query language, equivalent in
strength to SQL, and powerful enough to build models.


I don't want to introduce a proprietary language (like DAX or LOD) so
we're left with SQL or SPARQL. Some syntactic sugar on top of SQL
could find paths through the graph and convert them to joins. That's
probably good enough.

On Sat, Oct 5, 2024 at 5:23 AM Barak Alon <barak.s.a...@gmail.com> wrote:
>
> For the record - I agree that the semantic layer is probably a better place
> for this. But I would've said something similar about measures before I
> read your paper haha. So I'm pushing on this mostly out of curiosity.
>
> Agreed - this is essentially a virtual cube. And MDX (or SPARQL) might be a
> better language for this type of query - but alas, SQL is king.
>
> > There are so many columns it’s impossible to give them meaningful names.
>
> Is that right? Maybe I'm thinking of it slightly differently.
>
> Yes, something has to flatten the graph. I think that's the job of the
> semantic layer. And it's got to choose meaningful names - hard, but not
> impossible. At least if thinking from the perspective of a particular
> business domain.
>
> The semantic layer first has to flatten each cube and conform the
> dimensions:
>
> ```
> crew view orders_cube as
> select
>   SUM(o.revenue) as measure sum_revenue,
>   l.country as origin_country
> from orders o
> left join customers c
>   on o.customerId = c.id
> left join locations l
>   on c.locationId = l.id
>
>
> crew view clicks_cube as
> select
>   COUNT(DISTINCT c.cookieId) as measure unique_clickers,
>   l.country as origin_country
> from clicks c
> left join sessions s
>   on c.sessionId = s.id
> left join locations l
>   on s.locationId = l.id
>
>
> create view __all as
> select
>   coalesce(o.origin_country, c.origin_country) as origin_country,
>   o.sum_revenue,
>   c.unique_clickers,
> from orders_cube o
> full outer join clicks_cube c
>   on o.origin_country = c.origin_country
> ```
>
> Yes, this is done at DDL time. And yes, this certainly can't answer all
> types of queries. But I think it works for simple slicing and dicing on a
> virtual cube?
>
> On Fri, Oct 4, 2024 at 6:52 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>
> > Let’s review where we are.
> >  * We have several fact tables (each representing a business process -
> > taking orders, shipping orders, canceling orders, returning products,
> > storing products in a warehouse) and we have surrounded each fact table
> > with dimension tables and defined some measures in a ’star join’ view or
> > ‘cube’.
> >  * We can join several cubes (or regular tables) in a single query,
> > joining on their common dimensions. If the query has no GROUP BY the number
> > of rows will be astronomical, but if there is a GROUP BY the number of rows
> > will be reasonable, and because measures manage their own grain their
> > values will be correct.
> >
> > This is, in essence, what you can achieve with virtual cubes in MDX [
> > https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms709837(v=vs.85)
> > ], and furthermore you don’t have to do it at DDL time, you can do it at
> > query time.
> >
> > This looks like a graph: cubes connected, via its dimensions, to entities
> > (date, customer, product, warehouse, campaign). If two cubes have at least
> > one entity in common you can join them. A query would seem to be just a
> > list of attributes (of entities) and a list of measures, provided that
> > there is a path joining them.
> >
> > But we have a problem. The problem is that as the graph gets larger, it
> > begins to have (undirected) cycles: there is more than one path between
> > some pairs of entities. For example, you are interested in customers and
> > related products. It is ambiguous whether you mean customers that have
> > purchased a product, or customers who have been targeted in a marketing
> > campaign for a product.
> >
> > That’s the problem with Barak’s ‘all’ table which represents a
> > flattened-graph. It’s not the large number of columns (the union of columns
> > from all the cubes and entities). Nor is it the colossal number of rows
> > (effectively the cartesian product of all entities). It’s that in order to
> > flatten the graph you need to repeat each entity several times, once for
> > each path. For a given product there are not just customers; there are
> > order-customers, and campaign-customers, and product-return-customers.
> > There are so many columns it’s impossible to give them meaningful names.
> >
> > To disambiguate, use measures. “Show me all customers and products for
> > which campaignClicks was greater than 3 in 2023 and orderCount was greater
> > than zero in 2023 or 2024.” The measures belong to cubes, and therefore the
> > join path between the entities (dimension tables) must pass through their
> > fact tables.
> >
> > I don’t think that preceding query can be expressed in SQL. Even if the
> > cubes, entities, measures etc. can all be represented as tables, SQL (quite
> > rightly) requires joins to be explicitly stated. I think there is a
> > higher-level “graph” query language which can resolve join paths and spit
> > out a SQL query with explicit joins. (Let’s start talking about adding
> > measures to SPARQL.)
> >
> > As for query-planning. Predicates on measures are a generalization of semi
> > joins. In fact “orderCount > 0” is exactly the same as “where exists
> > (select * from orders where pid = products.pid and cid = customers.cid)”.
> > So a query that says “cartesian product of all products and all customers,
> > with the side condition that orderCount > 0” becomes a join graph.
> >
> > The join graph may not have a tight tree structure - so techniques for
> > planning and executing “bushy trees” [1] may be helpful.
> >
> > Julian
> >
> >
> > [1] https://www.csd.uoc.gr/~hy460/pdf/Left-deep_vs_bushy_trees.pdf
> >
> >
> > > On Oct 2, 2024, at 6:52 AM, Barak Alon <barak.s.a...@gmail.com> wrote:
> > >
> > > Got it, that makes a lot of sense.
> > >
> > > Common dimensions is the use case I was thinking of.
> > >
> > > Our semantic layer exposes a SQL interface that allows for queries
> > similar
> > > to this:
> > >
> > > ```
> > > select
> > >  productId,
> > >  sum_revenue,
> > >  total_on_hand
> > > from __all
> > > where color = 'Red'
> > > group by productId
> > > ```
> > >
> > > It knows that sum_revenue comes from orders, total_on_hand comes from
> > > inventory, and that productId and color are common dimensions. So it can
> > > rewrite this query as:
> > >
> > > ```
> > > select
> > >  coalesce(o.productId, v.productId) as productId,
> > >  o.sum_revenue,
> > >  v.total_on_hand
> > > from (
> > >  select
> > >    productId,
> > >    sum_revenue
> > >  from orders
> > >  where color = 'Red'
> > >  group by productId
> > > ) as o
> > > full outer join (
> > >  select
> > >    productId,
> > >    total_on_hand
> > >  from inventory
> > >  where color = 'Red'
> > >  group by productId
> > > ) as v
> > >  on o.productId = v.productId
> > > ```
> > >
> > > So you could define measures on this __all abomination - say,
> > > revenue_per_total = sum_revenue / total_on_hand.
> > >
> > > This has a lot of similar benefits proposed in the paper. For example,
> > this
> > > is an even easier target for generative AI - the model doesn't need to
> > know
> > > how to join cubes.
> > >
> > > re: your statement "I don’t think it would be a good idea to build a view
> > > on this many-to-many join (too many columns, too much complexity)."
> > >
> > > I want to dig into what such a view would look like. I'm imagining:
> > >
> > > ```
> > > create view __all as
> > > select
> > >  coalesce(o.productId, v.productId) as productId,
> > >  coalesce(o.color, v.color) as color,
> > >  -- ... many more
> > >
> > >  o.sum_revenue,
> > >  v.total_on_hand,
> > >  o.sum_revenue / v.total_on_hand as measure revenue_per_total,  -- a
> > > measure that spans multiple tables
> > >  -- ... many more
> > >
> > > from o
> > > full outer join v
> > >  on o.productId = v.productId
> > >  and o.color = v.color
> > >  -- ... many more
> > >
> > > -- .. many more
> > > ```
> > >
> > > A lot of challenges here:
> > > Efficiently dealing with so many columns?
> > > Is it possible for an optimizer rule to prune the unused join conditions?
> > > Is there sleeker syntax that would make defining such a view more
> > tenable?
> > >
> > > Does such complexity belong as a fundamental concept like Measures or in
> > > some semantic layer with custom rewrite rules? I don't know. But I do
> > think
> > > there is value in such an abstraction, and it would be cool if it was
> > > standardized.
> > >
> > > On Tue, Oct 1, 2024 at 6:36 PM Julian Hyde <jhyde.apa...@gmail.com
> > <mailto:jhyde.apa...@gmail.com>> wrote:
> > >
> > >> Good question - thanks for asking.
> > >>
> > >> The short answer is that measures that span tables work, but they aren’t
> > >> quite as convenient as measures in a single table.
> > >>
> > >> When designing Measures in SQL I knew that measures sometimes span
> > tables
> > >> - star schemas are a case in point - but decided to make measures
> > belong to
> > >> one table for simplicity. This allowed me to put them into the same
> > >> namespace as columns, so I can write ’select e.avg_sal from emp as e’.
> > >> (Measures are not columns, but SQL semantic analyzers already know how
> > to
> > >> look up a column within the scope of a table.)
> > >>
> > >> My goal was to allow dimensional expressions, such as revenue-this-year
> > >> minus revenue-last-year, so we need to know what are the dimensions that
> > >> affect the value of a measure. By associating a measure with a table we
> > can
> > >> say that the (non-measure) columns of that table are its dimensions.
> > >>
> > >> One case of 'measures spanning tables’ is joining a table with a measure
> > >> to a regular table. For example, suppose the orders table has an
> > >> avg_shipping_cost measure, and we write this query:
> > >>
> > >>  select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
> > >>  from orders as o
> > >>    join orderItems as i on o.id <http://o.id/> <http://o.id/> =
> > i.orderId
> > >>  group by o.zipcode
> > >>
> > >> If I’d written avg(o.shipping_cost), orders with many items would be
> > >> weighted more heavily in the total than orders with few items. But
> > measures
> > >> are ‘locked’ to the grain of their table.
> > >>
> > >> This is a crucial property of measures. It allows me to create a wide
> > join
> > >> view over a star schema - say of orders, order-items, customers and
> > >> products - and the measures in that view will be well-behaved. Users get
> > >> the benefits of the join without having to remember the right join
> > >> conditions.
> > >>
> > >> By the way, if I really want a weighted average shipping, I can unpack
> > >> convert the shipping-cost measure into a value, join it to the
> > order-items,
> > >> and then re-package it as a measure again:
> > >>
> > >>  select o.zipcode, avg(o.avg_shipping_cost) as measure
> > >> weighted_avg_shipping_cost,
> > >>      count(distinct i.color)
> > >>  from orders as o
> > >>    join orderItems as i on o.id <http://o.id/> <http://o.id/> =
> > i.orderId
> > >>  group by o.zipcode
> > >>
> > >> Now, another case of ‘measures spanning tables’ is common dimensions
> > (also
> > >> known as conformed dimensions). Let’s suppose I have an orders table
> > and an
> > >> inventory table, both with measures, and both with a date column
> > (orderDate
> > >> and inventoryDate) and a reference to a product. I can write a query
> > that
> > >> joins these tables:
> > >>
> > >>  select o.sum_revenue, v.total_on_hand
> > >>  from orders as o
> > >>    join inventory as v
> > >>    on o.orderDate = v.inventoryDate
> > >>    and  o.productId = v.productId
> > >>  where o.color = ‘Red'
> > >>
> > >> This is a many-to-many join — like joining two cubes in an OLAP system —
> > >> but the measures’ grain-locking ensures prevents double-counting.
> > >>
> > >> Notice that I have had to manually put the common dimensions into a join
> > >> condition. There is nothing in the definition of the orders.sum_revenue
> > and
> > >> inventory.total_on_hand measures that connects their date dimensions. I
> > >> have to remember to make that connection.
> > >>
> > >> Where to store those definitions? I don’t know. I don’t think it would
> > be
> > >> a good idea to build a view on this many-to-many join (too many columns,
> > >> too much complexity).
> > >>
> > >> One idea is to store these common dimensions in the semantic layer.
> > >> Perhaps a 'semantic layer' should have entities like ‘product’, ‘date’,
> > >> ‘customer’ and record which columns on the ‘cube’ tables connect to each
> > >> entity. And then the semantic layer can help people to write SQL. In my
> > >> opinion, measures are not really the semantic layer — they are already
> > >> there, as part of the tables, but their existence means that the
> > semantic
> > >> layer doesn’t have to worry about grain and double-counting, and so can
> > >> generate simpler SQL.
> > >>
> > >> Julian
> > >>
> > >>
> > >>
> > >>> On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com>
> > wrote:
> > >>>
> > >>> Hey there -
> > >>>
> > >>> I've found my way here via Measures in SQL
> > >>> <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long time
> > >>> admirer of Calcite, but I'm not very familiar with internals -
> > apologies
> > >> if
> > >>> I stumble.
> > >>>
> > >>> I work on Airbnb's Minerva
> > >>> <
> > >>
> > https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70
> > >>>
> > >>> project,
> > >>> an internal semantic layer for star schemas. Measures in SQL is a
> > >>> wonderfully elegant idea, and I've started exploring if we can
> > introduce
> > >>> them to our system.
> > >>>
> > >>> However, our users often write queries that span multiple tables/stars,
> > >>> requiring a drill across operation. Our system handles this by
> > exposing a
> > >>> virtual table that pretends as if our entire semantic layer is a
> > single,
> > >>> wide fact table. Incompatible measure/dimension combinations result in
> > >>> runtime errors.
> > >>>
> > >>> This need feels pretty common and a logical extension to some of
> > >> Measures'
> > >>> strengths - making SQL more powerful and expressive for "top down"
> > >> queries.
> > >>>
> > >>> I'm curious - are there plans to support measures that span multiple
> > >> tables
> > >>> in Calcite?
> > >>>
> > >>> - Barak
> >
> >

Re: [Discussion] Measures that span multiple tables

Reply via email to