You are correct that the term "lattice" is borrowed from mathematics,
and in particular from the seminal paper [1]. But it's just a metaphor
for a data structure, just as "cube" is.

By "schema" I mean the kind of thing one defines in ERWin - a
collection of tables connected by relationships. A "star schema" is a
well-known concept in the data warehousing world. It is a schema with
a particular structure, namely, that all relationships point
"outwards".
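To make "all relationships point outwards" concrete, here is a minimal
sketch (the table names and the is_star helper are my own invention,
not Calcite code): model the schema as a graph whose edges are
many-to-one relationships, and check that every table hangs off the
fact table with no inward-pointing edges.

```python
# Hypothetical sketch, not Calcite code: a schema as a graph in which
# each edge is a many-to-one relationship pointing "outwards" (away
# from the fact table). In a star/snowflake schema every table is
# reachable from the fact table along such edges, with no cycles.
star = {
    "sales": ["time", "product", "customer"],  # fact table at the center
    "product": ["product_class"],              # a snowflake arm
    "time": [],
    "customer": [],
    "product_class": [],
}

def is_star(schema, fact):
    seen, stack = set(), [fact]
    while stack:
        table = stack.pop()
        if table in seen:
            return False  # revisiting a table means a cyclic/inward edge
        seen.add(table)
        stack.extend(schema[table])
    return seen == set(schema)  # every table hangs off the fact table
```

A general schema fails this test as soon as some table is unreachable
from the fact table, or some relationship points back towards it.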

Your definition of a schema as a set of possible queries and the "can
be answered using" mathematical relation is a generalization of mine,
but I claim it is less useful. My lattice is defined structurally. It
only deals with filter-join-aggregate queries. Queries of this kind
are really important for interactive DW, and focusing on this kind of
query makes the recommendation & matching algorithms tractable.
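As a deliberately simplified sketch of that structural view (the
function names are mine, not Calcite's): identify each
filter-join-aggregate query over the star by its set of grouping
columns. The subset relation is then the lattice's partial order, and
"can be answered using" becomes a containment check.

```python
# Simplified sketch (my naming, not Calcite API): a query can be
# answered from any materialization whose grouping set is a superset
# of the query's -- just roll the finer result up further.
def can_answer(query_groups, tile_groups):
    return frozenset(query_groups) <= frozenset(tile_groups)

def lub(a, b):
    # least upper bound: union of grouping columns; fine enough
    # granularity to answer both a and b
    return frozenset(a) | frozenset(b)

def glb(a, b):
    # greatest lower bound: intersection of grouping columns; coarse
    # enough to be answerable from either a or b
    return frozenset(a) & frozenset(b)
```

Restricting the lattice to this one family of queries is exactly what
makes these operations cheap set arithmetic rather than general query
equivalence.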

Calcite also lets you define materialized views outside of a lattice,
and it does its best to match queries to such materialized views, but
does less well. In particular, it doesn't match join queries well. The
search space for unifying a materialized view containing an N-way join
to a query containing an N-way join is exponential in N, so it doesn't
try.

If you define a materialized view within a lattice, the structure
provided by the lattice allows join queries to be matched very
quickly.
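To illustrate why (a deliberately simplified sketch with invented
names; Calcite's actual matcher is more involved): every tile in a
lattice shares the same fixed join graph, so matching a join query to
a tile reduces to set containment on grouping columns plus choosing
the smallest qualifying tile, instead of exponential N-way join
unification.

```python
# Simplified sketch (invented names, not Calcite API): within a
# lattice the join structure is fixed, so matching a
# filter-join-aggregate query to a tile is just set containment on
# grouping columns -- no N-way join unification required.
def cheapest_matching_tile(query_groups, tiles):
    """tiles: dict of frozenset-of-grouping-columns -> estimated row count."""
    q = frozenset(query_groups)
    candidates = [(rows, groups) for groups, rows in tiles.items() if q <= groups]
    if not candidates:
        return None  # no tile can answer the query; fall back to base tables
    return min(candidates, key=lambda c: c[0])[1]  # fewest rows wins
```

The containment test is linear in the number of tiles, which is why
matching inside a lattice stays fast even for wide joins.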

Julian

[1] "Implementing data cubes efficiently", Harinarayan, Rajaraman,
Ullman, 1996.
http://web.eecs.umich.edu/~jag/eecs584/papers/implementing_data_cube.pdf

On Mon, May 4, 2015 at 2:28 PM, Ted Dunning <[email protected]> wrote:
> Julian,
>
> I think that what you are talking about is related to the mathematical
> definition of a lattice: a partially ordered set in which least upper
> bound and greatest lower bound are defined for every pair of elements.
>
> In this case, the objects in the set are aggregation queries and the
> partial ordering refers to the variables included in the group by.  You can
> take the ordering to be defined between queries A and B such that A < B
> if the variables in the group by part of A are a strict subset of the
> variables in the group by part of B.  This means that the least upper bound
> of two queries is the query with the union of all grouping variables,
> and the greatest lower bound is the query with the intersection of grouping
> variables.
>
> This is all straightforward except for bringing in a fair bit of
> terminology.
>
> I am confused by one thing you say, however, and I wonder if others are as
> well.
>
> You say "A lattice represents a star (or snowflake) schema, not a general
> schema".  If my interpretation of what you say is correct, the lattice is
> not a schema at all, but instead a set of queries that are defined over a
> schema with ordering defined as above.
>
> Is it merely a shorthand to refer to a schema as equivalent to the lattice
> composed of all the possible aggregation queries with all combinations of
> dimensions?
>
> On Mon, May 4, 2015 at 6:30 PM, Julian Hyde <[email protected]> wrote:
>
>> A lattice represents a star (or snowflake) schema, not a general
>> schema. In particular, all relationships must be many-to-one, heading
>> from a fact table at the center of the star.
>>
>> The lattice definition uses a SQL statement to represent the star. SQL
>> is a useful shorthand for representing several tables joined together
>> and assigning aliases to the column names (it is more convenient than
>> inventing a new language to represent relationships, join conditions
>> and cardinalities).
>>
>> Unlike regular SQL, order is important. If you put A before B in the
>> FROM clause, and make a join between A and B, you are saying that
>> there is a many-to-one foreign key relationship from A to B. (E.g. in
>> the example lattice, the Sales fact table occurs before the Time
>> dimension table, and before the Product dimension table. The Product
>> dimension table occurs before the ProductClass outer dimension table,
>> further down an arm of a snowflake.)
>>
>> A lattice implies constraints. In the A to B relationship, there is a
>> foreign key on A (i.e. every value of A's foreign key has a
>> corresponding value in B's key), and a unique key on B (i.e. no key
>> value occurs more than once). These constraints are really important,
>> because they allow the planner to remove joins to tables whose columns
>> are not being used, knowing that the query results will not change.
>>
>> Calcite does not check these constraints. If they are violated,
>> Calcite will return wrong results.
>>
>> A lattice is a big, virtual join view. It is not materialized (it
>> would be several times larger than the star schema, because of
>> denormalization) and you probably wouldn't want to query it (far too
>> many columns). So what is it useful for? As we said above, (a) the
>> lattice declares some very useful primary and foreign key constraints,
>> (b) it helps the query planner map user queries onto
>> filter-join-aggregate materialized views (the most useful kind of
>> materialized view for DW queries), (c) it gives Calcite a framework
>> within which to gather stats about data volumes and user queries, and
>> (d) it allows Calcite to automatically design and populate materialized
>> views.
>>
>> Most star schema models force you to choose whether a column is a
>> dimension or a measure. In a lattice, every column is a dimension
>> column. (I.e. it can become one of the columns in the GROUP BY clause
>> to query the star schema at a particular dimensionality). Any column
>> can also be used in a measure; you define measures by giving the
>> column and an aggregate function.
>>
>> If "unit_sales" tends to be used much more often as a measure than as
>> a dimension, that's fine. Calcite's algorithm should notice that
>> it is rarely aggregated, and not be inclined to create tiles that
>> aggregate on it. (By "should" I mean "could and one day will". The
>> algorithm does not currently take query history into account when
>> designing tiles.)
>>
>> But someone might want to know whether orders with fewer than 5 items
>> were more or less profitable than orders with more than 100. All of a
>> sudden, "unit_sales" has become a dimension. If there's virtually zero
>> cost to declaring a column a dimension column, I figured let's make
>> them all dimension columns.
>>
>> The model allows for a particular table to be used more than once,
>> with a different table alias. You could use this to model, say,
>> OrderDate and ShipDate, with two uses of the Time dimension table.
>>
>> Most SQL systems require that the column names in a view are unique.
>> This is hard to achieve in a lattice, because you often include
>> primary and foreign key columns in a join. So Calcite lets you refer
>> to columns in two ways. If the column is unique, you can use its name,
>> ["unit_sales"]. Whether or not it is unique in the lattice, it will be
>> unique in its table, so you can use it qualified by its table alias.
>> Examples: ["sales", "unit_sales"], ["ship_date", "time_id"],
>> ["order_date", "time_id"].
>>
>> A "tile" is a materialized table in a lattice, with a particular
>> dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute
>> of the lattice JSON element (JsonLattice) defines an initial set of
>> tiles to materialize.
>>
>> If you run the algorithm, you can omit the tiles attribute. Calcite
>> will choose an initial set. If you include the tiles attribute, the
>> algorithm will start with that list and then start finding other tiles
>> that are complementary (i.e. "fill in the gaps" left by the initial
>> tiles).
>>
>> Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to
>> have to compute column cardinalities by executing a "count(distinct
>> ...)" query for each column. We need to introduce a
>> CardinalityProvider SPI, with a default implementation that generates
>> SQL, but allows people to use more efficient means, e.g. database
>> stats or HyperLogLog. Column cardinalities (and joint cardinalities
>> such as "count(distinct state, city, zipcode)") are really important
>> to the algorithm. If computing cardinalities is slow the algorithm
>> will be slow, and if they are inaccurate the algorithm will produce
>> sub-optimal results.
>>
>> I should have written this all down in a blog post a long while ago.
>> Sorry I didn't. Now I have.
>>
>> Julian
>>
>>
>>
>> On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <[email protected]>
>> wrote:
>> >
>> > Can anyone point me to a short summary of how to construct and use the
>> > "tiles" model parameter in Calcite? The slides at
>> > http://www.slideshare.net/julianhyde/calcite-stratany2014 are
>> > intriguing, but I can't find any video of the presentation.
>> >
>> > And does it require a specific pre-existing star schema, or can it be
>> > used with any arbitrary schema? Skimming
>> > core/src/main/java/org/apache/calcite/materialize/Lattice.java it looks
>> > like the CARDINALITY_MAP constant is hardwired to the foodmart database.
>> >
>>
