Re: Tiles & Lattice HOWTO

Marc Prud'hommeaux Fri, 08 May 2015 17:35:16 -0700

Julian-

Thanks very much for the detailed explanation! It gives me enough to go on to 
experiment with using it for my purposes (ad-hoc generation of lattice 
structures).


Regarding the CARDINALITY_MAP, it might be less of a hack if it was just moved 
into the model; i.e., have a CARDINALITY_MAP array in 
hsqldb-foodmart-lattice-model.json rather than hardcoded into Lattice.java. 
Also, as we discussed a few years ago at 
<http://julianhyde.blogspot.com/2012/04/how-should-mondrian-get-table-and.html>,
 we might get some mileage from using DatabaseMetadata.getIndexInfo for 
cardinality estimates.

        -Marc


> On May 4, 2015, at 1:30 PM, Julian Hyde <[email protected]> wrote:
> 
> A lattice represents a star (or snowflake) schema, not a general
> schema. In particular, all relationships must be many-to-one, heading
> from a fact table at the center of the star.
> 
> The lattice definition uses a SQL statement to represent the star. SQL
> is a useful short-hand to represent several tables joined together,
> and assigning aliases to the column names (it more convenient than
> inventing a new language to represent relationships, join conditions
> and cardinalities).
> 
> Unlike regular SQL, order is important. If you put A before B in the
> FROM clause, and make a join between A and B, you are saying that
> there is a many-to-one foreign key relationship from A to B. (E.g. in
> the example lattice, the Sales fact table occurs before the Time
> dimension table, and before the Product dimension table. The Product
> dimension table occurs before the ProductClass outer dimension table,
> further down an arm of a snowflake.)
> 
> A lattice implies constraints. In the A to B relationship, there is a
> foreign key on A (i.e. every value of A's foreign key has a
> corresponding value in B's key), and a unique key on B (i.e. no key
> value occurs more than once). These constraints are really important,
> because it allows the planner to remove joins to tables whose columns
> are not being used, and know that the query results will not change.
> 
> Calcite does not check these constraints. If they are violated,
> Calcite will return wrong results.
> 
> A lattice is a big, virtual join view. It is not materialized (it
> would be several times larger than the star schema, because of
> denormalization) and you probably wouldn't want to query it (far too
> many columns). So what is it useful for? As we said above, (a) the
> lattice declares some very useful primary and foreign key constraints,
> (b) it helps the query planner map user queries onto
> filter-join-aggregate materialized views (the most useful kind of
> materialized view for DW queries), (c) gives Calcite a framework
> within which to gather stats about data volumes and user queries, (d)
> allows Calcite to automatically design and populate materialized
> views.
> 
> Most star schema models force you to choose whether a column is a
> dimension or a measure. In a lattice, every column is a dimension
> column. (I.e. it can become one of the columns in the GROUP BY clause
> to query the star schema at a particular dimensionality). Any column
> can also be used in a measure; you define measures by giving the
> column and an aggregate function.
> 
> If "unit_sales" tends to be used much more often as a measure rather
> than a dimension, that's fine. Calcite's algorithm should notice that
> it is rarely aggregated, and not be inclined to create tiles that
> aggregate on it. (By "should" I mean "could and one day will". The
> algorithm does not currently take query history into account when
> designing tiles.)
> 
> But someone might want to know whether orders with fewer than 5 items
> were more or less profitable than orders with more than 100. All of a
> sudden, "unit_sales" has become a dimension. If there's virtually zero
> cost to declaring a column a dimension column, I figured let's make
> them all dimension columns.
> 
> The model allows for a particular table to be used more than once,
> with a different table alias. You could use this to model say
> OrderDate and ShipDate, with two uses to the Time dimension table.
> 
> Most SQL systems require that the column names in a view are unique.
> This is hard to achieve in a lattice, because you often include
> primary and foreign key columns in a join. So Calcite lets you refer
> to columns in two ways. If the column is unique, you can use its name,
> ["unit_sales"]. Whether or not it is unique in the lattice, it will be
> unique in its table, so you can use it qualified by its table alias.
> Examples: ["sales", "unit_says", ["ship_date", "time_id"],
> ["order_date", "time_id"].
> 
> A "tile" is a materialized table in a lattice, with a particular
> dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute
> of the lattice JSON element (JsonLattice) defines an initial set of
> tiles to materialize.
> 
> If you run the algorithm, you can omit the tiles attribute. Calcite
> will choose an initial set. If you include the tiles attribute, the
> algorithm will start with that list and then start finding other tiles
> that are complementary (i.e. "fill in the gaps" left by the initial
> tiles).
> 
> Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to
> have to compute column cardinalities by executing a  "count(distinct
> ...)" query for each column. We need to introduce a
> CardinalityProvider SPI, with a default implementation that generates
> SQL, but allows people to use more efficient means, e.g. database
> stats or HyperLogLog. Column cardinalities (and joint cardinalities
> such as "count(distinct state, city, zipcode)") are really important
> to the algorithm. If computing cardinalities is slow the algorithm
> will be slow, and if they are inaccurate the algorithm will produce
> sub-optimal results.
> 
> I should have written this all down in a blog post a long while ago.
> Sorry I didn't. Now I have.
> 
> Julian
> 
> 
> 
> On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <[email protected]> wrote:
>> 
>> Can anyone point me to a short summary of how to construct and use the 
>> "tiles" model parameter in Calcite? The slides at 
>> http://www.slideshare.net/julianhyde/calcite-stratany2014 
>> <http://www.slideshare.net/julianhyde/calcite-stratany2014> are intriguing, 
>> but I can't find any video of the presentation.
>> 
>> And does it require a specific pre-existing star schema, or can it be used 
>> with any arbitrary schema? Skimming 
>> core/src/main/java/org/apache/calcite/materialize/Lattice.java it looks like 
>> the CARDINALITY_MAP constant is hardwired to the foodmart database.
>>

Re: Tiles & Lattice HOWTO

Reply via email to