Julian- Thanks very much for the detailed explanation! It gives me enough to go on to experiment with using it for my purposes (ad-hoc generation of lattice structures).
Regarding the CARDINALITY_MAP, it might be less of a hack if it was just moved into the model; i.e., have a CARDINALITY_MAP array in hsqldb-foodmart-lattice-model.json rather than hardcoded into Lattice.java. Also, as we discussed a few years ago at <http://julianhyde.blogspot.com/2012/04/how-should-mondrian-get-table-and.html>, we might get some mileage from using DatabaseMetadata.getIndexInfo for cardinality estimates. -Marc > On May 4, 2015, at 1:30 PM, Julian Hyde <[email protected]> wrote: > > A lattice represents a star (or snowflake) schema, not a general > schema. In particular, all relationships must be many-to-one, heading > from a fact table at the center of the star. > > The lattice definition uses a SQL statement to represent the star. SQL > is a useful short-hand to represent several tables joined together, > and assigning aliases to the column names (it more convenient than > inventing a new language to represent relationships, join conditions > and cardinalities). > > Unlike regular SQL, order is important. If you put A before B in the > FROM clause, and make a join between A and B, you are saying that > there is a many-to-one foreign key relationship from A to B. (E.g. in > the example lattice, the Sales fact table occurs before the Time > dimension table, and before the Product dimension table. The Product > dimension table occurs before the ProductClass outer dimension table, > further down an arm of a snowflake.) > > A lattice implies constraints. In the A to B relationship, there is a > foreign key on A (i.e. every value of A's foreign key has a > corresponding value in B's key), and a unique key on B (i.e. no key > value occurs more than once). These constraints are really important, > because it allows the planner to remove joins to tables whose columns > are not being used, and know that the query results will not change. > > Calcite does not check these constraints. If they are violated, > Calcite will return wrong results. > > A lattice is a big, virtual join view. It is not materialized (it > would be several times larger than the star schema, because of > denormalization) and you probably wouldn't want to query it (far too > many columns). So what is it useful for? As we said above, (a) the > lattice declares some very useful primary and foreign key constraints, > (b) it helps the query planner map user queries onto > filter-join-aggregate materialized views (the most useful kind of > materialized view for DW queries), (c) gives Calcite a framework > within which to gather stats about data volumes and user queries, (d) > allows Calcite to automatically design and populate materialized > views. > > Most star schema models force you to choose whether a column is a > dimension or a measure. In a lattice, every column is a dimension > column. (I.e. it can become one of the columns in the GROUP BY clause > to query the star schema at a particular dimensionality). Any column > can also be used in a measure; you define measures by giving the > column and an aggregate function. > > If "unit_sales" tends to be used much more often as a measure rather > than a dimension, that's fine. Calcite's algorithm should notice that > it is rarely aggregated, and not be inclined to create tiles that > aggregate on it. (By "should" I mean "could and one day will". The > algorithm does not currently take query history into account when > designing tiles.) > > But someone might want to know whether orders with fewer than 5 items > were more or less profitable than orders with more than 100. All of a > sudden, "unit_sales" has become a dimension. If there's virtually zero > cost to declaring a column a dimension column, I figured let's make > them all dimension columns. > > The model allows for a particular table to be used more than once, > with a different table alias. You could use this to model say > OrderDate and ShipDate, with two uses to the Time dimension table. > > Most SQL systems require that the column names in a view are unique. > This is hard to achieve in a lattice, because you often include > primary and foreign key columns in a join. So Calcite lets you refer > to columns in two ways. If the column is unique, you can use its name, > ["unit_sales"]. Whether or not it is unique in the lattice, it will be > unique in its table, so you can use it qualified by its table alias. > Examples: ["sales", "unit_says", ["ship_date", "time_id"], > ["order_date", "time_id"]. > > A "tile" is a materialized table in a lattice, with a particular > dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute > of the lattice JSON element (JsonLattice) defines an initial set of > tiles to materialize. > > If you run the algorithm, you can omit the tiles attribute. Calcite > will choose an initial set. If you include the tiles attribute, the > algorithm will start with that list and then start finding other tiles > that are complementary (i.e. "fill in the gaps" left by the initial > tiles). > > Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to > have to compute column cardinalities by executing a "count(distinct > ...)" query for each column. We need to introduce a > CardinalityProvider SPI, with a default implementation that generates > SQL, but allows people to use more efficient means, e.g. database > stats or HyperLogLog. Column cardinalities (and joint cardinalities > such as "count(distinct state, city, zipcode)") are really important > to the algorithm. If computing cardinalities is slow the algorithm > will be slow, and if they are inaccurate the algorithm will produce > sub-optimal results. > > I should have written this all down in a blog post a long while ago. > Sorry I didn't. Now I have. > > Julian > > > > On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <[email protected]> wrote: >> >> Can anyone point me to a short summary of how to construct and use the >> "tiles" model parameter in Calcite? The slides at >> http://www.slideshare.net/julianhyde/calcite-stratany2014 >> <http://www.slideshare.net/julianhyde/calcite-stratany2014> are intriguing, >> but I can't find any video of the presentation. >> >> And does it require a specific pre-existing star schema, or can it be used >> with any arbitrary schema? Skimming >> core/src/main/java/org/apache/calcite/materialize/Lattice.java it looks like >> the CARDINALITY_MAP constant is hardwired to the foodmart database. >>
