You are correct that the term "lattice" is borrowed from mathematics, and in particular from the seminal paper[1]. But it's just a metaphor for a data structure, just as are "cube".
By "schema" I mean the kind of thing one defines in ERWin - a collection of tables connected by relationships. A "star schema" is a well-known concept in the data warehousing world. It is a schema with a particular structure, namely, that all relationships point "outwards". Your definition of a schema as a set of possible queries and the "can be answered using" mathematical relation is a generalization of mine, but I claim it is less useful. My lattice is defined structurally. It only deals with filter-join-aggregate queries. Queries of this kind are really important for interactive DW, and focusing on this kind of queries makes the recommendation & matching algorithms tractable. Calcite also lets you define materialized views outside of a lattice, and it does its best to match queries to such materialized views, but does less well. In particular, It doesn't match join queries well. The search space for unifying a materialized view containing an N-way join to a query containing an N-way join is exponential in N, so it doesn't try. If you define a materialized view within a lattice, the structure provided by the lattice allows join queries to be matched very quickly. Julian [1] "Implementing data cubes efficiently", Harinarayan, Rajaraman, Ullmann 1997 http://web.eecs.umich.edu/~jag/eecs584/papers/implementing_data_cube.pdf. On Mon, May 4, 2015 at 2:28 PM, Ted Dunning <[email protected]> wrote: > Julian, > > I think that what you are talking about is related to the mathematical > definition of a lattice is a partially ordered set in which least upper > bound and greatest lower bound are defined for every pair of elements. > > In this case, the objects in the set are aggregation queries and the > partial ordering refers to the variables included in the group by. You can > take the ordering as to be defined between queries A and B such that A < B > if the variables in the group by part of A is a strict subset of the > variables in the group by part of B. This means that the least upper bound > between two queries as the query with the union of all grouping variables > and the greatest lower bound as the query with the intersection of grouping > variables. > > This is all straightforward except for bringing in a fair bit of > terminology. > > I am confused by one thing you say, however, and I wonder if others are as > well. > > You say "A lattice represents a star (or snowflake) schema, not a general > schema". If my interpretation of what you say is correct, the lattice is > not a schema at all, but instead a set of queries that are defined over a > schema with ordering defined as above. > > Is it merely a shorthand to refer to a schema as equivalent to the lattice > composed of all the possible aggregation queries with all combinations of > dimensions? > > On Mon, May 4, 2015 at 6:30 PM, Julian Hyde <[email protected]> wrote: > >> A lattice represents a star (or snowflake) schema, not a general >> schema. In particular, all relationships must be many-to-one, heading >> from a fact table at the center of the star. >> >> The lattice definition uses a SQL statement to represent the star. SQL >> is a useful short-hand to represent several tables joined together, >> and assigning aliases to the column names (it more convenient than >> inventing a new language to represent relationships, join conditions >> and cardinalities). >> >> Unlike regular SQL, order is important. If you put A before B in the >> FROM clause, and make a join between A and B, you are saying that >> there is a many-to-one foreign key relationship from A to B. (E.g. in >> the example lattice, the Sales fact table occurs before the Time >> dimension table, and before the Product dimension table. The Product >> dimension table occurs before the ProductClass outer dimension table, >> further down an arm of a snowflake.) >> >> A lattice implies constraints. In the A to B relationship, there is a >> foreign key on A (i.e. every value of A's foreign key has a >> corresponding value in B's key), and a unique key on B (i.e. no key >> value occurs more than once). These constraints are really important, >> because it allows the planner to remove joins to tables whose columns >> are not being used, and know that the query results will not change. >> >> Calcite does not check these constraints. If they are violated, >> Calcite will return wrong results. >> >> A lattice is a big, virtual join view. It is not materialized (it >> would be several times larger than the star schema, because of >> denormalization) and you probably wouldn't want to query it (far too >> many columns). So what is it useful for? As we said above, (a) the >> lattice declares some very useful primary and foreign key constraints, >> (b) it helps the query planner map user queries onto >> filter-join-aggregate materialized views (the most useful kind of >> materialized view for DW queries), (c) gives Calcite a framework >> within which to gather stats about data volumes and user queries, (d) >> allows Calcite to automatically design and populate materialized >> views. >> >> Most star schema models force you to choose whether a column is a >> dimension or a measure. In a lattice, every column is a dimension >> column. (I.e. it can become one of the columns in the GROUP BY clause >> to query the star schema at a particular dimensionality). Any column >> can also be used in a measure; you define measures by giving the >> column and an aggregate function. >> >> If "unit_sales" tends to be used much more often as a measure rather >> than a dimension, that's fine. Calcite's algorithm should notice that >> it is rarely aggregated, and not be inclined to create tiles that >> aggregate on it. (By "should" I mean "could and one day will". The >> algorithm does not currently take query history into account when >> designing tiles.) >> >> But someone might want to know whether orders with fewer than 5 items >> were more or less profitable than orders with more than 100. All of a >> sudden, "unit_sales" has become a dimension. If there's virtually zero >> cost to declaring a column a dimension column, I figured let's make >> them all dimension columns. >> >> The model allows for a particular table to be used more than once, >> with a different table alias. You could use this to model say >> OrderDate and ShipDate, with two uses to the Time dimension table. >> >> Most SQL systems require that the column names in a view are unique. >> This is hard to achieve in a lattice, because you often include >> primary and foreign key columns in a join. So Calcite lets you refer >> to columns in two ways. If the column is unique, you can use its name, >> ["unit_sales"]. Whether or not it is unique in the lattice, it will be >> unique in its table, so you can use it qualified by its table alias. >> Examples: ["sales", "unit_says", ["ship_date", "time_id"], >> ["order_date", "time_id"]. >> >> A "tile" is a materialized table in a lattice, with a particular >> dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute >> of the lattice JSON element (JsonLattice) defines an initial set of >> tiles to materialize. >> >> If you run the algorithm, you can omit the tiles attribute. Calcite >> will choose an initial set. If you include the tiles attribute, the >> algorithm will start with that list and then start finding other tiles >> that are complementary (i.e. "fill in the gaps" left by the initial >> tiles). >> >> Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to >> have to compute column cardinalities by executing a "count(distinct >> ...)" query for each column. We need to introduce a >> CardinalityProvider SPI, with a default implementation that generates >> SQL, but allows people to use more efficient means, e.g. database >> stats or HyperLogLog. Column cardinalities (and joint cardinalities >> such as "count(distinct state, city, zipcode)") are really important >> to the algorithm. If computing cardinalities is slow the algorithm >> will be slow, and if they are inaccurate the algorithm will produce >> sub-optimal results. >> >> I should have written this all down in a blog post a long while ago. >> Sorry I didn't. Now I have. >> >> Julian >> >> >> >> On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <[email protected]> >> wrote: >> > >> > Can anyone point me to a short summary of how to construct and use the >> "tiles" model parameter in Calcite? The slides at >> http://www.slideshare.net/julianhyde/calcite-stratany2014 < >> http://www.slideshare.net/julianhyde/calcite-stratany2014> are >> intriguing, but I can't find any video of the presentation. >> > >> > And does it require a specific pre-existing star schema, or can it be >> used with any arbitrary schema? Skimming >> core/src/main/java/org/apache/calcite/materialize/Lattice.java it looks >> like the CARDINALITY_MAP constant is hardwired to the foodmart database. >> > >>
