Thanks for the great feedback. I have logged https://issues.apache.org/jira/browse/CALCITE-6188. Can you take a look at the 5 example SQL statements and check that your use cases would be covered by at least one of those?
On Thu, Jan 4, 2024 at 3:18 PM Bertil Chapuis <bchap...@gmail.com> wrote: > > Thank you for raising this issue; the thread is really interesting. Most of > my use cases involve spatial data processing and visualization. > > Spatial data processing usually involves the execution complex Directed > Acyclic Graphs (DAGs) of relatively expensive tasks that need to be optimized > and executed in parallel. For instance, here is an bird-eye of what Apache > Baremaps does with PostgreSQL to produce vector maps [1]: > 1) Download the latest OpenStreetMap planet file and other relevant data; the > OSM models the data with nodes (i.e., points), ways (i.e., sequences of nodes > representing lines or polygons), and relations (i.e., sequences of ways and > nodes representing more complex geometries). > 2) Create temporary tables with all the nodes and references between ways and > nodes to create geometries. > 3) Create tables with the planet file and the temporary tables that contain > nodes, ways, and relations, all containing actual geometries instead of > references. > 4) Create materialized views that contain cleaner records (merging the lines > of the road network, merging contiguous polygons of the same type such as > forests, etc.). > 5) Create materialized views that simplify the geometries at different zoom > levels (world, continent, country, city, street, etc.). > 6) Query the tables and materialized views to generate a large cache > containing preprocessed files that can be visualized in the browser. > > Executing such a DAG for the planet requires a decent amount of storage and > compute power. As the process of designing maps is relatively empirical, one > cannot afford to re-execute the whole DAG to test small changes (e.g., > tweaking the simplifications applied to the road network) and intermediary > results need to be preserved. Another significant difficulty arises when the > OpenStreetMap data gets updated: one approach runs the whole DAG again; a > better approach would selectively update the tables, materialized views, and > cache with the delta. > > The syntax we currently use to represent our DAG is limited and not > satisfactory [2], for instance, the inputs and outputs of the tasks could be > used to define the edges. At some point, I wondered if all of this could be > replaced with incremental view maintenance (IVM). To some extent, one could > create a DAG where each vertex is a table or materialized view. Then, > inserting new data on one end would update the data on the other end. > Overall, it really feels like a multi-query optimization problem. > > What I just describes probably feels like a wobbly assembly ;) and I started > contributing to Apache Calcite when looking for better solutions. I’d be > really happy to help! > > [1] https://demo.baremaps.com/#14/46.58954/6.55072 > [2] https://github.com/apache/incubator-baremaps/blob/main/basemap/import.js > > > On 4 Jan 2024, at 21:52, Andrii Tsvielodub <a.tsvelo...@gmail.com> wrote: > > This would be an exciting extension. > I have experience using Calcite as a backend for a visualization platform, > and we had to develop similar functionality on top, as it was essential for > many optimizations we had in mind. > > Here's a quick outline of our use cases: > - Re-use of intermediate results. Two flavors here: > -- A common part of a request is extracted, and the data is cached in the > scope of one query. This could have probably been implemented using spool, > but we did it using a custom optimization program. > -- One query is pushed down into the DB, and then another query is run on > top of that result and produces the final resultset. This was probably the > most frequently used optimization. The first query did partial aggregation > in the DB, which was very beneficial for performance. > - Using the results of one query in another query. There were numerous > posts on the mailing list about building a dynamic filter for pushdown. We > did this by simply running several queries sequentially. For us, the > benefit of offloading both queries to the DB outweighed the drawback of > possibly scanning the same data twice. > - Re-using intermediate result(again), but returning several resultsets. > Doing several levels of aggregation on top of raw data(or the smallest > aggregation level). > - Running several independent queries on top of the same dataset. It is > probably not much of an optimization from the DB standpoint (if we ignore > all the smart multi-query optimizations, which I think not too many DBs > actually currently have). However, it had some minor communication-level > improvements for us as a data layer of the product. > > This was all implemented on top of Calcite, by building rel trees of > different complexity. Some of that logic was abstracted into our > "Statement" object; others were just pieces of logic during Rel tree > building. But I think most of those optimizations could be implemented in > Calcite, if there are enough extension points to plug in custom logic. > > There's probably nothing new here, but I hope this confirms such > functionality's usefulness. > > Would be happy to help with the design/implementation. > > On Thu, Jan 4, 2024 at 10:47 PM Julian Hyde <jhyde.apa...@gmail.com> wrote: > > I didn’t actually have an algorithm in mind. I wanted to kick off a > research project, by saying ’This is a common problem, and here is a > language for describing it. Does anyone have ideas for how to solve it?’. > > The issue keeps coming up, and I want to be able to say ‘Yes, we’ve been > thinking about that in Calcite’. > > So, I’ll go ahead and log that Jira case. I’d welcome improvements — > better SQL syntax, better set of use cases — from people who have been > working in this space. > > I hadn’t thought about field trimming across trees, but that sounds very > important and useful. > > I suspect that the Volcano approach will get us 80% of the way there. > Dynamic programming doesn’t really care whether we are working on a tree, a > forest, or a collection of trees loosely connected into a DAG. The only > problem — as noted in https://issues.apache.org/jira/browse/CALCITE-481 — > is that Volcano’s cost model will tend to double-count. > > But my goal isn’t to solve the problem. Once we have a good set of use > cases (pure query, pure DML, hybrid query+DML, cyclic data flows) expressed > in SQL, we can start discussing algorithms. > > > On Jan 3, 2024, at 9:25 PM, Benchao Li <libenc...@apache.org> wrote: > > This is a very fundamental use case in Flink. And in Flink, we have a > mechanism that breaks the DAG into sub trees[1] and then utilizes > Calcite to do the optimization of each sub tree. This prevents many > optimization chances across sub trees, such as field trimming, filter > pushdown, and many more transposing rules to utilize cost-based > optimization. > > The benefit we get via that way is that we can avoid many complex > optimization corner cases since sub trees are simpler than the whole > DAG, we even hacked to utilize this non formal character to work > around some planner corner cases (could not produce a valid plan / > optimization dead loops) temporarily before we finally solved the root > cause. > > Solving this problem in Calcite sounds good to me, the only one > concern is the optimization complexity (I'm not sure how you planned > to do it yet, it's just a heads up from my experience). I know it's > more related to how users choose the rule sets and the complexity of > queries, however it may be one of the factors that affects the > usability of multi-query optimization. > > [1] > > https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58 > > > Forward Xu <forwardxu...@gmail.com> 于2024年1月4日周四 10:51写道: > > > Thanks to the community for the related work that has been done, This > > is a > > good discussion, and I think it can be used in, for example, better > multi-query optimization algorithms, more effective query result caching > strategies, smarter materialized view selection and maintenance methods, > etc. These directions require further research and exploration. > > Finally here is a related paper MULTI-QUERY OPTIMIZATION AND > > APPLICATIONS] > > < > > https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf > > > . > > Best, > ForwardXu > > Julian Hyde <jh...@apache.org> 于2024年1月4日周四 04:09写道: > > Multi-query optimization is one of the big challenges of our field. > Examples of multi-queries: > * an INSERT statement that writes into a table but also updates an > > index, > > * a DAG that represents an ETL/ELT job; > * a query that produces several data sets (say a list of invoices for > orders and a list of products that need to be restocked); > * a query that uses intermediate results more than once. > > Common features of multi-queries are multiple output data sets, and > re-used intermediate results. In other words, the dataflow graph is a > DAG rather than just a tree. > > I think it would be useful to frame the problem by extending SQL so > that such multi-queries can be represented as a single unit. From that > would follow extensions to relational algebra, and improvements to > planner algorithms and cost models. > > I would like to hear people's thoughts before I log a Jira case with a > sketch of the problem. > > Related work: > * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator > * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes > * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental > query optimization > * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries > > Julian > > > > > -- > > Best, > Benchao Li > > > >