Re: Multi-query optimization

Julian Hyde Thu, 04 Jan 2024 16:50:21 -0800

Thanks for the great feedback. I have logged
https://issues.apache.org/jira/browse/CALCITE-6188. Can you take a
look at the 5 example SQL statements and check that your use cases
would be covered by at least one of those?


On Thu, Jan 4, 2024 at 3:18 PM Bertil Chapuis <bchap...@gmail.com> wrote:
>
> Thank you for raising this issue; the thread is really interesting. Most of 
> my use cases involve spatial data processing and visualization.
>
> Spatial data processing usually involves the execution complex Directed 
> Acyclic Graphs (DAGs) of relatively expensive tasks that need to be optimized 
> and executed in parallel. For instance, here is an bird-eye of what Apache 
> Baremaps does with PostgreSQL to produce vector maps [1]:
> 1) Download the latest OpenStreetMap planet file and other relevant data; the 
> OSM models the data with nodes (i.e., points), ways (i.e., sequences of nodes 
> representing lines or polygons), and relations (i.e., sequences of ways and 
> nodes representing more complex geometries).
> 2) Create temporary tables with all the nodes and references between ways and 
> nodes to create geometries.
> 3) Create tables with the planet file and the temporary tables that contain 
> nodes, ways, and relations, all containing actual geometries instead of 
> references.
> 4) Create materialized views that contain cleaner records (merging the lines 
> of the road network, merging contiguous polygons of the same type such as 
> forests, etc.).
> 5) Create materialized views that simplify the geometries at different zoom 
> levels (world, continent, country, city, street, etc.).
> 6) Query the tables and materialized views to generate a large cache 
> containing preprocessed files that can be visualized in the browser.
>
> Executing such a DAG for the planet requires a decent amount of storage and 
> compute power. As the process of designing maps is relatively empirical, one 
> cannot afford to re-execute the whole DAG to test small changes (e.g., 
> tweaking the simplifications applied to the road network) and intermediary 
> results need to be preserved. Another significant difficulty arises when the 
> OpenStreetMap data gets updated: one approach runs the whole DAG again; a 
> better approach would selectively update the tables, materialized views, and 
> cache with the delta.
>
> The syntax we currently use to represent our DAG is limited and not 
> satisfactory [2], for instance, the inputs and outputs of the tasks could be 
> used to define the edges. At some point, I wondered if all of this could be 
> replaced with incremental view maintenance (IVM). To some extent, one could 
> create a DAG where each vertex is a table or materialized view. Then, 
> inserting new data on one end would update the data on the other end. 
> Overall, it really feels like a multi-query optimization problem.
>
> What I just describes probably feels like a wobbly assembly ;) and I started 
> contributing to Apache Calcite when looking for better solutions. I’d be 
> really happy to help!
>
> [1] https://demo.baremaps.com/#14/46.58954/6.55072
> [2] https://github.com/apache/incubator-baremaps/blob/main/basemap/import.js
>
>
> On 4 Jan 2024, at 21:52, Andrii Tsvielodub <a.tsvelo...@gmail.com> wrote:
>
> This would be an exciting extension.
> I have experience using Calcite as a backend for a visualization platform,
> and we had to develop similar functionality on top, as it was essential for
> many optimizations we had in mind.
>
> Here's a quick outline of our use cases:
> - Re-use of intermediate results. Two flavors here:
> -- A common part of a request is extracted, and the data is cached in the
> scope of one query. This could have probably been implemented using spool,
> but we did it using a custom optimization program.
> -- One query is pushed down into the DB, and then another query is run on
> top of that result and produces the final resultset. This was probably the
> most frequently used optimization. The first query did partial aggregation
> in the DB, which was very beneficial for performance.
> - Using the results of one query in another query. There were numerous
> posts on the mailing list about building a dynamic filter for pushdown. We
> did this by simply running several queries sequentially. For us, the
> benefit of offloading both queries to the DB outweighed the drawback of
> possibly scanning the same data twice.
> - Re-using intermediate result(again), but returning several resultsets.
> Doing several levels of aggregation on top of raw data(or the smallest
> aggregation level).
> - Running several independent queries on top of the same dataset. It is
> probably not much of an optimization from the DB standpoint (if we ignore
> all the smart multi-query optimizations, which I think not too many DBs
> actually currently have). However, it had some minor communication-level
> improvements for us as a data layer of the product.
>
> This was all implemented on top of Calcite, by building rel trees of
> different complexity. Some of that logic was abstracted into our
> "Statement" object; others were just pieces of logic during Rel tree
> building. But I think most of those optimizations could be implemented in
> Calcite, if there are enough extension points to plug in custom logic.
>
> There's probably nothing new here, but I hope this confirms such
> functionality's usefulness.
>
> Would be happy to help with the design/implementation.
>
> On Thu, Jan 4, 2024 at 10:47 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>
> I didn’t actually have an algorithm in mind. I wanted to kick off a
> research project, by saying ’This is a common problem, and here is a
> language for describing it. Does anyone have ideas for how to solve it?’.
>
> The issue keeps coming up, and I want to be able to say ‘Yes, we’ve been
> thinking about that in Calcite’.
>
> So, I’ll go ahead and log that Jira case. I’d welcome improvements —
> better SQL syntax, better set of use cases — from people who have been
> working in this space.
>
> I hadn’t thought about field trimming across trees, but that sounds very
> important and useful.
>
> I suspect that the Volcano approach will get us 80% of the way there.
> Dynamic programming doesn’t really care whether we are working on a tree, a
> forest, or a collection of trees loosely connected into a DAG. The only
> problem — as noted in https://issues.apache.org/jira/browse/CALCITE-481 —
> is that Volcano’s cost model will tend to double-count.
>
> But my goal isn’t to solve the problem. Once we have a good set of use
> cases (pure query, pure DML, hybrid query+DML, cyclic data flows) expressed
> in SQL, we can start discussing algorithms.
>
>
> On Jan 3, 2024, at 9:25 PM, Benchao Li <libenc...@apache.org> wrote:
>
> This is a very fundamental use case in Flink. And in Flink, we have a
> mechanism that breaks the DAG into sub trees[1] and then utilizes
> Calcite to do the optimization of each sub tree. This prevents many
> optimization chances across sub trees, such as field trimming, filter
> pushdown, and many more transposing rules to utilize cost-based
> optimization.
>
> The benefit we get via that way is that we can avoid many complex
> optimization corner cases since sub trees are simpler than the whole
> DAG, we even hacked to utilize this non formal character to work
> around some planner corner cases (could not produce a valid plan /
> optimization dead loops) temporarily before we finally solved the root
> cause.
>
> Solving this problem in Calcite sounds good to me, the only one
> concern is the optimization complexity (I'm not sure how you planned
> to do it yet, it's just a heads up from my experience). I know it's
> more related to how users choose the rule sets and the complexity of
> queries, however it may be one of the factors that affects the
> usability of multi-query optimization.
>
> [1]
>
> https://github.com/apache/flink/blob/8e8e2d649e98071bc15ee768cf65da3e96f255b4/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/optimize/CommonSubGraphBasedOptimizer.scala#L58
>
>
> Forward Xu <forwardxu...@gmail.com> 于2024年1月4日周四 10:51写道：
>
>
> Thanks to the community for the related work that has been done, This
>
> is a
>
> good discussion,  and I think it can be used in, for example, better
> multi-query optimization algorithms, more effective query result caching
> strategies, smarter materialized view selection and maintenance methods,
> etc. These directions require further research and exploration.
>
> Finally here is a related paper MULTI-QUERY OPTIMIZATION AND
>
> APPLICATIONS]
>
> <
>
> https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2015/2014/Papers/prasan-thesis.pdf
>
>
> .
>
> Best，
> ForwardXu
>
> Julian Hyde <jh...@apache.org> 于2024年1月4日周四 04:09写道：
>
> Multi-query optimization is one of the big challenges of our field.
> Examples of multi-queries:
> * an INSERT statement that writes into a table but also updates an
>
> index,
>
> * a DAG that represents an ETL/ELT job;
> * a query that produces several data sets (say a list of invoices for
> orders and a list of products that need to be restocked);
> * a query that uses intermediate results more than once.
>
> Common features of multi-queries are multiple output data sets, and
> re-used intermediate results. In other words, the dataflow graph is a
> DAG rather than just a tree.
>
> I think it would be useful to frame the problem by extending SQL so
> that such multi-queries can be represented as a single unit. From that
> would follow extensions to relational algebra, and improvements to
> planner algorithms and cost models.
>
> I would like to hear people's thoughts before I log a Jira case with a
> sketch of the problem.
>
> Related work:
> * https://issues.apache.org/jira/browse/CALCITE-481 Spool operator
> * https://issues.apache.org/jira/browse/CALCITE-1440 Multiple RelNodes
> * https://issues.apache.org/jira/browse/CALCITE-4568 Incremental
> query optimization
> * https://issues.apache.org/jira/browse/CALCITE-129 Recursive queries
>
> Julian
>
>
>
>
> --
>
> Best,
> Benchao Li
>
>
>
>

Re: Multi-query optimization

Reply via email to