Re: [I] Finalize SIGMOD 2024 paper ~(if accepted)~ [arrow-datafusion]

via GitHub Mon, 05 Feb 2024 17:26:08 -0800


alamb commented on issue #8373:
URL: 
https://github.com/apache/arrow-datafusion/issues/8373#issuecomment-1928607468


   Here is the reviewer feedback
   
   Reviewer #2
   Questions
   1. Is the paper readable and well organized?
   Definitely - very clear
   2. Does this paper present a significant addition to the body of work in the 
area of data management research?
   Definitely - a significant addition
   3. Is the paper likely to have a broad impact on the data management 
community?
   SIGMOD attendees will learn something interesting from the paper
   The paper is likely to influence research in the community
   4. Overall rating
   Accept
   5. Reviewer’s confidence
   Expert
   6. Strong points
   1. Good presentation of the Apache Arrow DataFusion open-source project.
   
   2. DataFusion efficiently implements operators that can be used by various 
other data systems, avoiding their cumbersome re-implementation.
   
   3. Good experimental results versus DuckDB (which is an extremely well 
optimized embeddable analytics database).
   
   4. I really appreciate how the DataFusion community was involved even in 
writing this paper. See here: 
https://github.com/apache/arrow-datafusion/issues/6782
   7. Weak points
   1. Minor: although well-engineered, the algorithms behind the supported 
operators are not new. DataFusion implements well-known techniques.
   8. Overall comments
   The paper describes the functionality of DataFusion, a very well-designed 
and implemented library based on Apache Arrow, which implements a variety of 
operators used in SQL. Similar to Arrow, DataFusion is an embeddable library 
(built in Rust), which can easily be embedded in broader data systems that 
require analytical operations. The paper includes a nice experimental 
evaluation versus DuckDB, demonstrating good results.
   
   
   Reviewer #5
   Questions
   1. Is the paper readable and well organized?
   Definitely - very clear
   2. Does this paper present a significant addition to the body of work in the 
area of data management research?
   Mostly - the contributions are above the bar
   3. Is the paper likely to have a broad impact on the data management 
community?
   SIGMOD attendees will learn something interesting from the paper
   4. Overall rating
   Accept
   5. Reviewer’s confidence
   Expert
   6. Strong points
   1. The paper is well written.
   2. Extensive evaluation using 3 popular benchmarks.
   3. An active community-driven project.
   7. Weak points
   1. The DataFusion project is a combination and integration of other 
well-known components/systems; as such, its overall technical novelty is 
limited.
   2. The experimental evaluation didn't compare against many other popular 
OLAP systems in the field.
   3. The support for complex analytical queries (e.g., multi-way joins such as 
those found in TPC-DS) is limited.
   8. Overall comments
   This paper is well written and the DataFusion project has a good momentum in 
the community. The idea of building an OLAP engine using a decoupled, 
component-based approach is interesting (versus tightly coupled designs). The 
paper has described most elements in DataFusion, but didn't offer enough 
details to demonstrate sufficient technical novelty (that goes beyond 
integration of various existing components). How to better suit the cloud 
environment where most OLAP engines are running on nowadays is also not 
discussed in the paper.
   
   
   
   Reviewer #7
   Questions
   1. Is the paper readable and well organized?
   Mostly - the presentation has minor issues, but is acceptable
   2. Does this paper present a significant addition to the body of work in the 
area of data management research?
   Mostly - the contributions are above the bar
   3. Is the paper likely to have a broad impact on the data management 
community?
   SIGMOD attendees will learn something interesting from the paper
   4. Overall rating
   Reject
   5. Reviewer’s confidence
   Knowledgeable
   6. Strong points
   - Presents the technologies that power DataFusion and provides motivating 
use cases for using DataFusion, making a compelling argument over reuse in 
analytic systems using commodity OLAP engines and a paradigm shift in that 
direction.
   - Provides extensive evaluation of DataFusion's performance.
   - Presents DataFusion's architecture, extension APIs and features.
   7. Weak points
   - One of the main claims of the paper is that DataFusion is catalyzing the 
development of new data systems. The presentation and the evaluation of the 
paper would benefit from elaborating further on this claim.
   - Section 5.1 Engine overview and Figure 2 need to be more extensive to be 
able to follow the rest of Section 5.
   - The LLVM analogy distracts from the paper.
   - The paper claims in section 7.4 that "..DataFusion can be customized for 
these different environments using the MemoryPool trait to control memory 
allocations, the DiskManager trait for managing temporary files (if any), and a 
CacheManager for caching information such as directory contents and per-file 
metadata.". More technical details on this topic would be helpful.
   - The "Single Core Efficiency" section could benefit from running TPC-H 
across multiple threads and configuration settings. The authors mention a 
caveat of restricting DuckDB performance for some benchmarks by using a single 
thread.
   8. Overall comments
   I would like to thank the authors for their work. Please find some 
additional minor comments below:
   
   - Please move the figure out of the first page, or to the bottom of the 
first page. It is distracting to read the caption of Figure 1 before the 
abstract.
   
   - Please update the axes in Figure 7 to be legible.
   
   - One of the main topics of the paper is that DataFusion catalyzes the 
development of new data systems. Evaluation in that direction would help 
support the claims in the paper further. One related angle could be the ease of 
developing systems (applications) on top of DataFusion, potentially including 
the overhead in terms of lines of code or engineering hours in developing a 
simple system/application with DataFusion versus using a different stack or 
building it from scratch. Similarly, performance evaluation of systems relying on 
DataFusion could help in this direction as well.
   
   - Similarly, the content of the paper would benefit from doing a deep dive 
into the query engine and a limited set of features based on how they are used 
by systems developed on DataFusion.

