alamb commented on issue #8373: URL: https://github.com/apache/arrow-datafusion/issues/8373#issuecomment-1928607468
Here is the reviewer feedback:

Reviewer #2

Questions

1. Is the paper readable and well organized?
   Definitely - very clear
2. Does this paper present a significant addition to the body of work in the area of data management research?
   Definitely - a significant addition
3. Is the paper likely to have a broad impact on the data management community?
   SIGMOD attendees will learn something interesting from the paper
   The paper is likely to influence research in the community
4. Overall rating
   Accept
5. Reviewer’s confidence
   Expert
6. Strong points
   1. Good presentation of the Apache Arrow DataFusion open-source project.
   2. DataFusion efficiently implements operators that can be used by various other data systems, avoiding their cumbersome re-implementation.
   3. Good experimental results versus DuckDB (which is an extremely well optimized embeddable analytics database).
   4. I really appreciate how the DataFusion community was involved even in writing this paper. See here: https://github.com/apache/arrow-datafusion/issues/6782
7. Weak points
   1. Minor: although well-engineered, the algorithms behind the supported operators are not new. DataFusion implements well-known techniques.
8. Overall comments
   The paper describes the functionality of DataFusion, a very well-designed and implemented library based on Apache Arrow, which implements a variety of operators used in SQL. Similar to Arrow, DataFusion is an embeddable library (built in Rust), which can easily be embedded in broader data systems that require analytical operations. The paper includes a nice experimental evaluation versus DuckDB, demonstrating good results.

Reviewer #5

Questions

1. Is the paper readable and well organized?
   Definitely - very clear
2. Does this paper present a significant addition to the body of work in the area of data management research?
   Mostly - the contributions are above the bar
3. Is the paper likely to have a broad impact on the data management community?
   SIGMOD attendees will learn something interesting from the paper
4. Overall rating
   Accept
5. Reviewer’s confidence
   Expert
6. Strong points
   1. The paper is well written.
   2. Extensive evaluation using 3 popular benchmarks.
   3. An active community-driven project.
7. Weak points
   1. The DataFusion project is a combination and integration of other well-known components/systems; as such, its overall technical novelty is limited.
   2. The experimental evaluation didn't compare against many other popular OLAP systems in the field.
   3. The support for complex analytical queries (e.g., multi-way joins such as those found in TPC-DS) is limited.
8. Overall comments
   This paper is well written and the DataFusion project has good momentum in the community. The idea of building an OLAP engine using a decoupled, component-based approach is interesting (versus tightly coupled designs). The paper has described most elements in DataFusion, but didn't offer enough details to demonstrate sufficient technical novelty (that goes beyond integration of various existing components). How to better suit the cloud environment, where most OLAP engines run nowadays, is also not discussed in the paper.

Reviewer #7

Questions

1. Is the paper readable and well organized?
   Mostly - the presentation has minor issues, but is acceptable
2. Does this paper present a significant addition to the body of work in the area of data management research?
   Mostly - the contributions are above the bar
3. Is the paper likely to have a broad impact on the data management community?
   SIGMOD attendees will learn something interesting from the paper
4. Overall rating
   Reject
5. Reviewer’s confidence
   Knowledgeable
6. Strong points
   - Presents the technologies that power DataFusion and provides motivating use cases for using DataFusion, making a compelling argument for reuse in analytic systems using commodity OLAP engines and a paradigm shift in that direction.
   - Provides extensive evaluation of DataFusion's performance.
   - Presents DataFusion's architecture, extension APIs and features.
7. Weak points
   - One of the main claims of the paper is that DataFusion is catalyzing the development of new data systems. The presentation and the evaluation of the paper would benefit from elaborating further on this claim.
   - Section 5.1 Engine overview and Figure 2 need to be more extensive to be able to follow the rest of Section 5.
   - The LLVM analogy distracts from the paper.
   - The paper claims in section 7.4 that "..DataFusion can be customized for these different environments using the MemoryPool trait to control memory allocations, the DiskManager trait for managing temporary files (if any), and a CacheManager for caching information such as directory contents and per-file metadata." More technical details on this topic would be helpful.
   - The "Single Core Efficiency" section could benefit from running TPC-H across multiple threads and configuration settings. The authors mention a caveat of restricting DuckDB performance for some benchmarks by using a single thread.
8. Overall comments
   I would like to thank the authors for their work. Please find some additional minor comments below:
   - Please move the figure out of the first page, or to the bottom of the first page. It is distracting to read the caption of Figure 1 before the abstract.
   - Please update the axes in Figure 7 to be legible.
   - One of the main topics of the paper is that DataFusion catalyzes the development of new data systems. Evaluation in that direction would help support the claims in the paper further. One related angle could be the ease of developing systems (applications) on top of DataFusion, potentially including the overhead in terms of lines of code or engineering hours in developing a simple system/application with DataFusion versus using a different stack or building it from scratch. Similarly, performance evaluation of systems relying on DataFusion could help in this direction as well.
   - Similarly, the content of the paper would benefit from doing a deep dive into the query engine and a limited set of features based on how they are used by systems developed on DataFusion.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
