[ https://issues.apache.org/jira/browse/ARROW-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458047#comment-17458047 ]
Will Jones commented on ARROW-14730:
------------------------------------

I've thought about this a little more, and I think it might make more sense to do the Dataset implementation of Delta Lake in C++ rather than wrap delta-rs.

From a maintenance perspective, we already have expertise maintaining the code, tests, and CI associated with Python / R / C++; trying to replicate that in delta-rs might prove difficult. And from a code perspective, a Delta Lake reader requires Filesystem (local, S3) and format reader (Parquet, JSON) implementations, so if the PyArrow and R arrow implementations used delta-rs, they would inevitably contain both a Rust and a C++ implementation of those components, with potentially different behaviors. I don't think we can avoid that.

That said, it still might make sense to have the code live in the delta-io GitHub organization. AFAIK we don't yet have dataset implementations that live outside of the Arrow repo, but that's something we'd eventually like to support, right?

> [C++][R][Python] Support reading from Delta Lake tables
> -------------------------------------------------------
>
>                 Key: ARROW-14730
>                 URL: https://issues.apache.org/jira/browse/ARROW-14730
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Will Jones
>            Priority: Major
>
> [Delta Lake|https://delta.io/] is a parquet table format that supports ACID
> transactions. It's popularized by Databricks, which uses it as the default
> table format in their platform. Previously, it's only been readable from
> Spark, but now there is an effort in
> [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from
> elsewhere. There is already some integration with DataFusion (see:
> https://github.com/apache/arrow-datafusion/issues/525).
> There already exists [a method to read Delta Lake tables into Arrow
> tables in
> Python|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table]
> in the delta-rs Python bindings. This includes filtering by partitions.
> Is there a good way we could integrate this functionality with Arrow C++
> Dataset and expose that in Python and R? Would that be something that should
> be implemented in Arrow libraries or in delta-rs?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
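
For anyone wondering why a Delta Lake reader needs both a filesystem layer and a JSON reader in addition to Parquet: the set of live data files is not found by listing the table directory, it is derived by replaying the JSON commit files under _delta_log. Below is a rough, standard-library-only Python sketch of that replay step plus partition filtering. It is not how delta-rs or Arrow does it. The helper names (active_files, build_demo_table, demo) are hypothetical, the action fields used (add, remove, path, partitionValues) follow my reading of the Delta protocol, and the sketch assumes a local filesystem and deliberately ignores checkpoints, metadata/protocol actions, and time travel.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_root):
    """Replay the JSON commits in _delta_log; return {data file path: partition values}."""
    log_dir = Path(table_root) / "_delta_log"
    live = {}
    # Commit files are named by zero-padded version, so name order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                add = action["add"]
                live[add["path"]] = add.get("partitionValues", {})
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)
            # Other action types (metaData, protocol, ...) are ignored in this sketch.
    return live


def build_demo_table(root):
    """Write a tiny two-commit log (no real Parquet files) for illustration only."""
    log_dir = Path(root) / "_delta_log"
    log_dir.mkdir(parents=True)
    commits = [
        [
            {"add": {"path": "year=2020/a.parquet", "partitionValues": {"year": "2020"}}},
            {"add": {"path": "year=2021/b.parquet", "partitionValues": {"year": "2021"}}},
        ],
        [
            {"remove": {"path": "year=2020/a.parquet"}},
            {"add": {"path": "year=2020/c.parquet", "partitionValues": {"year": "2020"}}},
        ],
    ]
    for version, actions in enumerate(commits):
        text = "\n".join(json.dumps(a) for a in actions) + "\n"
        (log_dir / f"{version:020d}.json").write_text(text)


def demo():
    """Return the live files in partition year=2020 for the demo table."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_table(root)
        files = active_files(root)
        return sorted(p for p, parts in files.items() if parts.get("year") == "2020")


if __name__ == "__main__":
    print(demo())
```

The resulting path list is what a Dataset implementation would then hand to its Parquet reader, which is where the duplicated Rust and C++ filesystem/format code mentioned in the comment would come in.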