I had some ideas to refactor the current FileIO implementations in PyIceberg to consolidate the behaviors for FsSpec and PyArrow. https://github.com/apache/iceberg-python/issues/310
There are also some additional concerns around URI parsing based on the
specific FileIO implementation. Perhaps a good litmus test for a
well-defined FileIO interface is to introduce nanoarrow
<https://arrow.apache.org/nanoarrow/latest/getting-started/python.html>
and see what breaks.

Thanks,
Kevin Liu

On Tue, Aug 6, 2024 at 11:21 AM Kevin Liu <kevin.jq....@gmail.com> wrote:

>
> First, we need to establish a workflow that allows us to gradually
> integrate new features into pyiceberg-core. Additionally, pyiceberg should
> be able to import and optionally use classes from pyiceberg-core in an
> additive manner. While developing this workflow, our community will learn
> how to collaborate, manage releases, and more.
>
> +1 I would like to learn more about how to integrate pyiceberg-core into
> PyIceberg. The initial setup should give us a framework for future
> integrations.
>
> I also think there's some prerequisite work on the Pyiceberg side to
> clean up FileIO. A lot of features are built specifically with PyArrow
> dependency, such as writing to the table.
>
>
> Thanks,
> Kevin
>
>
> On Mon, Aug 5, 2024 at 12:19 AM Xuanwo <xua...@apache.org> wrote:
>
>> For the FileIO part, just curious: since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>>
>> Hi, Honah
>>
>> I believe there should be no functional differences. We can implement the
>> exact same thing for both pyiceberg_core FileIO and opendalfs fsspec FileIO.
>>
>> The main difference I've noticed is in where the configuration parsing
>> occurs.
>>
>> The pyiceberg_core FileIO directly exposes the FileIO class, which can
>> inherently understand iceberg properties.
>> We can pass these properties
>> directly to initialize file IO without any additional effort on the
>> pyiceberg side.
>>
>> However, for opendalfs fsspec FileIO, we need to parse the properties and
>> convert them into appropriate opendalfs options for it to function properly.
>>
>> On Mon, Aug 5, 2024, at 15:04, Honah J. wrote:
>>
>> Thanks Xuanwo for driving this and everyone for discussing,
>>
>> I like the idea of pushing down low-level logic to Iceberg-rust
>> (pyiceberg_core). It’s great to have another option besides PyArrow for
>> reading and writing data in PyIceberg. Thanks, Xuanwo, for moving this
>> forward with the initial PR to add pyiceberg_core.
>>
>> For the FileIO part, just curious: since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>> Best regards,
>> Honah
>>
>>
>>
>> On Sat, Aug 3, 2024 at 4:12 PM timog...@proton.me.INVALID
>> <timog...@proton.me.invalid> wrote:
>>
>> Fantastic work! I think this is a great direction, and this provides a
>> good base to start iterating.
>>
>> It makes the most sense to me for the Python bindings (and others) to
>> live in the same repo as iceberg-rust, especially at this early stage.
>>
>> - Tim O'Guin
>>
>>
>> -------- Original Message --------
>> On 8/3/24 12:33 AM, Xuanwo wrote:
>>
>>
>> Let's rock! Welcome to take a review:
>> https://github.com/apache/iceberg-rust/pull/518
>>
>> On Sat, Aug 3, 2024, at 12:13, Xuanwo wrote:
>>
>> I also support integrating iceberg-rust with pyiceberg rather than
>> building something new on OpenDAL.
>>
>> OpenDAL backed FileIO will be usable in Python once opendalfs[1], the
>> native fsspec support for OpenDAL, is ready.
>> Users can use opendalfs as a
>> FileIO class directly in pure python. It's not an action item for our
>> community to take.
>>
>> The consensus we've reached is that iceberg-rust will be the core of
>> PyIceberg. The main question now is "How?" How can we implement it without
>> disrupting our valued users? This is my top priority.
>>
>> *Naming is so hard! Let's refer to the new iceberg-rust based pyiceberg
>> core as `pyiceberg-core` until we decide on a project name.*
>>
>> First, we need to establish a workflow that allows us to gradually
>> integrate new features into pyiceberg-core. Additionally, pyiceberg should
>> be able to import and optionally use classes from pyiceberg-core in an
>> additive manner. While developing this workflow, our community will learn
>> how to collaborate, manage releases, and more.
>>
>> We will then incorporate additional Rust-backed features into
>> pyiceberg-core. Eventually, we may make pyiceberg-core our default
>> implementation.
>>
>> My current plan is to implement this pyiceberg-core under iceberg-rust
>> repo under `bindings/python`.
>>
>> - Iceberg-rust is currently under active development. I plan to release
>> pyiceberg-core independently of iceberg-rust's release, as they feature
>> distinct public APIs (and languages!).
>> - Most of the work involves maintaining a few Python stubs and classes,
>> with the majority related to Rust.
>> - The python integration is just a start: we can expect `bindings/nodejs`
>> to happen here too.
>>
>> The setup work has already been started. I will update my PR here once
>> it's ready to review.
>>
>> [1]: https://github.com/fsspec/opendalfs
>>
>> On Sat, Aug 3, 2024, at 09:57, Renjie Liu wrote:
>>
>> Hi:
>>
>> I lean towards implementing pyiceberg's FileIO backed by iceberg-rust's
>> FileIO, rather than directly using OpenDAL.
>> The motivation is that we can
>> use this as a starting point of providing iceberg-rust backed components
>> for pyiceberg, and due to its simplicity, it's a good case. I believe there
>> will be more cases, like Sung mentioned transform in another thread, and
>> table scan mentioned by Fokko.
>>
>> If we want to use OpenDAL directly, we don't need iceberg-rust, since
>> OpenDAL already has python binding:
>> https://opendal.apache.org/docs/python/opendal.html
>>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository. There are some exceptions like
>> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>>
>>
>> Well, first I want to say providing a python binding for a library
>> written in rust is a quite common practice. Just to name a few: opendal
>> <https://github.com/apache/opendal>, polars
>> <https://github.com/pola-rs/polars>, datafusion
>> <https://github.com/apache/datafusion>, delta-rs
>> <https://github.com/delta-io/delta-rs>. As far as I know, most of them
>> choose to put python binding with rust in the same repo, only
>> datafusion-python <https://github.com/apache/datafusion-python> lives in
>> another, I'm not sure about the reason, maybe it's too large?
>>
>> I haven't tried to implement one before, but pyo3
>> <https://github.com/PyO3> has great documentation, and there are many
>> existing examples in open source we can learn with.
>>
>> On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:
>>
>> One more thing,
>>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository.
>> There are some exceptions like
>> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>>
>> Kind regards,
>> Fokko
>>
>>
>>
>> On Fri, Aug 2, 2024 at 20:11, Fokko Driesprong <fo...@apache.org> wrote:
>>
>> Thanks for driving this Xuanwo,
>>
>> I already suggested this in my talk back at the Spark Summit to see if we
>> can spark some interest, and it is exciting to see this materialize.
>>
>> For the IO abstraction, I think the FileIO is the best option. We already
>> have the interface
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239>
>> in PyIceberg, and also a PyArrowFileIO
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>.
>> I must admit that the abstraction is less clear in PyIceberg since we rely
>> so much on Arrow for reading/writing data that it is tightly coupled. I
>> would love to see if we can use OpenDAL for reading/writing data, and
>> Iceberg-rust for pushing down the low-level logic. A while ago I did some
>> profiling on the code, and one of the major issues is that Arrow doesn't
>> support proper field-ID projection. Therefore we have to read the Parquet
>> file, and do the schema-evolution and type promotion afterwards in Python
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>,
>> which causes a lot of congestion on the GIL.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Aug 2, 2024 at 17:46, Jack Ye <yezhao...@gmail.com> wrote:
>>
>> +1 for an OpenDALFileIO
>>
>> -Jack
>>
>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>
>>
>> Hi, renjie
>>
>> Thank you for your support. I'll delve into the details and first build a
>> PoC PR to make it clear.
>>
>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>
>> Hi:
>>
>> Thanks Xuanwo for raising this.
>>
>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>> is a good idea.
>>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 codes live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>> I think this is a good first step moving forward to make pyiceberg backed
>> by iceberg-rust. In the future we can replace components gradually.
>>
>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>
>>
>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> > implementations via pyo3 / fsspec bindings
>> > <https://github.com/apache/opendal/issues/4511>?
>>
>> Hi, Raschkowski, good question!
>>
>> It's possible. There is an ongoing project developing fsspec bindings for
>> opendal at https://github.com/fsspec/opendalfs. Once complete, we can
>> directly use opendal through fsspec.
>>
>> This work is unrelated to PyIceberg or Iceberg-rust. Ideally, users should
>> be able to use opendalfs as an alternative implementation of the fsspec
>> AbstractFileSystem class.
>>
>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>
>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> implementations via pyo3 / fsspec bindings
>> <https://github.com/apache/opendal/issues/4511>?
>>
>>
>> ------------------------------
>>
>> *From:* Joe Stein <crypt...@gmail.com>
>> *Sent:* Thursday, August 1, 2024 3:37 AM
>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>
>> Kafka did this with librdkafka and was wildly successful.
>> The underlying
>> bindings being in rust are great with a layer for access in Python +1
>>
>>
>> ~ Joe Stein
>>
>>
>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>
>> Hello everyone
>>
>> I start this thread to discuss the idea about using iceberg-rust as
>> pyiceberg file io.
>>
>> The idea is living at https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>
>> In summary, we can leverage the work from iceberg-rust to help pyiceberg
>> in developing a fast and compact file IO system that benefits users with
>> specific constraints.
>>
>> Welcome to join in the discussion.
>>
>> Xuanwo
>>
>> https://xuanwo.io/
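
Editor's sketch, not part of the original thread: Xuanwo's point about where configuration parsing occurs can be illustrated with a small translation step. A FileIO that understands Iceberg properties takes them as-is, while an fsspec-style backend would need something like the function below first. The function name and the key mapping are hypothetical. The `s3.*` keys follow PyIceberg's property naming, and the target names are only assumed to resemble OpenDAL's S3 options; the options opendalfs actually accepts may differ.

```python
from typing import Dict

# Hypothetical mapping from Iceberg-style S3 property keys to OpenDAL-style
# option names. This is an assumption for illustration, not the opendalfs API.
_S3_KEY_MAP: Dict[str, str] = {
    "s3.endpoint": "endpoint",
    "s3.region": "region",
    "s3.access-key-id": "access_key_id",
    "s3.secret-access-key": "secret_access_key",
}


def iceberg_props_to_s3_options(properties: Dict[str, str]) -> Dict[str, str]:
    """Translate recognised Iceberg properties; silently ignore all others."""
    return {
        option: properties[key]
        for key, option in _S3_KEY_MAP.items()
        if key in properties
    }


if __name__ == "__main__":
    catalog_properties = {
        "uri": "http://localhost:8181",  # catalog setting, not a FileIO option
        "s3.endpoint": "http://localhost:9000",
        "s3.region": "us-east-1",
    }
    print(iceberg_props_to_s3_options(catalog_properties))
```

With the pyiceberg_core approach described in the thread, this translation would live on the Rust side, so the Python caller would pass the property dictionary unchanged.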