I had some ideas to refactor the current FileIO implementations in
PyIceberg to consolidate the behaviors for FsSpec and PyArrow.
https://github.com/apache/iceberg-python/issues/310

There are also some additional concerns around URI parsing based on the
specific FileIO implementation.
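
To make the URI parsing concern concrete, here is a minimal sketch of the
split every FileIO has to perform before handing a location to its backend.
The function name `split_location` and the choice to treat a bare path as a
local `file` location are assumptions for illustration, not PyIceberg's
actual behavior.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_location(location: str) -> Tuple[str, str, str]:
    """Split a location string into (scheme, netloc, path).

    Assumption: a location without a scheme is a local file path.
    """
    uri = urlparse(location)
    if not uri.scheme:
        # Bare paths such as /tmp/warehouse carry no scheme at all.
        return "file", "", location
    return uri.scheme, uri.netloc, uri.path


print(split_location("s3://bucket/warehouse/db/table/metadata.json"))
print(split_location("/tmp/warehouse/db/table/metadata.json"))
```

How the three parts are recombined afterwards (for example `bucket/key`
versus the full URI) is where implementations could diverge, which is why a
shared parsing step seems worth consolidating.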

Perhaps a good litmus test for a well-defined FileIO interface is to
introduce nanoarrow
<https://arrow.apache.org/nanoarrow/latest/getting-started/python.html> and
see what breaks.

Thanks,
Kevin Liu





On Tue, Aug 6, 2024 at 11:21 AM Kevin Liu <kevin.jq....@gmail.com> wrote:

> > First, we need to establish a workflow that allows us to gradually
> integrate new features into pyiceberg-core. Additionally, pyiceberg should
> be able to import and optionally use classes from pyiceberg-core in an
> additive manner. While developing this workflow, our community will learn
> how to collaborate, manage releases, and more.
>
> +1 I would like to learn more about how to integrate pyiceberg-core into
> PyIceberg. The initial setup should give us a framework for future
> integrations.
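
One way the "additive" import could look, as a sketch only: the module name
`pyiceberg_core` is the working name used in this thread, and the helper
names `try_import` and `describe_backend` are hypothetical. Nothing here
reflects an agreed design.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(module_name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Working name from this thread; the final package name is undecided.
core = try_import("pyiceberg_core")


def describe_backend() -> str:
    """Report which implementation would be used."""
    return "rust-backed" if core is not None else "pure-python"


print(describe_backend())
```

With this shape, existing users who never install the extra package keep
the current code path, and the Rust-backed classes are opt-in.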
>
> I also think there's some prerequisite work on the PyIceberg side to
> clean up FileIO. Many features, such as writing to a table, are built
> specifically on the PyArrow dependency.
>
>
> Thanks,
> Kevin
>
>
> On Mon, Aug 5, 2024 at 12:19 AM Xuanwo <xua...@apache.org> wrote:
>
>> For the FileIO part, just curious—since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>>
>> Hi, Honah
>>
>> I believe there should be no functional differences. We can implement the
>> exact same thing for both pyiceberg_core FileIO and opendalfs fsspec FileIO.
>>
>> The main difference I've noticed is in where the configuration parsing
>> occurs.
>>
>> The pyiceberg_core FileIO directly exposes the FileIO class, which can
>> inherently understand iceberg properties. We can pass these properties
>> directly to initialize file IO without any additional effort on the
>> pyiceberg side.
>>
>> However, for opendalfs fsspec FileIO, we need to parse the properties and
>> convert them into appropriate opendalfs options for it to function properly.
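
A minimal sketch of the property translation described above. Both sides of
the mapping are assumptions for illustration: the `s3.*` keys are meant to
resemble Iceberg-style FileIO properties and the values are meant to
resemble OpenDAL S3 option names, but the real names, and how opendalfs
accepts them, would need to be checked against each project.

```python
from typing import Dict

# Hypothetical mapping from Iceberg-style properties to OpenDAL-style options.
ICEBERG_TO_OPENDAL_S3: Dict[str, str] = {
    "s3.endpoint": "endpoint",
    "s3.region": "region",
    "s3.access-key-id": "access_key_id",
    "s3.secret-access-key": "secret_access_key",
}


def to_opendal_options(properties: Dict[str, str]) -> Dict[str, str]:
    """Keep only the properties we know how to translate, renaming the keys."""
    return {
        ICEBERG_TO_OPENDAL_S3[key]: value
        for key, value in properties.items()
        if key in ICEBERG_TO_OPENDAL_S3
    }


catalog_properties = {
    "uri": "http://localhost:8181",  # unrelated to storage, so it is dropped
    "s3.endpoint": "http://localhost:9000",
    "s3.region": "us-east-1",
}
print(to_opendal_options(catalog_properties))
```

A FileIO that understands the Iceberg properties natively would not need
this step on the Python side at all.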
>>
>> On Mon, Aug 5, 2024, at 15:04, Honah J. wrote:
>>
>> Thanks Xuanwo for driving this and everyone for discussing,
>>
>> I like the idea of pushing down low-level logic to Iceberg-rust
>> (pyiceberg_core). It’s great to have another option besides PyArrow for
>> reading and writing data in PyIceberg. Thanks, Xuanwo, for moving this
>> forward with the initial PR to add pyiceberg_core.
>>
>> For the FileIO part, just curious—since Rust's FileIO currently also uses
>> OpenDAL, will there be any functional differences in terms of supported
>> storage services or configurations (like profile_name, signer, etc.)
>> compared to using opendalfs directly in Python in the future? Will Rust's
>> FileIO introduce any customizations/optimizations/extensions beyond what
>> OpenDAL supports?
>>
>> Best regards,
>> Honah
>>
>>
>>
>> On Sat, Aug 3, 2024 at 4:12 PM timog...@proton.me.INVALID
>> <timog...@proton.me.invalid> wrote:
>>
>> Fantastic work! I think this is a great direction, and this provides a
>> good base to start iterating.
>>
>> It makes the most sense to me for the Python bindings (and others) to
>> live in the same repo as iceberg-rust, especially at this early stage.
>>
>> - Tim O'Guin
>>
>>
>> -------- Original Message --------
>> On 8/3/24 12:33 AM, Xuanwo wrote:
>>
>>
>> Let's rock! Reviews are welcome:
>> https://github.com/apache/iceberg-rust/pull/518
>>
>> On Sat, Aug 3, 2024, at 12:13, Xuanwo wrote:
>>
>> I also support integrating iceberg-rust with pyiceberg rather than
>> building something new on OpenDAL.
>>
>> An OpenDAL-backed FileIO will be usable in Python once opendalfs[1], the
>> native fsspec support for OpenDAL, is ready. Users can then use opendalfs
>> as a FileIO class directly in pure Python. It's not an action item for our
>> community to take.
>>
>> The consensus we've reached is that iceberg-rust will be the core of
>> PyIceberg. The main question now is "How?" How can we implement it without
>> disrupting our valued users? This is my top priority.
>>
>> *Naming is so hard! Let's refer to the new iceberg-rust based pyiceberg
>> core as `pyiceberg-core` until we decide on a project name.*
>>
>> First, we need to establish a workflow that allows us to gradually
>> integrate new features into pyiceberg-core. Additionally, pyiceberg should
>> be able to import and optionally use classes from pyiceberg-core in an
>> additive manner. While developing this workflow, our community will learn
>> how to collaborate, manage releases, and more.
>>
>> We will then incorporate additional Rust-backed features into
>> pyiceberg-core. Eventually, we may make pyiceberg-core our default
>> implementation.
>>
>> My current plan is to implement pyiceberg-core in the iceberg-rust
>> repo under `bindings/python`.
>>
>> - Iceberg-rust is currently under active development. I plan to release
>> pyiceberg-core independently of iceberg-rust's release, as they feature
>> distinct public APIs (and languages!).
>> - Most of the work involves maintaining a few Python stubs and classes,
>> with the majority related to Rust.
>> - The Python integration is just a start: we can expect `bindings/nodejs`
>> to happen here too.
>>
>> The setup work has already been started. I will update my PR here once
>> it's ready to review.
>>
>> [1]: https://github.com/fsspec/opendalfs
>>
>> On Sat, Aug 3, 2024, at 09:57, Renjie Liu wrote:
>>
>> Hi:
>>
>> I lean towards implementing pyiceberg's FileIO backed by iceberg-rust's
>> FileIO, rather than directly using OpenDAL. The motivation is that we can
>> use this as a starting point for providing iceberg-rust backed components
>> for pyiceberg, and its simplicity makes it a good first case. I believe
>> there will be more cases, such as the transforms Sung mentioned in another
>> thread and the table scan mentioned by Fokko.
>>
>> If we want to use OpenDAL directly, we don't need iceberg-rust, since
>> OpenDAL already has a Python binding:
>> https://opendal.apache.org/docs/python/opendal.html
>>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository. There are some exceptions like
>> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>>
>>
>> Well, first I want to say that providing a Python binding for a library
>> written in Rust is quite a common practice. Just to name a few: opendal
>> <https://github.com/apache/opendal>, polars
>> <https://github.com/pola-rs/polars>, datafusion
>> <https://github.com/apache/datafusion>, delta-rs
>> <https://github.com/delta-io/delta-rs>. As far as I know, most of them
>> keep the Python binding in the same repo as the Rust code; only
>> datafusion-python <https://github.com/apache/datafusion-python> lives in
>> another. I'm not sure about the reason; maybe it's too large?
>>
>> I haven't tried to implement one before, but pyo3
>> <https://github.com/PyO3> has great documentation, and there are many
>> existing examples in open source we can learn from.
>>
>> On Sat, Aug 3, 2024 at 2:23 AM Fokko Driesprong <fo...@apache.org> wrote:
>>
>> One more thing,
>>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 code live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>>
>> Do you have any experience with this? I see many projects having Rust and
>> Python code in a single repository. There are some exceptions like
>> Pydantic (pydantic <https://github.com/pydantic/pydantic>, pydantic-core
>> <https://github.com/pydantic/pydantic-core>).
>>
>> Kind regards,
>> Fokko
>>
>>
>>
>> Op vr 2 aug 2024 om 20:11 schreef Fokko Driesprong <fo...@apache.org>:
>>
>> Thanks for driving this Xuanwo,
>>
>> I already suggested this in my talk back at the Spark Summit to see if we
>> can spark some interest, and it is exciting to see this materialize.
>>
>> For the IO abstraction, I think the FileIO is the best option. We already
>> have the interface
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/__init__.py#L239>
>> in PyIceberg, and also a PyArrowFileIO
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L327>.
>> I must admit that the abstraction is less clear in PyIceberg since we rely
>> so much on Arrow for reading/writing data that it is tightly coupled. I
>> would love to see if we can use OpenDAL for reading/writing data, and
>> Iceberg-rust for pushing down the low-level logic. A while ago I did some
>> profiling on the code, and one of the major issues is that Arrow doesn't
>> support proper field-ID projection. Therefore we have to read the Parquet
>> file, and do the schema-evolution and type promotion afterwards in Python
>> <https://github.com/apache/iceberg-python/blob/6c0d307032608967ccd00cfe72d8815e6e7e01cc/pyiceberg/io/pyarrow.py#L1444-L1458>,
>> which causes a lot of congestion on the GIL.
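
To illustrate what field-ID projection means here: Iceberg resolves columns
by field ID rather than by name, so a file written before a rename or an
added column still has to be mapped onto the current table schema. The
following is a toy pure-Python sketch of that mapping for a single row; the
function name `project_by_field_id` and the dict-based schemas are
assumptions for illustration and are not how PyIceberg or Arrow represent
schemas.

```python
from typing import Any, Dict


def project_by_field_id(
    file_schema: Dict[int, str],
    table_schema: Dict[int, str],
    row: Dict[str, Any],
) -> Dict[str, Any]:
    """Map a row keyed by file column names onto the table's column names.

    Both schemas map field ID -> column name. Fields missing from the file
    (added after the file was written) are filled with None.
    """
    projected: Dict[str, Any] = {}
    for field_id, table_name in table_schema.items():
        file_name = file_schema.get(field_id)
        projected[table_name] = row.get(file_name) if file_name is not None else None
    return projected


# File written when field 2 was still called "name" and field 3 did not exist.
file_schema = {1: "id", 2: "name"}
table_schema = {1: "id", 2: "full_name", 3: "email"}

print(project_by_field_id(file_schema, table_schema, {"id": 7, "name": "Ada"}))
```

Doing this kind of per-column work in Python after reading is the step that
pushing the logic down to a native layer would avoid.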
>>
>> Kind regards,
>> Fokko
>>
>> Op vr 2 aug 2024 om 17:46 schreef Jack Ye <yezhao...@gmail.com>:
>>
>> +1 for an OpenDALFileIO
>>
>> -Jack
>>
>> On Fri, Aug 2, 2024 at 8:32 AM Xuanwo <xua...@apache.org> wrote:
>>
>>
>> Hi, renjie
>>
>> Thank you for your support. I'll delve into the details and first build a
>> PoC PR to make it clear.
>>
>> On Fri, Aug 2, 2024, at 22:51, Renjie Liu wrote:
>>
>> Hi:
>>
>> Thanks Xuanwo for raising this.
>>
>> As mentioned in another thread, I think using iceberg-rust in pyiceberg
>> is a good idea.
>>
>> About this idea, would you have a more detailed design? For example,
>> where should the pyo3 code live, in iceberg-rust or in pyiceberg? What
>> kind of interface should we provide to pyiceberg, FileIO or OpenDAL?
>>
>> I think this is a good first step towards making pyiceberg backed by
>> iceberg-rust. In the future we can replace components gradually.
>>
>> On Fri, Aug 2, 2024 at 5:58 PM Xuanwo <xua...@apache.org> wrote:
>>
>>
>> > Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> implementations via pyo3 / fsspec bindings
>> <https://github.com/apache/opendal/issues/4511>?
>>
>> Hi, Raschkowski, good question!
>>
>> It's possible. There is an ongoing project developing fsspec bindings for
>> opendal at https://github.com/fsspec/opendalfs. Once complete, we can
>> directly use opendal through fsspec.
>>
>> This work is unrelated to PyIceberg or iceberg-rust. Ideally, users should
>> be able to use opendalfs as an alternative implementation of the fsspec
>> AbstractFileSystem class.
>>
>> On Fri, Aug 2, 2024, at 17:44, Will Raschkowski wrote:
>>
>> Xuanwo, would PyIceberg and iceberg-rust share the underlying OpenDAL
>> implementations via pyo3 / fsspec bindings
>> <https://github.com/apache/opendal/issues/4511>?
>>
>>
>> ------------------------------
>>
>> *From:* Joe Stein <crypt...@gmail.com>
>> *Sent:* Thursday, August 1, 2024 3:37 AM
>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Subject:* Re: [DISCUSS] Use iceberg-rust as pyiceberg file io
>>
>> Kafka did this with librdkafka and was wildly successful. Having the
>> underlying bindings in Rust with a layer for access in Python is great. +1
>>
>>
>> ~ Joe Stein
>>
>>
>> On Wed, Jul 31, 2024 at 10:29 PM Xuanwo <xua...@apache.org> wrote:
>>
>> Hello everyone
>>
>> I am starting this thread to discuss the idea of using iceberg-rust as
>> pyiceberg's file IO.
>>
>> The idea is written up at https://hackmd.io/@xuanwo/iceberg_rust_as_file_io
>>
>> In summary, we can leverage the work from iceberg-rust to help pyiceberg
>> in developing a fast and compact file IO system that benefits users with
>> specific constraints.
>>
>> Everyone is welcome to join the discussion.
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>>
>>
