Thanks for starting this discussion.

I know I was the first to mention some concerns (which I still have, and
which would apply to any major new change), but I also think that this is
an avenue that should be explored.

Specifically, a native integration would have many benefits for read paths
(among others). I know that the Rust Avro reader is significantly faster,
as are native readers for columnar formats.

So while I do have some concerns about making sure we have enough people to
support this endeavor, I do want to say I think it's a really good idea. My
apologies if I gave the impression otherwise.

I would personally be interested in contributing to and reviewing for a
native Rust library (or C++, though I think Rust is a much more elegant
language and I'd personally prefer to work in it, as imo it's easier to
work with across systems than C++; I would defer to others on that).

I would also be happy to offer my help and perspective in moving this
forward if need be. But I did want to express my practical concerns so that
we don't have an area of the codebase where there aren't enough people to
help maintain it etc.

But in general I think this is an exciting opportunity, and results have
shown time and time again that native readers / writers are much more
performant.

+1 to using Rust as well (a language I know better than C++ these days,
though I'd have to brush up on both).

Best, Kyle

On Sun, Jun 12, 2022 at 8:20 PM OpenInx <open...@gmail.com> wrote:

> Hi Tao Wu.
>
> I think the Apache Iceberg community is largely in agreement on providing
> an Iceberg SDK for native languages.  I am very happy to offer my
> perspective and help if needed when you try to move this forward.
>
> On Mon, Jun 13, 2022 at 11:04 AM Wu Tao <wu...@apache.org> wrote:
>
>> Hi, everyone, I'm Tao. I'm currently working on a commercial streaming
>> system that is written in Rust.
>>
>> Actually, I'm planning to implement an Iceberg Rust SDK so that we can
>> have better integration with the existing Iceberg ecosystem. Initially I
>> found https://github.com/oliverdaff/iceberg-rs, but it appears the
>> author hasn't been active lately. So I'm looking to see if the Iceberg
>> community has any consensus on a Rust/C++ SDK (Rust is preferable), and if
>> there is, we'd love to contribute. I believe as Iceberg increases its
>> popularity, there will eventually be more systems that want such libraries.
>> There may even be some ongoing work that has not been discussed with the
>> community.
>>
>> Additionally, I think the initial Rust/C++ SDK only needs to support the
>> reader and writer sides of Iceberg, because there are already plenty of
>> JVM-based query engines out there taking charge of data maintenance. We
>> don't have to rewrite every corner of Iceberg in Rust. That means less
>> engineering work.
>>
>> On 2022/06/08 10:16:05 OpenInx wrote:
>> > As a cloud-native table format standard for the big-data ecosystem,  I
>> > believe supporting multiple languages is the correct direction so that
>> > different languages can connect to the apache iceberg table format.
>> >
>> > But I can also see Kyle's point about lacking enough resources
>> > (developers and reviewers) to accomplish this goal.  In my mind, Python,
>> > Golang, C++, and Rust can all be regarded as native language support.  We
>> > may just need to support the Rust SDK, and then all of the other languages
>> > can wrap the Rust SDK to access the table format.
>> >
>> > Anyway, we will need to wait for the REST catalog to be finished before we
>> > introduce support for other languages, because native code cannot access
>> > an Iceberg table by invoking the JVM catalog interfaces.
>> >
>> > On Tue, Jun 7, 2022 at 4:41 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> > >> There’s also the question of how useful this would be in practice given
>> > >> the complexity of using C++ (or Rust etc) within some of the major
>> > >> frameworks.
>> > >>
>> > >
>> > > One place this would be useful is Arrow's Dataset API [1].  An option
>> > > the Arrow community might be open to is hosting parts of the code there
>> > > (this is what is done for Apache Parquet C++).  This helps shape some of
>> > > the answers to other questions posed (ORC and Parquet are already in the
>> > > repo, it provides a Filesystem interface, etc).  The project doesn't
>> > > currently consume Avro, and I think the preferred approach is to make a
>> > > clean-room Avro parser.  But I agree this is a non-trivial effort to get
>> > > underway.
>> > >
>> > > Another area to consider is compatibility testing.  I think before a
>> > > third officially supported community library is introduced it would be
>> > > good to have a compatibility framework in place to make sure
>> > > implementations are all interpreting the specification correctly.  If
>> > > there isn't already an effort here, I'd like to start contributing
>> > > something (I will probably have bandwidth sometime in Q3).
>> > >
>> > > Thanks,
>> > > -Micah
>> > >
>> > >
>> > > [1] https://arrow.apache.org/docs/cpp/dataset.html
>> > >
>> > > On Sun, Jun 5, 2022 at 11:07 PM Kyle Bendickson <k...@tabular.io>
>> > > wrote:
>> > >
>> > >> Hi caneGuy,
>> > >>
>> > >> I personally don’t dislike this idea. I understand the performance
>> > >> benefits.
>> > >>
>> > >> But this would be a huge undertaking for the community. We’d need to
>> > >> ensure we had sufficient developer support for reviews (likely one of
>> > >> the biggest issues), as well as a number of other things, particularly
>> > >> dependencies, package management, etc. We’d also need to scope support
>> > >> down to specific OS / compilers etc.
>> > >>
>> > >> We’d also need to be sure we had adequate developer support from a wide
>> > >> enough range of the community to support the project long term. One
>> > >> issue in open source is that developers will work on something
>> > >> tangential to their project in another repository, but nobody is
>> > >> available to maintain it.
>> > >>
>> > >> There’s also the question of how useful this would be in practice given
>> > >> the complexity of using C++ (or Rust etc) within some of the major
>> > >> frameworks.
>> > >>
>> > >> Again, I’m not opposed to the idea but just trying to be realistic
>> > >> about what such an undertaking involves. It would need full community
>> > >> support (or at least support from enough community members to be
>> > >> sustainable).
>> > >>
>> > >> If you wanted to make a design doc, the milestones tab in the Iceberg
>> > >> project has some that you might use as reference.
>> > >>
>> > >> *I highly suggest you come to the next community sync and bring this up
>> > >> to the community then.*
>> > >>
>> > >> If you’re not already on the invite list for the monthly community
>> > >> sync, you can get on it by joining the Google group. You’ll receive
>> > >> invites when they go out:
>> > >> https://groups.google.com/g/iceberg-sync
>> > >>
>> > >> Looking forward to seeing you at the next community sync.
>> > >>
>> > >> A design document and/or any prior art would be very helpful, as the
>> > >> community sync does discuss many topics (possibly there is existing C++
>> > >> support in StarRocks for Iceberg V1?).
>> > >>
>> > >> Thank you,
>> > >> Kyle Bendickson
>> > >> GitHub: kbendick
>> > >>
>> > >> On Sun, Jun 5, 2022 at 10:44 PM Sam Redai <s...@tabular.io> wrote:
>> > >>
>> > >>> Currently there is no existing effort to develop a C++ package. That
>> > >>> being said, I think it would be awesome to have one! If anyone is
>> > >>> willing to start that development effort, I can help with some of the
>> > >>> groundwork to kickstart it.
>> > >>>
>> > >>> I would say the first step would be for someone to prepare a
>> > >>> high-level proposal.
>> > >>>
>> > >>> -Sam
>> > >>>
>> > >>> On Sun, Jun 5, 2022 at 11:02 PM 周康 <zhoukang199...@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>>> Hi team
>> > >>>> I am a dev from the StarRocks community, and we have supported the
>> > >>>> Iceberg v1 format.
>> > >>>> We are also planning to support the v2 format. If there is a C++
>> > >>>> package, it will be very convenient for our implementation.
>> > >>>> At the same time, it will also be faster for other C++ computing
>> > >>>> engines to support the v2 format.
>> > >>>>
>> > >>>> Do we have plans to support a C++ version of the SDK?
>> > >>>> --
>> > >>>> caneGuy
>> > >>>>
>> > >>> --
>> > >>>
>> > >>> Sam Redai <s...@tabular.io>
>> > >>>
>> > >>> Developer Advocate  |  Tabular <https://tabular.io/>
>> > >>>
>> > >>> c (267) 226-8606
>> > >>>
>> > >>
>> >
>>
>

-- 

Kyle Bendickson

OSS Developer  |  Tabular <https://tabular.io/>

k...@tabular.io
