Part 2 of the article "Our journey at F5 with Apache Arrow"

2023-06-16 Thread Laurent Quérel
Hi everyone,

I have written the second part of the article "Our Journey at F5 with Apache
Arrow" (part one was published on April 11, 2023). This new article discusses
three techniques that have enabled us to improve both the compression ratio
and the memory footprint of Apache Arrow buffers within the current version
of the OTel Arrow protocol.

The following PR is open for your feedback:
https://github.com/apache/arrow-site/pull/369

You can also read the article on Google Docs if you prefer.
https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing

Cheers, Laurent Quérel


Re: OpenTelemetry + Arrow

2023-04-11 Thread Laurent Quérel
Thank you very much, Andrew.

I should be able to work on the second article next week and I will follow
the same process.

Cheers, Laurent

On Tue, Apr 11, 2023 at 4:31 AM Andrew Lamb  wrote:

> The blog post is now live on the arrow site [1]
>
> Thanks again Laurent
>
> [1]:
>
> https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/
>
> On Sun, Apr 2, 2023 at 9:07 PM Laurent Quérel 
> wrote:
>
> > Hi Andrew,
> >
> > The feedback seems to be good so I created a PR.
> >
> > https://github.com/apache/arrow-site/pull/340
> >
> > Best regards,
> >
> > Laurent Quérel
> >
> > On Thu, Mar 30, 2023 at 3:28 PM Laurent Quérel  >
> > wrote:
> >
> > > I'm glad to know that the article has been well-received. In the second
> > > article, I will allocate a dedicated section to summarize the various
> > > challenges encountered when using Arrow for this type of project.
> > >
> > > @Matt, I want to express my gratitude for your continuous support
> > > throughout this project. Your contributions and refinements to the
> Arrow
> > Go
> > > library have enabled me to make significant progress with minimal
> > obstacles.
> > >
> > > Best Regards, Laurent
> > >
> > > On Thu, Mar 30, 2023 at 2:24 PM Matt Topol 
> > wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> I'm glad others on here are finding this as useful and interesting as
> I
> > >> did.
> > >>
> > >> Great job Laurent!
> > >>
> > >> --Matt
> > >>
> > >> On Thu, Mar 30, 2023, 3:26 PM Raphael Taylor-Davies
> > >>  wrote:
> > >>
> > >> > Hi Laurent,
> > >> >
> > >> > I gave the first blog post a read and I also really like it and
> would
> > be
> > >> > +1 on publishing it, nice work.
> > >> >
> > >> > I would also like to echo Will's sentiment that getting real-world
> > case
> > >> > studies for the more complex Arrow schemas is invaluable and will
> help
> > >> > drive improvements in this space, so thank you for driving this
> > forward.
> > >> >
> > >> > Kind Regards,
> > >> >
> > >> > Raphael
> > >> >
> > >> > On 30/03/2023 19:52, Will Jones wrote:
> > >> > > Hi Laurent,
> > >> > >
> > >> > > I have read the first post and I really like it. I'd be +1 on
> > >> publishing
> > >> > > these to the blog. I'm interested to read the second one when it's
> > >> > finished.
> > >> > >
> > >> > > IMO the blog could use more examples of using Arrow that's not
> > >> building a
> > >> > > data frame library / query engine, and I appreciate that this blog
> > >> > provides
> > >> > > advice for some of the trickier parts of working with complex
> Arrow
> > >> > > schemas. I think this will also provide a good concrete use case
> for
> > >> us
> > >> > to
> > >> > > think about improving the ecosystem's support for nested data.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Will Jones
> > >> > >
> > >> > > On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel <
> > >> > laurent.que...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Hello everyone,
> > >> > >>
> > >> > >> I was wondering if the Apache Arrow community would be interested
> > in
> > >> > >> featuring a two-part article series on their blog, discussing the
> > >> > >> experiences and insights gained from an experimental version of
> the
> > >> > >> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main
> > >> > author of
> > >> > >> the OTLP Arrow specification
> > >> > >> <
> > >> >
> > >>
> >
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> > >> > >>> ,
> > >> > >> the reference implementation otlp-arrow-adapter
> > >> > >> <https://github.com/f5/otel-arro

Re: OpenTelemetry + Arrow

2023-04-02 Thread Laurent Quérel
Hi Andrew,

The feedback seems to be good so I created a PR.

https://github.com/apache/arrow-site/pull/340

Best regards,

Laurent Quérel

On Thu, Mar 30, 2023 at 3:28 PM Laurent Quérel 
wrote:

> I'm glad to know that the article has been well-received. In the second
> article, I will allocate a dedicated section to summarize the various
> challenges encountered when using Arrow for this type of project.
>
> @Matt, I want to express my gratitude for your continuous support
> throughout this project. Your contributions and refinements to the Arrow Go
> library have enabled me to make significant progress with minimal obstacles.
>
> Best Regards, Laurent
>
> On Thu, Mar 30, 2023 at 2:24 PM Matt Topol  wrote:
>
>> +1 (non-binding)
>>
>> I'm glad others on here are finding this as useful and interesting as I
>> did.
>>
>> Great job Laurent!
>>
>> --Matt
>>
>> On Thu, Mar 30, 2023, 3:26 PM Raphael Taylor-Davies
>>  wrote:
>>
>> > Hi Laurent,
>> >
>> > I gave the first blog post a read and I also really like it and would be
>> > +1 on publishing it, nice work.
>> >
>> > I would also like to echo Will's sentiment that getting real-world case
>> > studies for the more complex Arrow schemas is invaluable and will help
>> > drive improvements in this space, so thank you for driving this forward.
>> >
>> > Kind Regards,
>> >
>> > Raphael
>> >
>> > On 30/03/2023 19:52, Will Jones wrote:
>> > > Hi Laurent,
>> > >
>> > > I have read the first post and I really like it. I'd be +1 on
>> publishing
>> > > these to the blog. I'm interested to read the second one when it's
>> > finished.
>> > >
>> > > IMO the blog could use more examples of using Arrow that's not
>> building a
>> > > data frame library / query engine, and I appreciate that this blog
>> > provides
>> > > advice for some of the trickier parts of working with complex Arrow
>> > > schemas. I think this will also provide a good concrete use case for
>> us
>> > to
>> > > think about improving the ecosystem's support for nested data.
>> > >
>> > > Best,
>> > >
>> > > Will Jones
>> > >
>> > > On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel <
>> > laurent.que...@gmail.com>
>> > > wrote:
>> > >
>> > >> Hello everyone,
>> > >>
>> > >> I was wondering if the Apache Arrow community would be interested in
>> > >> featuring a two-part article series on their blog, discussing the
>> > >> experiences and insights gained from an experimental version of the
>> > >> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main
>> > author of
>> > >> the OTLP Arrow specification
>> > >> <
>> >
>> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
>> > >>> ,
>> > >> the reference implementation otlp-arrow-adapter
>> > >> <https://github.com/f5/otel-arrow-adapter>, and the two articles
>> (see
>> > >> links
>> > >> below), I believe that fostering collaboration between open-source
>> > projects
>> > >> like these is essential and mutually beneficial.
>> > >>
>> > >> These articles would serve as a fitting complement to the three
>> > >> introductory articles that Andrew Lamb and Raphael Taylor-Davies
>> > >> co-authored. They delve into the practical aspects of integrating
>> Apache
>> > >> Arrow into an existing project, as well as the process of converting
>> a
>> > >> hierarchical data model into its Arrow representation. The first
>> article
>> > >> examines various mapping techniques for aligning an existing data
>> model
>> > >> with the corresponding Arrow representation, while the second article
>> > >> explores an adaptive schema technique that I implemented in the
>> > library's
>> > >> final version in greater depth. Although the second article is still
>> > under
>> > >> development, the core framework description is already in place.
>> > >>
>> > >> What are your thoughts on this proposal?
>> > >>
>> > >> Article 1:
>> > >>
>> > >>
>> >
>> https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing
>> > >>
>> > >> Article 2 (WIP):
>> > >>
>> > >>
>> >
>> https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing
>> > >>
>> > >>
>> > >> Best regards,
>> > >>
>> > >> Laurent Quérel
>> > >>
>> > >> --
>> > >> Laurent Quérel
>> > >>
>> >
>>
>
>
> --
> Laurent Quérel
>
>


-- 
Laurent Quérel


Re: OpenTelemetry + Arrow

2023-03-30 Thread Laurent Quérel
I'm glad to know that the article has been well-received. In the second
article, I will allocate a dedicated section to summarize the various
challenges encountered when using Arrow for this type of project.

@Matt, I want to express my gratitude for your continuous support
throughout this project. Your contributions and refinements to the Arrow Go
library have enabled me to make significant progress with minimal obstacles.

Best Regards, Laurent

On Thu, Mar 30, 2023 at 2:24 PM Matt Topol  wrote:

> +1 (non-binding)
>
> I'm glad others on here are finding this as useful and interesting as I
> did.
>
> Great job Laurent!
>
> --Matt
>
> On Thu, Mar 30, 2023, 3:26 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi Laurent,
> >
> > I gave the first blog post a read and I also really like it and would be
> > +1 on publishing it, nice work.
> >
> > I would also like to echo Will's sentiment that getting real-world case
> > studies for the more complex Arrow schemas is invaluable and will help
> > drive improvements in this space, so thank you for driving this forward.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 30/03/2023 19:52, Will Jones wrote:
> > > Hi Laurent,
> > >
> > > I have read the first post and I really like it. I'd be +1 on
> publishing
> > > these to the blog. I'm interested to read the second one when it's
> > finished.
> > >
> > > IMO the blog could use more examples of using Arrow that's not
> building a
> > > data frame library / query engine, and I appreciate that this blog
> > provides
> > > advice for some of the trickier parts of working with complex Arrow
> > > schemas. I think this will also provide a good concrete use case for us
> > to
> > > think about improving the ecosystem's support for nested data.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel <
> > laurent.que...@gmail.com>
> > > wrote:
> > >
> > >> Hello everyone,
> > >>
> > >> I was wondering if the Apache Arrow community would be interested in
> > >> featuring a two-part article series on their blog, discussing the
> > >> experiences and insights gained from an experimental version of the
> > >> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main
> > author of
> > >> the OTLP Arrow specification
> > >> <
> >
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> > >>> ,
> > >> the reference implementation otlp-arrow-adapter
> > >> <https://github.com/f5/otel-arrow-adapter>, and the two articles (see
> > >> links
> > >> below), I believe that fostering collaboration between open-source
> > projects
> > >> like these is essential and mutually beneficial.
> > >>
> > >> These articles would serve as a fitting complement to the three
> > >> introductory articles that Andrew Lamb and Raphael Taylor-Davies
> > >> co-authored. They delve into the practical aspects of integrating
> Apache
> > >> Arrow into an existing project, as well as the process of converting a
> > >> hierarchical data model into its Arrow representation. The first
> article
> > >> examines various mapping techniques for aligning an existing data
> model
> > >> with the corresponding Arrow representation, while the second article
> > >> explores an adaptive schema technique that I implemented in the
> > library's
> > >> final version in greater depth. Although the second article is still
> > under
> > >> development, the core framework description is already in place.
> > >>
> > >> What are your thoughts on this proposal?
> > >>
> > >> Article 1:
> > >>
> > >>
> >
> https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing
> > >>
> > >> Article 2 (WIP):
> > >>
> > >>
> >
> https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing
> > >>
> > >>
> > >> Best regards,
> > >>
> > >> Laurent Quérel
> > >>
> > >> --
> > >> Laurent Quérel
> > >>
> >
>


-- 
Laurent Quérel


OpenTelemetry + Arrow

2023-03-30 Thread Laurent Quérel
Hello everyone,

I was wondering if the Apache Arrow community would be interested in
featuring a two-part article series on their blog, discussing the
experiences and insights gained from an experimental version of the
OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main author of
the OTLP Arrow specification
<https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>,
the reference implementation otlp-arrow-adapter
<https://github.com/f5/otel-arrow-adapter>, and the two articles (see links
below), I believe that fostering collaboration between open-source projects
like these is essential and mutually beneficial.

These articles would serve as a fitting complement to the three
introductory articles that Andrew Lamb and Raphael Taylor-Davies
co-authored. They delve into the practical aspects of integrating Apache
Arrow into an existing project, as well as the process of converting a
hierarchical data model into its Arrow representation. The first article
examines various mapping techniques for aligning an existing data model
with the corresponding Arrow representation, while the second article
explores an adaptive schema technique that I implemented in the library's
final version in greater depth. Although the second article is still under
development, the core framework description is already in place.

What are your thoughts on this proposal?

Article 1:
https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing

Article 2 (WIP):
https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing


Best regards,

Laurent Quérel

-- 
Laurent Quérel


Re: Request for Patch release of 10.0.2

2022-12-16 Thread Laurent Quérel
@Matt: Thanks, and I also wish you a good vacation and a good end of the
year!

--Laurent

On Fri, Dec 16, 2022 at 9:31 AM Matt Topol  wrote:

> @Laurent: I just merged the patches into the "maint-10.0.x" branch. You
> should be able to use `go mod edit -replace` to point at that branch
> instead of the v10.0.1 version tag and pick up the patches.
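For readers following along, the workaround Matt describes would look roughly like this in an application's go.mod (the module name is hypothetical; running `go mod tidy` resolves the branch name to a pseudo-version — this is a sketch of the general recipe, not a command sequence verified against that exact branch):

```go
// go.mod of a hypothetical application depending on Arrow Go v10
module example.com/otel-arrow-experiment

go 1.19

require github.com/apache/arrow/go/v10 v10.0.1

// Temporarily track the maintenance branch instead of the v10.0.1 tag;
// `go mod tidy` rewrites "maint-10.0.x" to the matching pseudo-version.
// Remove this replace directive once v11 is released.
replace github.com/apache/arrow/go/v10 => github.com/apache/arrow/go/v10 maint-10.0.x
```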
>
> Once v11 is released in mid-January, v11 will also contain these fixes and
> you'll be able to remove the "replace" directive. Feel free to open a new
> Github issue if you run into any problems! Thanks for your patience with
> this and I hope you enjoy your holidays and end of the year!
>
> --Matt
>
> On Thu, Dec 15, 2022 at 3:31 PM Laurent Quérel 
> wrote:
>
> > @Matt Thanks
> >
> > On Thu, Dec 15, 2022 at 9:51 AM Matt Topol 
> wrote:
> >
> > > I've created a PR for the cherry-picked changes here:
> > > https://github.com/apache/arrow/pull/14980
> > >
> > > @Kou or @Neal could one of you take a look and approve the PR before I
> > > merge it? It feels like it should at least get approved by someone
> > before I
> > > merge to the maintenance branch, lol.
> > >
> > > --Matt
> > >
> > > On Thu, Dec 15, 2022 at 11:47 AM Neal Richardson <
> > > neal.p.richard...@gmail.com> wrote:
> > >
> > > > I don't see a problem cherry-picking commits to the maintenance
> > > > branch--seems like that's what it should be for, right?
> > > >
> > > > Neal
> > > >
> > > > On Thu, Dec 15, 2022 at 11:17 AM Matt Topol 
> > > > wrote:
> > > >
> > > > > @Kou It looks like we're going to just have a branch as an
> > unofficial
> > > > > patched version that can solve Laurent's issue until the v11
> release
> > > > > happens in mid-January. While a full v10.0.2 patch release would be
> > > > > preferable, it doesn't seem anyone is available to serve as release
> > > > manager
> > > > > given the proximity to holidays, plus we're just gonna have the v11
> > > > release
> > > > > in a few weeks anyways.
> > > > >
> > > > > So that brings me to the question for everyone I guess, should I
> > > create a
> > > > > new branch? Or should I just cherry-pick the fix as a new commit
> into
> > > the
> > > > > "maint-10.x.x" branch? It feels better to have it as a branch on
> the
> > > > > official repo rather than on a fork.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > On Mon, Dec 12, 2022 at 7:09 PM Sutou Kouhei 
> > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm sorry but I can't do this (serve as release manager) in
> > > > > > this month. But I can help a committer or PMC member who
> > > > > > wants to serve as release manager for 10.0.2.
> > > > > >
> > > > > > FYI: ASF says that a committer can serve as release manager:
> > > > > > https://infra.apache.org/release-publishing.html#releasemanager
> > > > > >
> > > > > > A committer can't do some release tasks such as signing and
> > > > > > uploading. I can do them instead as a PMC member.
> > > > > >
> > > > > > Is there a committer or PMC member who wants to serve as
> > > > > > release manager for 10.0.2?
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > kou
> > > > > >
> > > > > > In <
> > > cajdg-vqxsbifujnz1ndx+t0dudhxmujr4ea6htvi759y6e-...@mail.gmail.com
> > > > >
> > > > > >   "Re: Request for Patch release of 10.0.2" on Mon, 12 Dec 2022
> > > > 10:09:27
> > > > > > -0500,
> > > > > >   Matthew Topol  wrote:
> > > > > >
> > > > > > > I'm in favor of this, can someone from the PMC please comment
> on
> > > the
> > > > > > > viability of this? I know we have the v11 release coming up in
> > the
> > > > next
> > > > > > > month or so and we'd need someone to be the release manager for
> > &

Re: Request for Patch release of 10.0.2

2022-12-15 Thread Laurent Quérel
@Matt Thanks

On Thu, Dec 15, 2022 at 9:51 AM Matt Topol  wrote:

> I've created a PR for the cherry-picked changes here:
> https://github.com/apache/arrow/pull/14980
>
> @Kou or @Neal could one of you take a look and approve the PR before I
> merge it? It feels like it should at least get approved by someone before I
> merge to the maintenance branch, lol.
>
> --Matt
>
> On Thu, Dec 15, 2022 at 11:47 AM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
>
> > I don't see a problem cherry-picking commits to the maintenance
> > branch--seems like that's what it should be for, right?
> >
> > Neal
> >
> > On Thu, Dec 15, 2022 at 11:17 AM Matt Topol 
> > wrote:
> >
> > > @Kou It looks like we're going to just have a branch as an unofficial
> > > patched version that can solve Laurent's issue until the v11 release
> > > happens in mid-January. While a full v10.0.2 patch release would be
> > > preferable, it doesn't seem anyone is available to serve as release
> > manager
> > > given the proximity to holidays, plus we're just gonna have the v11
> > release
> > > in a few weeks anyways.
> > >
> > > So that brings me to the question for everyone I guess, should I
> create a
> > > new branch? Or should I just cherry-pick the fix as a new commit into
> the
> > > "maint-10.x.x" branch? It feels better to have it as a branch on the
> > > official repo rather than on a fork.
> > >
> > > Thoughts?
> > >
> > > On Mon, Dec 12, 2022 at 7:09 PM Sutou Kouhei 
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm sorry but I can't do this (serve as release manager) in
> > > > this month. But I can help a committer or PMC member who
> > > > wants to serve as release manager for 10.0.2.
> > > >
> > > > FYI: ASF says that a committer can serve as release manager:
> > > > https://infra.apache.org/release-publishing.html#releasemanager
> > > >
> > > > A committer can't do some release tasks such as signing and
> > > > uploading. I can do them instead as a PMC member.
> > > >
> > > > Is there a committer or PMC member who wants to serve as
> > > > release manager for 10.0.2?
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In <
> cajdg-vqxsbifujnz1ndx+t0dudhxmujr4ea6htvi759y6e-...@mail.gmail.com
> > >
> > > >   "Re: Request for Patch release of 10.0.2" on Mon, 12 Dec 2022
> > 10:09:27
> > > > -0500,
> > > >   Matthew Topol  wrote:
> > > >
> > > > > I'm in favor of this, can someone from the PMC please comment on
> the
> > > > > viability of this? I know we have the v11 release coming up in the
> > next
> > > > > month or so and we'd need someone to be the release manager for
> > > > performing
> > > > > a v10.0.2 patch release. So I'm not sure whether or not this is a
> > > viable
> > > > > option.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Fri, Dec 9, 2022 at 1:21 PM Laurent Quérel <
> > > laurent.que...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> I would like to propose the release of a patch containing the
> > > following
> > > > two
> > > > >> PRs:
> > > > >> - https://github.com/apache/arrow/pull/14892
> > > > >> - https://github.com/apache/arrow/pull/14904
> > > > >>
> > > > >> The first one fixes a memory leak issue when the built-in
> > compression
> > > is
> > > > >> enabled.
> > > > >> The other fixes an issue with empty maps.
> > > > >>
> > > > >> Thanks to Matt Topol for fixing these two bugs quickly.
> > > > >>
> > > > >> I am currently working on a use of Apache Arrow in the
> > > > >> OpenTelemetry
> > > > >> project. A first beta is planned in December, so I would like to
> > have
> > > an
> > > > >> official patch (see this repo for more details
> > > > >> https://github.com/f5/otel-arrow-adapter)
> > > > >>
> > > > >> --Laurent Quérel
> > > > >>
> > > >
> > >
> >
>


-- 
Laurent Quérel


Re: Request for Patch release of 10.0.2

2022-12-12 Thread Laurent Quérel
@Matthew It could work for me if we stay within this kind of time frame.
However, this memory leak problem could affect other users who may not be
aware of the problem or the fix. Having an official patch version would be
more effective in spreading the fix within the community, I think.

—Laurent

On Mon, Dec 12, 2022 at 7:36 AM Matthew Topol 
wrote:

> @Laurent: Would it be sufficient if we cherry-pick the bugfixes to a branch
> off of the v10 release and then you can use a replace directive in the
> go.mod to point at that particular commit until the v11 release comes out
> in 4 - 6 weeks (give or take given the end of year holidays)?
>
> --Matt
>
> On Mon, Dec 12, 2022 at 10:09 AM Matthew Topol 
> wrote:
>
> > I'm in favor of this, can someone from the PMC please comment on the
> > viability of this? I know we have the v11 release coming up in the next
> > month or so and we'd need someone to be the release manager for
> performing
> > a v10.0.2 patch release. So I'm not sure whether or not this is a viable
> > option.
> >
> > --Matt
> >
> > On Fri, Dec 9, 2022 at 1:21 PM Laurent Quérel 
> > wrote:
> >
> >> Hi all,
> >>
> >> I would like to propose the release of a patch containing the following
> >> two
> >> PRs:
> >> - https://github.com/apache/arrow/pull/14892
> >> - https://github.com/apache/arrow/pull/14904
> >>
> >> The first one fixes a memory leak issue when the built-in compression is
> >> enabled.
> >> The other fixes an issue with empty maps.
> >>
> >> Thanks to Matt Topol for fixing these two bugs quickly.
> >>
> >> I am currently working on a use of Apache Arrow in the OpenTelemetry
> >> project. A first beta is planned in December, so I would like to have an
> >> official patch (see this repo for more details
> >> https://github.com/f5/otel-arrow-adapter)
> >>
> >> --Laurent Quérel
> >>
> >
>
-- 
Laurent Quérel


Request for Patch release of 10.0.2

2022-12-09 Thread Laurent Quérel
Hi all,

I would like to propose the release of a patch containing the following two
PRs:
- https://github.com/apache/arrow/pull/14892
- https://github.com/apache/arrow/pull/14904

The first one fixes a memory leak issue when the built-in compression is
enabled.
The other fixes an issue with empty maps.

Thanks to Matt Topol for fixing these two bugs quickly.

I am currently working on a use of Apache Arrow in the OpenTelemetry
project. A first beta is planned for December, so I would like to have an
official patch (see this repo for more details
https://github.com/f5/otel-arrow-adapter)

--Laurent Quérel


Re: Request for Patch release of 10.0.1

2022-11-08 Thread Laurent Quérel
When is the 10.0.1 release expected?

Thank you very much to all.

-- Laurent

On Tue, Nov 8, 2022 at 7:41 AM Jacob Wujciak 
wrote:

> As Joris said, 10.0.1 will happen. In addition to pyarrow there are also
> some patches for the R package that are needed due to changes in
> dependencies post-10.0.0, which are causing or will cause issues on CRAN.
>
> On Tue, Nov 8, 2022 at 4:25 PM Matt Topol  wrote:
>
> > Hey all,
> >
> > On JIRA[1] there was a request by Laurent who is working on the Open
> > Telemetry Beta using Arrow as their transport to release a fix in a patch
> > release as v10.0.1. I've opened up a draft PR[2] which cherry-picks the
> > change onto the maint-10.0.0 branch as a preliminary step to getting
> > feedback and potentially calling for a Vote to perform such a release.
> >
> > So I guess, this is where I ask for people's opinions and thoughts as far
> > as potentially performing this patch release as it would need to be voted
> > on by the PMC to do so.
> >
> > Thoughts? Opinions? Thanks everyone!
> >
> > @Laurent: Since you said you were subscribed to the mailing list, any
> > additional color you can provide as far as needing the patch release vs
> > using master etc. would be beneficial. Thanks!
> >
> > --Matt
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-18274
> > [2]: https://github.com/apache/arrow/pull/14608
> >
>


-- 
Laurent Quérel


Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Julian,

My intermediate representation is indeed an API and does not define a
specific physical format (which could be different from one language to
another, or even not exist at all in some cases). That being said, I didn't
fully understand your feedback, and I'm sure there's something to dig into
here. Could you please give me more details, or point me to some literature
presenting this idea in more depth?

Best,
Laurent


On Thu, Jul 28, 2022 at 6:09 PM Julian Hyde  wrote:

> If the 'row-oriented format' is an API rather than a physical data
> representation then it can be implemented via coroutines and could
> therefore have less scattered patterns of read/write access.
>
> By 'coroutines' I'm being rather imprecise, but I hope you get the
> general idea. An asynchronous API (with some buffering) is potentially
> much more efficient than a physical format.
>
> On Thu, Jul 28, 2022 at 5:43 PM Gavin Ray  wrote:
> >
> > This is essentially the same idea as the proposal here I think --
> > row/map-based representation & conversion functions for ease of use:
> >
> > [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry,
> > increase adoption/audience and productivity. · Issue #12618 ·
> apache/arrow
> > (github.com) <https://github.com/apache/arrow/issues/12618>
> >
> > Definitely a worthwhile pursuit IMO.
> >
> > On Thu, Jul 28, 2022 at 4:46 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > > I just wanted to chime in that we already do have a form of
> row-oriented
> > > storage inside of `arrow/compute/row/row_internal.h`. It is used to
> store
> > > rows inside of GroupBy and Join within Acero. We also have utilities
> for
> > > converting to/from columnar storage (and AVX2 implementations of these
> > > conversions) inside of `arrow/compute/row/encode_internal.h`. Would it
> be
> > > useful to standardize this row-oriented format?
> > >
> > > As far as I understand, fixed-width rows would be trivially convertible
> > > into this representation (just a pointer to your array of structs),
> while
> > > variable-width rows would need a little bit of massaging (though not
> too
> > > much) to be put into this representation.
> > >
> > > Sasha Krassovsky
> > >
> > > > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <
> laurent.que...@gmail.com>
> > > wrote:
> > > >
> > > > Thank you, Micah, for a very clear summary of the intent behind this
> > > > proposal. Indeed, clarifying from the beginning that this approach
> > > > aims to facilitate experimentation rather than raw performance of the
> > > > transformation phase would have made my objective easier to understand.
> > > >
> > > > Regarding your question, I don't think there is a specific technical
> > > reason
> > > > for such an integration in the core library. I was just thinking
> that it
> > > > would make this infrastructure easier to find for the users and that
> this
> > > > topic was general enough to find its place in the standard library.
> > > >
> > > > Best,
> > > > Laurent
> > > >
> > > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Laurent,
> > > >> I'm retitling this thread to include the specific languages you
> seem to
> > > be
> > > >> targeting in the subject line to hopefully get more eyes from
> > > maintainers
> > > >> in those languages.
> > > >>
> > > >> Thanks for clarifying the goals. If I can restate my understanding,
> > > >> the intended use case here is to provide easy (from the developer's
> > > >> point of view) adaptation of row-based formats to Arrow. The means of
> > > >> achieving this is creating an API for a row-based structure, and
> > > >> having utility classes that can manipulate the interface to build up
> > > >> batches (there is no serialization or in-memory spec associated with
> > > >> this API). People wishing to integrate a specific row-based format
> > > >> can extend that API at whatever level makes sense for the format.
> > > >>
> >
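The row-oriented intermediate API under discussion can be pictured with a small, self-contained Go sketch. Every name below (`Value`, `Record`, `RecordRepository`) is illustrative only — it mirrors the general shape of such an API, not the actual types in the adapter's pkg/air package:

```go
package main

import "fmt"

// Value is one cell of a row; a full implementation would cover all of
// Arrow's scalar and nested types. (Illustrative sketch only.)
type Value interface {
	fmt.Stringer
}

type String struct{ V string }
type I64 struct{ V int64 }

func (s String) String() string { return s.V }
func (i I64) String() string    { return fmt.Sprint(i.V) }

// Record is the row-oriented intermediate representation: a bag of
// named fields that a batch builder can later consume.
type Record struct {
	fields map[string]Value
}

func NewRecord() *Record { return &Record{fields: map[string]Value{}} }

func (r *Record) Set(name string, v Value) { r.fields[name] = v }

// RecordRepository accumulates rows; a full implementation would infer
// an Arrow schema from the accumulated fields and flush the buffered
// rows as Arrow record batches.
type RecordRepository struct {
	rows []*Record
}

func (rr *RecordRepository) Insert(r *Record) { rr.rows = append(rr.rows, r) }
func (rr *RecordRepository) Len() int         { return len(rr.rows) }

func main() {
	rr := &RecordRepository{}
	row := NewRecord()
	row.Set("name", String{V: "span-1"})
	row.Set("duration_ms", I64{V: 42})
	rr.Insert(row)
	fmt.Println("buffered rows:", rr.Len())
}
```

The point of the interface-only design, as discussed in the thread, is that no physical row format is mandated: an adapter for a given row-based source implements the API at whatever level is convenient.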

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Gavin,
I was not aware of this initiative, but indeed these two proposals have
much in common. The implementation I am working on is available here:
https://github.com/lquerel/otel-arrow-adapter (directory pkg/air). I would
be happy to get your feedback and to identify with you any gaps in
covering your specific use case.
Best,
Laurent

On Thu, Jul 28, 2022 at 5:43 PM Gavin Ray  wrote:

> This is essentially the same idea as the proposal here I think --
> row/map-based representation & conversion functions for ease of use:
>
> [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry,
> increase adoption/audience and productivity. · Issue #12618 · apache/arrow
> (github.com) <https://github.com/apache/arrow/issues/12618>
>
> Definitely a worthwhile pursuit IMO.
>
> On Thu, Jul 28, 2022 at 4:46 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> wrote:
>
> > Hi everyone,
> > I just wanted to chime in that we already do have a form of row-oriented
> > storage inside of `arrow/compute/row/row_internal.h`. It is used to store
> > rows inside of GroupBy and Join within Acero. We also have utilities for
> > converting to/from columnar storage (and AVX2 implementations of these
> > conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
> > useful to standardize this row-oriented format?
> >
> > As far as I understand, fixed-width rows would be trivially convertible
> > into this representation (just a pointer to your array of structs), while
> > variable-width rows would need a little bit of massaging (though not too
> > much) to be put into this representation.
> >
> > Sasha Krassovsky
> >
> > > On Jul 28, 2022, at 1:10 PM, Laurent Quérel 
> > wrote:
> > >
> > > Thank you Micah for a very clear summary of the intent behind this
> > > proposal. Indeed, I think that clarifying from the beginning that this
> > > approach aims at facilitating experimentation more than efficiency in
> > terms
> > > of performance of the transformation phase would have helped to better
> > > understand my objective.
> > >
> > > Regarding your question, I don't think there is a specific technical
> > reason
> > > for such an integration in the core library. I was just thinking that
> it
> > > would make this infrastructure easier to find for the users and that
> this
> > > topic was general enough to find its place in the standard library.
> > >
> > > Best,
> > > Laurent
> > >
> > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Laurent,
> > >> I'm retitling this thread to include the specific languages you seem
> to
> > be
> > >> targeting in the subject line to hopefully get more eyes from
> > maintainers
> > >> in those languages.
> > >>
> > >> Thanks for clarifying the goals.  If I can restate my understanding,
> the
> > >> intended use-case here is to provide easy (from the developer point of
> > >> view) adaptation of row based formats to Arrow.  The means of
> achieving
> > >> this is creating an API for a row-base structure, and having utility
> > >> classes that can manipulate the interface to build up batches (there
> > are no
> > >> serialization or in memory spec associated with this API).  People
> > wishing
> > >> to integrate a specific row based format, can extend that API at
> > whatever
> > >> level makes sense for the format.
> > >>
> > >> I think this would be useful infrastructure as long as it was made
> clear
> > >> that in many cases this wouldn't be the most efficient way to convert
> to
> > >> Arrow from other formats.
> > >>
> > >> I don't work much with either the Rust or Go implementation, so I
> can't
> > >> speak to if there is maintainer support for incorporating the changes
> > >> directly in Arrow.  Is there any technical reasons for preferring to
> > have
> > >> this included directly in Arrow vs a separate library?
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <
> > laurent.que...@gmail.com>
> > >> wrote:
> > >>
> > >>> Far be it from me to think that I know more than Jorge or Wes on this
> > >>> subject. Sorry if my post gives that perception, that is clearly not my
> > >>> intention.

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Sasha,
Thank you very much for this informative comment. It's interesting to see
another use of a row-based API in the context of a query engine. I think
that there is some thought to be given to whether these two use cases can
be converged into a single public row-based API.

As a first reaction, I would say that they are not necessarily easy to
reconcile because the constraints and the goals to be optimized are
relatively disjoint. If you see a way to do it, I'm extremely interested.

If I understand correctly, in your case, you want to optimize the
conversion from column to row representation and vice versa (a kind of
bidirectional projection). Having a SIMD implementation of these
conversions is just fantastic. However, it seems that in your case there is
no support for nested types yet, and there is no public API to build rows
in a simple and ergonomic way outside of this bridge with the column-based
representation.

In the use case I'm trying to solve, the criteria to optimize are 1) expose
a row-based API that offers the least amount of friction in the process of
converting any row-based source to Arrow, which implies an easy-to-use API
and support for nested types, 2) make it easy to create an efficient Arrow
schema by automating dictionary creation and multi-column sorting in a way
that makes Arrow easy to use for the casual user.

The criteria to be optimized seem relatively disjoint to me, but again I
would be happy to dig into a solution with you that offers a good
compromise for these two use cases.

Best,
Laurent



On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky 
wrote:

> Hi everyone,
> I just wanted to chime in that we already do have a form of row-oriented
> storage inside of `arrow/compute/row/row_internal.h`. It is used to store
> rows inside of GroupBy and Join within Acero. We also have utilities for
> converting to/from columnar storage (and AVX2 implementations of these
> conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
> useful to standardize this row-oriented format?
>
> As far as I understand fixed-width rows would be trivially convertible
> into this representation (just a pointer to your array of structs), while
> variable-width rows would need a little bit of massaging (though not too
> much) to be put into this representation.
>
> Sasha Krassovsky
>
> > On Jul 28, 2022, at 1:10 PM, Laurent Quérel 
> wrote:
> >
> > Thank you Micah for a very clear summary of the intent behind this
> > proposal. Indeed, I think that clarifying from the beginning that this
> > approach aims at facilitating experimentation more than efficiency in
> terms
> > of performance of the transformation phase would have helped to better
> > understand my objective.
> >
> > Regarding your question, I don't think there is a specific technical
> reason
> > for such an integration in the core library. I was just thinking that it
> > would make this infrastructure easier to find for the users and that this
> > topic was general enough to find its place in the standard library.
> >
> > Best,
> > Laurent
> >
> > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield 
> > wrote:
> >
> >> Hi Laurent,
> >> I'm retitling this thread to include the specific languages you seem to
> be
> >> targeting in the subject line to hopefully get more eyes from
> maintainers
> >> in those languages.
> >>
> >> Thanks for clarifying the goals.  If I can restate my understanding, the
> >> intended use-case here is to provide easy (from the developer point of
> >> view) adaptation of row based formats to Arrow.  The means of achieving
> >> this is creating an API for a row-base structure, and having utility
> >> classes that can manipulate the interface to build up batches (there
> are no
> >> serialization or in memory spec associated with this API).  People
> wishing
> >> to integrate a specific row based format, can extend that API at
> whatever
> >> level makes sense for the format.
> >>
> >> I think this would be useful infrastructure as long as it was made clear
> >> that in many cases this wouldn't be the most efficient way to convert to
> >> Arrow from other formats.
> >>
> >> I don't work much with either the Rust or Go implementation, so I can't
> >> speak to if there is maintainer support for incorporating the changes
> >> directly in Arrow.  Is there any technical reasons for preferring to
> have
> >> this included directly in Arrow vs a separate library?
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel wrote:

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Thank you Micah for a very clear summary of the intent behind this
proposal. Indeed, clarifying from the beginning that this approach aims
more at facilitating experimentation than at raw performance of the
transformation phase would have helped readers better understand my
objective.

Regarding your question, I don't think there is a specific technical reason
for such an integration in the core library. I was just thinking that it
would make this infrastructure easier to find for the users and that this
topic was general enough to find its place in the standard library.

Best,
Laurent

On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield 
wrote:

> Hi Laurent,
> I'm retitling this thread to include the specific languages you seem to be
> targeting in the subject line to hopefully get more eyes from maintainers
> in those languages.
>
> Thanks for clarifying the goals.  If I can restate my understanding, the
> intended use-case here is to provide easy (from the developer point of
> view) adaptation of row based formats to Arrow.  The means of achieving
> this is creating an API for a row-base structure, and having utility
> classes that can manipulate the interface to build up batches (there are no
> serialization or in memory spec associated with this API).  People wishing
> to integrate a specific row based format, can extend that API at whatever
> level makes sense for the format.
>
> I think this would be useful infrastructure as long as it was made clear
> that in many cases this wouldn't be the most efficient way to convert to
> Arrow from other formats.
>
> I don't work much with either the Rust or Go implementation, so I can't
> speak to if there is maintainer support for incorporating the changes
> directly in Arrow.  Is there any technical reasons for preferring to have
> this included directly in Arrow vs a separate library?
>
> Cheers,
> Micah
>
> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel 
> wrote:
>
> > Far be it from me to think that I know more than Jorge or Wes on this
> > subject. Sorry if my post gives that perception, that is clearly not my
> > intention. I'm just trying to defend the idea that when designing this
> kind
> > of transformation, it might be interesting to have a library to test
> > several mappings and evaluate them before doing a more direct
> > implementation if the performance is not there.
> >
> > On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <
> > benjaminblodg...@gmail.com> wrote:
> >
> > > He was trying to nicely say he knows way more than you, and your ideas
> > > will result in a low performance scheme no one will use in production
> > > ai/machine learning.
> > >
> > > Sent from my iPhone
> > >
> > > > On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
> > > benjaminblodg...@gmail.com> wrote:
> > > >
> > > > I think Jorge’s opinion has is that of an expert and him being
> humble
> > > is just being tactful.  Probably listen to Jorge on performance and
> > > architecture, even over Wes as he’s contributed more than anyone else
> and
> > > know the bleeding edge of low level performance stuff more than anyone.
> > > >
> > > > Sent from my iPhone
> > > >
> > > >> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <
> > laurent.que...@gmail.com>
> > > wrote:
> > > >>
> > > >> Hi Jorge
> > > >>
> > > >> I don't think that the level of in-depth knowledge needed is the
> same
> > > >> between using a row-oriented internal representation and "Arrow"
> which
> > > not
> > > >> only changes the organization of the data but also introduces a set
> of
> > > >> additional mapping choices and concepts.
> > > >>
> > > >> For example, assuming that the initial row-oriented data source is a
> > > stream
> > > >> of nested assembly of structures, lists and maps. The mapping of
> such
> > a
> > > >> stream to Protobuf, JSON, YAML, ... is straightforward because on
> both
> > > >> sides the logical representation is exactly the same, the schema is
> > > >> sometimes optional, the interest of building batches is optional,
> ...
> > In
> > > >> the case of "Arrow" things are different - the schema and the
> batching
> > > are
> > > >> mandatory. The mapping is not necessarily direct and will generally
> be
> > > the
> > > >> result of the

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Far be it from me to think that I know more than Jorge or Wes on this
subject. Sorry if my post gives that perception; that is clearly not my
intention. I'm just trying to defend the idea that when designing this kind
of transformation, it might be interesting to have a library to test
several mappings and evaluate them before doing a more direct
implementation if the performance is not there.

On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <
benjaminblodg...@gmail.com> wrote:

> He was trying to nicely say he knows way more than you, and your ideas
> will result in a low performance scheme no one will use in production
> ai/machine learning.
>
> Sent from my iPhone
>
> > On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
> benjaminblodg...@gmail.com> wrote:
> >
> > I think Jorge’s opinion has is that of an expert and him being humble
> is just being tactful.  Probably listen to Jorge on performance and
> architecture, even over Wes as he’s contributed more than anyone else and
> know the bleeding edge of low level performance stuff more than anyone.
> >
> > Sent from my iPhone
> >
> >> On Jul 28, 2022, at 12:03 PM, Laurent Quérel 
> wrote:
> >>
> >> Hi Jorge
> >>
> >> I don't think that the level of in-depth knowledge needed is the same
> >> between using a row-oriented internal representation and "Arrow" which
> not
> >> only changes the organization of the data but also introduces a set of
> >> additional mapping choices and concepts.
> >>
> >> For example, assuming that the initial row-oriented data source is a
> stream
> >> of nested assembly of structures, lists and maps. The mapping of such a
> >> stream to Protobuf, JSON, YAML, ... is straightforward because on both
> >> sides the logical representation is exactly the same, the schema is
> >> sometimes optional, the interest of building batches is optional, ... In
> >> the case of "Arrow" things are different - the schema and the batching
> are
> >> mandatory. The mapping is not necessarily direct and will generally be
> the
> >> result of the combination of several trade-offs (normalization vs
> >> denormalization representation, mapping influencing the compression
> rate,
> >> queryability with Arrow processors like DataFusion, ...). Note that
> some of
> >> these complexities are not intrinsically linked to the fact that the
> target
> >> format is column oriented. The ZST format (
> >> https://zed.brimdata.io/docs/formats/zst/) for example does not
> require an
> >> explicit schema definition.
> >>
> >> IMHO, having a library that allows you to easily experiment with
> different
> >> types of mapping (without having to worry about batching, dictionaries,
> >> schema definition, understanding how lists of structs are represented,
> ...)
> >> and to evaluate the results according to your specific goals has a value
> >> (especially if your criteria are compression ratio and queryability). Of
> >> course there is an overhead to such an approach. In some cases, at the
> end
> >> of the process, it will be necessary to manually perform this direct
> >> transformation between a row-oriented XYZ format and "Arrow". However,
> this
> >> effort will be done after a simple experimentation phase to avoid
> changes
> >> in the implementation of the converter which in my opinion is not so
> simple
> >> to implement with the current Arrow API.
> >>
> >> If the Arrow developer community is not interested in integrating this
> >> proposal, I plan to release two independent libraries (Go and Rust) that
> >> can be used on top of the standard "Arrow" libraries. This will have the
> >> advantage to evaluate if such an approach is able to raise interest
> among
> >> Arrow users.
> >>
> >> Best,
> >>
> >> Laurent
> >>
> >>
> >>
> >>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <
> >>> jorgecarlei...@gmail.com> wrote:
> >>>
> >>> Hi Laurent,
> >>>
> >>> I agree that there is a common pattern in converting row-based formats
> to
> >>> Arrow.
> >>>
> >>> Imho the difficult part is not to map the storage format to Arrow
> >>> specifically - it is to map the storage format to any in-memory (row-
> or
> >>> columnar- based) format, since it requires in-depth knowledge about
> the 2
> >>> formats (the so

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
an in-memory row format for
> analytics workloads)
>
> @Wes McKinney 
>
> I still think having a canonical in-memory row format (and libraries
> > to transform to and from Arrow columnar format) is a good idea — but
> > there is the risk of ending up in the tar pit of reinventing Avro.
> >
>
> afaik Avro does not have O(1) random access neither to its rows nor columns
> - records are concatenated back to back, every record's column is
> concatenated back to back within a record, and there is no indexing
> information on how to access a particular row or column. There are blocks
> of rows that reduce the cost of accessing large offsets, but imo it is far
> from the O(1) offered by Arrow (and expected by analytics workloads).
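
To illustrate the O(1) random access Jorge contrasts with Avro's
back-to-back records, here is a minimal plain-Go sketch of an Arrow-style
variable-length layout (a simplified model, not the actual Arrow Go
buffers): one contiguous values buffer plus an offsets buffer with n+1
entries, so any row of a column is reachable in constant time.

```go
package main

import "fmt"

// StringColumn models Arrow's variable-length binary layout: row i of the
// column is values[offsets[i]:offsets[i+1]]. No scanning of preceding rows
// is needed, which is what gives the O(1) access expected by analytics
// workloads.
type StringColumn struct {
	values  []byte
	offsets []int32 // len(offsets) == number of rows + 1
}

func (c StringColumn) Value(i int) string {
	return string(c.values[c.offsets[i]:c.offsets[i+1]])
}

func main() {
	col := StringColumn{
		values:  []byte("foobarbaz"),
		offsets: []int32{0, 3, 6, 9},
	}
	fmt.Println(col.Value(1)) // bar
}
```

A row-concatenated format without such an offsets index has to decode every
preceding record to locate record i, which is the limitation being
discussed above.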
>
> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
>
> Best,
> Jorge
>
> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel 
> wrote:
>
> > Let me clarify the proposal a bit before replying to the various previous
> > feedbacks.
> >
> >
> >
> > It seems to me that the process of converting a row-oriented data source
> > (row = set of fields or something more hierarchical) into an Arrow record
> > repeatedly raises the same challenges. A developer who must perform this
> > kind of transformation is confronted with the following questions and
> > problems:
> >
> > - Understanding the Arrow API which can be challenging for complex cases
> of
> > rows representing complex objects (list of struct, struct of struct,
> ...).
> >
> > - Decide which Arrow schema(s) will correspond to your data source. In
> some
> > complex cases it can be advantageous to translate the same row-oriented
> > data source into several Arrow schemas (e.g. OpenTelementry data
> sources).
> >
> > - Decide on the encoding of the columns to make the most of the
> > column-oriented format and thus increase the compression rate (e.g.
> define
> > the columns that should be represent as dictionaries).
> >
> >
> >
> > By experience, I can attest that this process is usually iterative. For
> > non-trivial data sources, arriving at the arrow representation that
> offers
> > the best compression ratio and is still perfectly usable and queryable
> is a
> > long and tedious process.
> >
> >
> >
> > I see two approaches to ease this process and consequently increase the
> > adoption of Apache Arrow:
> >
> > - Definition of a canonical in-memory row format specification that every
> > row-oriented data source provider can progressively adopt to get an
> > automatic translation into the Arrow format.
> >
> > - Definition of an integration library allowing to map any row-oriented
> > source into a generic row-oriented source understood by the converter. It
> > is not about defining a unique in-memory format but more about defining a
> > standard API to represent row-oriented data.
> >
> >
> >
> > In my opinion these two approaches are complementary. The first option
> is a
> > long-term approach targeting directly the data providers, which will
> > require to agree on this generic row-oriented format and whose adoption
> > will be more or less long. The second approach does not directly require
> > the collaboration of data source providers but allows an "integrator" to
> > perform this transformation painlessly with potentially several
> > representation trials to achieve the best results in his context.
> >
> >
> >
> > The current proposal is an implementation of the second approach, i.e. an
> > API that maps a row-oriented source XYZ into an intermediate row-oriented
> > representation understood mechanically by the translator. This translator
> > also adds a series of optimizations to make the most of the Arrow format.
> >
> >
> >
> > You can find multiple examples of a such transformation in the following
> > examples:
> >
> >-
> >
> >
> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
> >this example converts OTEL trace entities into their corresponding
> Arrow
> >IR. At the end of this conversion the method returns a collection of
> > Arrow
> >Records.
> >- A more complex example can be found here
> >
> >
> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
> > .
> >In this example a stream of OTEL univariate row-oriented metrics are
> >translate into multivariate row-oriented metrics and then
> automatically
> >transl

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Laurent Quérel
for row-based data as well as column based
> > data.  There was also a recent attempt at least in C++ to try to build
> > utilities to do these pivots but it was decided that it didn't add much
> > utility (it was added a comprehensive example).
> >
> > Thanks,
> > Micah
> >
> > On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel  >
> > wrote:
> >
> > > In the context of this OTEP
> > > <
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> > > >
> > > (OpenTelemetry
> > > Enhancement Proposal) I developed an integration layer on top of Apache
> > > Arrow (Go an Rust) to *facilitate the translation of row-oriented data
> > > stream into an arrow-based columnar representation*. In this particular
> > > case the goal was to translate all OpenTelemetry entities (metrics,
> logs,
> > > or traces) into Apache Arrow records. These entities can be quite
> complex
> > > and their corresponding Arrow schema must be defined on the fly. IMO,
> this
> > > approach is not specific to my specific needs but could be used in many
> > > other contexts where there is a need to simplify the integration
> between a
> > > row-oriented source of data and Apache Arrow. The trade-off is to have
> to
> > > perform the additional step of conversion to the intermediate
> > > representation, but this transformation does not require to understand
> the
> > > arcana of the Arrow format and allows to potentially benefit from
> > > functionalities such as the encoding of the dictionary "for free", the
> > > automatic generation of Arrow schemas, the batching, the multi-column
> > > sorting, etc.
> > >
> > >
> > > I know that JSON can be used as a kind of intermediate representation
> in
> > > the context of Arrow with some language specific implementation.
> Current
> > > JSON integrations are insufficient to cover the most complex scenarios
> and
> > > are not standardized; e.g. support for most of the Arrow data type,
> various
> > > optimizations (string|binary dictionaries, multi-column sorting),
> batching,
> > > integration with Arrow IPC, compression ratio optimization, ... The
> object
> > > of this proposal is to progressively cover these gaps.
> > >
> > > I am looking to see if the community would be interested in such a
> > > contribution. Above are some additional details on the current
> > > implementation. All feedback is welcome.
> > >
> > > 10K ft overview of the current implementation:
> > >
> > >1. Developers convert their row oriented stream into records based
> on
> > >the Arrow Intermediate Representation (AIR). At this stage the
> > > translation
> > >can be quite mechanical but if needed developers can decide for
> example
> > > to
> > >translate a map into a struct if that makes sense for them. The
> current
> > >implementation support the following arrow data types: bool, all
> uints,
> > > all
> > >ints, all floats, string, binary, list of any supported types, and
> > > struct
> > >of any supported types. Additional Arrow types could be added
> > > progressively.
> > >2. The row oriented record (i.e. AIR record) is then added to a
> > >RecordRepository. This repository will first compute a schema
> signature
> > > and
> > >will route the record to a RecordBatcher based on this signature.
> > >3. The RecordBatcher is responsible for collecting all the
> compatible
> > >AIR records and, upon request, the "batcher" is able to build an
> Arrow
> > >Record representing a batch of compatible inputs. In the current
> > >implementation, the batcher is able to convert string columns to
> > > dictionary
> > >based on a configuration. Another configuration allows to evaluate
> which
> > >columns should be sorted to optimize the compression ratio. The same
> > >optimization process could be applied to binary columns.
> > >4. Steps 1 through 3 can be repeated on the same RecordRepository
> > >instance to build new sets of arrow record batches. Subsequent
> > > iterations
> > >will be slightly faster due to different techniques used (e.g.
> object
> > >reuse, dictionary reuse and sorting, ...)
> > >
> > >
> > > The current Go implementation
> > > <https://github.com/lquerel/otel-arrow-adapter> (WIP) is currently
> part of
> > > this repo (see pkg/air package). If the community is interested, I
> could do
> > > a PR in the Arrow Go and Rust sub-projects.
> > >
>


-- 
Laurent Quérel


[proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-26 Thread Laurent Quérel
In the context of this OTEP
<https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
(OpenTelemetry
Enhancement Proposal) I developed an integration layer on top of Apache
Arrow (Go and Rust) to *facilitate the translation of row-oriented data
streams into an Arrow-based columnar representation*. In this particular
case the goal was to translate all OpenTelemetry entities (metrics, logs,
or traces) into Apache Arrow records. These entities can be quite complex
and their corresponding Arrow schemas must be defined on the fly. IMO, this
approach is not specific to my needs but could be used in many other
contexts where there is a need to simplify the integration between a
row-oriented data source and Apache Arrow. The trade-off is the additional
step of conversion to the intermediate representation, but this
transformation does not require understanding the arcana of the Arrow
format and makes it possible to benefit from functionality such as
dictionary encoding "for free", the automatic generation of Arrow schemas,
batching, multi-column sorting, etc.


I know that JSON can be used as a kind of intermediate representation in
the context of Arrow with some language specific implementation. Current
JSON integrations are insufficient to cover the most complex scenarios and
are not standardized; e.g., support for most of the Arrow data types, various
optimizations (string|binary dictionaries, multi-column sorting), batching,
integration with Arrow IPC, compression ratio optimization, etc. The objective
of this proposal is to progressively cover these gaps.

I am looking to see if the community would be interested in such a
contribution. Below are some additional details on the current
implementation. All feedback is welcome.

10K ft overview of the current implementation:

   1. Developers convert their row-oriented stream into records based on
   the Arrow Intermediate Representation (AIR). At this stage the translation
   can be quite mechanical, but if needed developers can decide, for example,
   to translate a map into a struct if that makes sense for them. The current
   implementation supports the following Arrow data types: bool, all uints, all
   ints, all floats, string, binary, lists of any supported type, and structs
   of any supported types. Additional Arrow types could be added progressively.
   2. The row-oriented record (i.e. AIR record) is then added to a
   RecordRepository. This repository first computes a schema signature and
   routes the record to a RecordBatcher based on this signature.
   3. The RecordBatcher is responsible for collecting all the compatible
   AIR records and, upon request, the "batcher" is able to build an Arrow
   Record representing a batch of compatible inputs. In the current
   implementation, the batcher can convert string columns to dictionaries
   based on a configuration. Another configuration makes it possible to
   evaluate which columns should be sorted to optimize the compression ratio.
   The same optimization process could be applied to binary columns.
   4. Steps 1 through 3 can be repeated on the same RecordRepository
   instance to build new sets of arrow record batches. Subsequent iterations
   will be slightly faster due to different techniques used (e.g. object
   reuse, dictionary reuse and sorting, ...)
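
Steps 1 and 2 above can be sketched in plain Go. The type and method names
below (`Field`, `Record`, `Signature`, `Repository`) are hypothetical
illustrations of the mechanism, not the actual pkg/air API: a record is a
set of named, typed values, and a canonical schema signature routes
compatible records to the same batch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Field and Record mimic the shape of an AIR record: named, typed values,
// possibly supplied in any order by the developer.
type Field struct {
	Name, Type string
	Value      any
}
type Record struct{ Fields []Field }

// Signature computes a canonical schema signature (sorted name:type pairs),
// used to route compatible records to the same batcher regardless of the
// order in which fields were added.
func (r Record) Signature() string {
	parts := make([]string, len(r.Fields))
	for i, f := range r.Fields {
		parts[i] = f.Name + ":" + f.Type
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// Repository groups records by signature; each group would then be handed
// to a RecordBatcher that builds one Arrow record batch per compatible set.
type Repository struct{ batches map[string][]Record }

func (repo *Repository) Add(r Record) {
	if repo.batches == nil {
		repo.batches = map[string][]Record{}
	}
	sig := r.Signature()
	repo.batches[sig] = append(repo.batches[sig], r)
}

func main() {
	var repo Repository
	repo.Add(Record{Fields: []Field{{"name", "string", "a"}, {"count", "i64", 1}}})
	repo.Add(Record{Fields: []Field{{"count", "i64", 2}, {"name", "string", "b"}}})
	fmt.Println(len(repo.batches)) // 1: same signature despite field order
}
```

The real implementation additionally derives the Arrow schema from the
signature and applies the dictionary and sorting optimizations described in
step 3 before emitting each batch.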


The current Go implementation
<https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of
this repo (see the pkg/air package). If the community is interested, I could do
a PR in the Arrow Go and Rust sub-projects.