Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Micah Kornfield
Please note this message and the previous one from the author violate our Code
of Conduct [1].  Specifically "Do not insult or put down other
participants."  Please try to be professional in communications and focus
on the technical issues at hand.

[1] https://www.apache.org/foundation/policies/conduct.html



On Thu, Jul 28, 2022 at 12:16 PM Benjamin Blodgett <
benjaminblodg...@gmail.com> wrote:

> He was trying to nicely say he knows way more than you, and your ideas
> will result in a low-performance scheme no one will use in production
> AI/machine learning.
>
> Sent from my iPhone
>
> > On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
> benjaminblodg...@gmail.com> wrote:
> >
> > I think Jorge’s opinion is that of an expert and him being humble
> is just being tactful.  Probably listen to Jorge on performance and
> architecture, even over Wes, as he’s contributed more than anyone else and
> knows the bleeding edge of low-level performance stuff more than anyone.
> >
> > Sent from my iPhone
> >
> >> On Jul 28, 2022, at 12:03 PM, Laurent Quérel 
> wrote:
> >>
> >> Hi Jorge
> >>
> >> I don't think that the level of in-depth knowledge needed is the same
> >> between using a row-oriented internal representation and "Arrow" which
> not
> >> only changes the organization of the data but also introduces a set of
> >> additional mapping choices and concepts.
> >>
> >> For example, assume that the initial row-oriented data source is a stream
> >> of nested assemblies of structs, lists, and maps. The mapping of such a
> >> stream to Protobuf, JSON, YAML, ... is straightforward because on both
> >> sides the logical representation is exactly the same, the schema is
> >> sometimes optional, batching is optional, ... In
> >> the case of "Arrow" things are different - the schema and the batching
> are
> >> mandatory. The mapping is not necessarily direct and will generally be
> the
> >> result of the combination of several trade-offs (normalized vs.
> >> denormalized representation, mapping influencing the compression
> rate,
> >> queryability with Arrow processors like DataFusion, ...). Note that
> some of
> >> these complexities are not intrinsically linked to the fact that the
> target
> >> format is column oriented. The ZST format (
> >> https://zed.brimdata.io/docs/formats/zst/) for example does not
> require an
> >> explicit schema definition.
> >>
> >> IMHO, having a library that allows you to easily experiment with
> different
> >> types of mapping (without having to worry about batching, dictionaries,
> >> schema definition, understanding how lists of structs are represented,
> ...)
> >> and to evaluate the results according to your specific goals has value
> >> (especially if your criteria are compression ratio and queryability). Of
> >> course there is an overhead to such an approach. In some cases, at the
> end
> >> of the process, it will be necessary to manually perform this direct
> >> transformation between a row-oriented XYZ format and "Arrow". However,
> this
> >> effort will be done after a simple experimentation phase to avoid
> changes
> >> in the implementation of the converter which in my opinion is not so
> simple
> >> to implement with the current Arrow API.
> >>
> >> If the Arrow developer community is not interested in integrating this
> >> proposal, I plan to release two independent libraries (Go and Rust) that
> >> can be used on top of the standard "Arrow" libraries. This will make it
> >> possible to evaluate whether such an approach is able to raise interest among
> >> Arrow users.
> >>
> >> Best,
> >>
> >> Laurent
> >>
> >>
> >>
> >>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <
> >>> jorgecarlei...@gmail.com> wrote:
> >>>
> >>> Hi Laurent,
> >>>
> >>> I agree that there is a common pattern in converting row-based formats
> to
> >>> Arrow.
> >>>
> >>> Imho the difficult part is not to map the storage format to Arrow
> >>> specifically - it is to map the storage format to any in-memory (row-
> or
> >>> columnar- based) format, since it requires in-depth knowledge about
> the 2
> >>> formats (the source format and the target format).
> >>>
> >>> - Understanding the Arrow API which can be challenging for complex
> cases of
>  rows representing complex objects (list of struct, struct of struct,
> >>> ...).
> 
> >>>
> >>> the developer would have the same problem - just shifted around - they
> now
> >>> need to convert their complex objects to the intermediary
> representation.
> >>> Whether it is more "difficult" or "complex" to learn than Arrow is an
> open
> >>> question, but we would essentially be shifting the problem from
> "learning
> >>> Arrow" to "learning the Intermediate in-memory".
> >>>
> >>> @Micah Kornfield, as described before my goal is not to define a memory
>  layout specification but more to define an API and a translation
> >>> mechanism
>  able to take this intermediate representation (list of generic o

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-07-29 Thread Sasha Krassovsky
Hi,
I’ve also had quite a few thoughts on this, as it is somewhat strange at the 
moment (within the context of Acero at least) that e.g. IntegerDictionary is 
not the same type as an Integer, meaning that we have to manually cast between 
the two or reject any operation that mixes the two. I was mucking around with 
implementing my own Arrow type system (just to see how I’d design it myself) and 
came up with a “three-level” type system. 

Specifically we have (a rough sketch follows the list): 
- Logical type: is it an int, float, decimal, timestamp, struct, utf8, etc. A 
schema only specifies the logical types of fields.
- Physical type: a physical instantiation of a logical type. This would 
parameterize the logical type with things like bit widths, precision, timestamp 
units, offset size, etc. Every element within an array must have the same 
physical type, but batches with different physical types may conform to the 
same schema. 
- Array type: the physical layout of the arrays themselves, i.e. how many buffers, 
what each buffer represents, etc. This also includes stuff like RLE and 
Dictionary array types, and this is where other encodings would go. RLE for 
example just has a `run_lengths` buffer and a child array. Similar with 
dictionary. This is a very powerful way of composing encodings, as you could 
now have an RLE buffer with a child dictionary buffer which itself has a child 
RLE buffer (or something like that).
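
Here is that sketch, written as hypothetical C++ (none of these names exist in 
Arrow today; they are purely illustrative):

    #include <memory>

    // Level 1: logical type -- the only thing a schema declares.
    enum class LogicalType { Int, Float, Decimal, Timestamp, Utf8, Struct };

    // Level 2: physical type -- a parameterized instantiation of a logical
    // type. Every element of an array shares one physical type, but two
    // batches with different physical types can conform to the same schema.
    struct PhysicalType {
      LogicalType logical;
      int bit_width = 0;   // e.g. 8/16/32/64 for Int
      int precision = 0;   // for Decimal
      int scale = 0;       // for Decimal
      // ... timestamp unit, offset width, and so on
    };

    // Level 3: array type -- how values are materialized: how many buffers,
    // what each buffer means, and the encoding. Encodings compose by
    // wrapping a child array type, e.g. RunLength(Dictionary(Plain)).
    struct ArrayType {
      enum class Encoding { Plain, Dictionary, RunLength } encoding;
      PhysicalType physical;               // physical type of decoded values
      std::shared_ptr<ArrayType> child;    // dictionary values / run values
    };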

The warts appear when some logical types are amenable to some encodings and 
others are not: bit packing, delta, and FOR for example work for integers but 
not strings, but I think this can be worked around fairly easily.  

Now, Arrow currently specifies physical types directly in the schema, which is 
fine, I think that’s how most database type systems work. However, given that 
Dict<Int8> is a different type from Int8, it seems that Arrow conflates the 
bottom two levels. I think what we really need to do is refactor the type 
system to separate out the array type from the physical type. Schemas will deal 
in physical types and actual materialized batches will have an associated array 
type. 

Let me know if I need to clarify anything, that was a lot of text :) 

Sasha Krassovsky

> On Jul 29, 2022, at 4:18 PM, Wes McKinney  wrote:
> 
> of the implementation when it comes to the IPC format and the C
> interface.
> 



[DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-07-29 Thread Wes McKinney
hi all,

Since we've been recently discussing adding new data types, memory
formats, or data encodings to Arrow, I wanted to bring up a more "big
picture" question around how we could support data whose encodings may
change throughout the lifetime of a data stream sent via the IPC
format (e.g. over Flight) or over the C data interface.

I can think of a few common encodings which could appear dynamically
in a stream, and apply to basically all Arrow data types:

* Constant (value is the same for all records in a batch)
* Dictionary
* Run-length encoded
* Plain

There are some other encodings that work only for certain data types
(e.g. FrameOfReference for integers).

Current Arrow record batches can either be all Plain encoded or all
Dictionary encoded, but the decision must be declared up front in the
schema. The dictionary can change, but the stream cannot stop
dictionary encoding.
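
For concreteness, this is what "declared up front" looks like with the C++ API
today (a minimal sketch; the field names are made up):

    #include <arrow/api.h>
    #include <iostream>

    int main() {
      // The encoding decision is baked into the schema before any data flows:
      // "city" must be dictionary-encoded in every batch of the stream, while
      // "population" must stay plain-encoded. A producer cannot switch either
      // field to a different encoding mid-stream.
      auto schema = arrow::schema({
          arrow::field("city",
                       arrow::dictionary(arrow::int32(), arrow::utf8())),
          arrow::field("population", arrow::int64()),
      });
      std::cout << schema->ToString() << std::endl;
      return 0;
    }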

In Parquet files, many writers will start out all columns using
dictionary encoding and then switch to plain encoding if the
dictionary exceeds a certain size. This has led to a certain
awkwardness when trying to return dictionary encoded data directly
from a Parquet file, since the "switchover" to Plain encoding is not
compatible with the way that Arrow schemas work.
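
As an illustration of that writer behavior, the Parquet C++ writer exposes the
dictionary fallback threshold through its writer properties (a minimal sketch;
the 1 MB value here is arbitrary, not the default):

    #include <parquet/properties.h>
    #include <iostream>

    int main() {
      // Columns start out dictionary-encoded; the writer falls back to plain
      // encoding once a column's dictionary page exceeds this limit.
      std::shared_ptr<parquet::WriterProperties> props =
          parquet::WriterProperties::Builder()
              .enable_dictionary()
              ->dictionary_pagesize_limit(1 * 1024 * 1024)
              ->build();
      std::cout << props->dictionary_pagesize_limit() << std::endl;
      return 0;
    }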

In general, it's not practical for all data sources to know up front
what is the "best" encoding, so being able to switch from one encoding
to another would give Arrow producers more flexibility in their
choice.

Micah Kornfield had previously put up a PR to add a new RecordBatch
metadata variant for the IPC format that would permit dynamic
encodings as well as sparseness (fields not present in the batch --
effectively "all null" -- currently "all null" fields in record
batches take up a lot of useless space)

https://github.com/apache/arrow/pull/4815

I think given the discussions that have been happening in and around
the project, that now would be a good time to rekindle this discussion
and see if we can come up with something that will work with the above
listed encodings and also provide for the beneficial sparseness
property. It is also timely since there are several PRs for RLE that
Tobias Zagorni has been working on [1], and knowing how new encodings
could be added to Arrow in general will have some bearing on the shape
of the implementation when it comes to the IPC format and the C
interface.

For the Arrow C ABI, I am not sure about whether sparseness could be
supported, but finding a mechanism to transmit dynamically-encoded
data without breaking the existing C ABI would be worthwhile also.

Thanks,
Wes

[1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle


[VOTE] Release Apache Arrow 9.0.0 - RC2

2022-07-29 Thread Krisztián Szűcs
Hi,

I would like to propose the following release candidate (RC2) of Apache
Arrow version 9.0.0. This is a release consisting of 507
resolved JIRA issues[1].

This release candidate is based on commit:
ea6875fd2a3ac66547a9a33c5506da94f3ff07f2 [2]

The source release rc2 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 9.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 9.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%209.0.0
[2]: 
https://github.com/apache/arrow/tree/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-9.0.0-rc2
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/9.0.0-rc2
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/9.0.0-rc2
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/9.0.0-rc2
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
I think either path:

* Canonical extension type
* First-class type in the Type union in Flatbuffers

would be OK. The canonical extension type option is the preferable
path here, I think, because it allows Arrow implementations without
any special handling for JSON to allow the data to pass through as
Binary or String. Implementations like C++ could see the extension
type metadata and construct an instance of arrow::Type::JSON /
JsonArray, etc., but when it gets serialized back to Parquet or Arrow
IPC it looks like binary/string (since JSON can be utf-16/utf-32,
right?) with additional field metadata.
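
To make the extension-type route concrete, a minimal C++ sketch (the class, the
"arrow.json" name, and the utf8-only storage are assumptions for illustration;
the actual canonical definition would come out of this discussion):

    #include <arrow/api.h>
    #include <arrow/extension_type.h>

    class JsonType : public arrow::ExtensionType {
     public:
      // Storage is utf8 here; per the above it could equally be binary.
      JsonType() : arrow::ExtensionType(arrow::utf8()) {}

      std::string extension_name() const override { return "arrow.json"; }

      bool ExtensionEquals(const arrow::ExtensionType& other) const override {
        return other.extension_name() == extension_name();
      }

      std::shared_ptr<arrow::Array> MakeArray(
          std::shared_ptr<arrow::ArrayData> data) const override {
        return std::make_shared<arrow::ExtensionArray>(data);
      }

      arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
          std::shared_ptr<arrow::DataType> storage_type,
          const std::string& serialized) const override {
        if (!storage_type->Equals(*arrow::utf8())) {
          return arrow::Status::Invalid("arrow.json expects utf8 storage");
        }
        return std::make_shared<JsonType>();
      }

      // Implementations that don't recognize "arrow.json" just see the utf8
      // storage column plus this metadata and can pass the data through.
      std::string Serialize() const override { return ""; }
    };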

On Fri, Jul 29, 2022 at 5:56 PM Pradeep Gollakota
 wrote:
>
> Thanks Micah!
>
> That's certainly one option we could use. It would likely be easier to
> implement at the outset. I wonder if something like arrow::json() would
> open up more options down the line.
>
> This brings up an interesting question of whether Parquet logical types
> should have a 1:1 mapping with Arrow logical types. Would we also want an
> arrow::bson()? I wouldn't think so. Maybe
> arrow::json({encoding=string/bson})? I'm not sure which would be better if
> we want to enable compute engines to manipulate the JSON data.
>
> On Fri, Jul 29, 2022 at 6:38 PM Micah Kornfield 
> wrote:
>
> > Just to be clear, I think we are referring to a "well known"/canonical
> > extension type [1] here?   I'd also be in favor of this (Disclaimer I'm a
> > colleague of Pradeep's)
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> >
> > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney  wrote:
> >
> > > This seems like a common-enough data type that having a first-class
> > > logical type would be a good idea (perhaps even more so than UUID!).
> > > Compute engines would be able to implement kernels that provide
> > > manipulations of JSON data similar to what you can do with jq or
> > > GraphQL.
> > >
> > > On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
> > >  wrote:
> > > >
> > > > Hi Team!
> > > >
> > > > I filed ARROW-17255 to support the JSON logical type in Arrow.
> > Initially
> > > > I'm only interested in C++ support that wraps a string. I imagine that
> > as
> > > > Arrow and Parquet get more sophisticated, we might want to do more
> > > > interesting things (shredding?) with the JSON.
> > > >
> > > > David mentioned that there have been discussions around other "common"
> > > > extensions like UUID. Is this something that the community would be
> > > > interested in? My goal at the moment is to be able to export data from
> > > > BigQuery to Parquet with the correct LogicalType set in the exported
> > > files.
> > > >
> > > > Thanks!
> > > > Pradeep
> > >
> >
>
>
> --
> Pradeep


Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Micah Kornfield
Just to be clear, I think we are referring to a "well known"/canonical
extension type [1] here?   I'd also be in favor of this (Disclaimer I'm a
colleague of Pradeep's)

[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types


On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney  wrote:

> This seems like a common-enough data type that having a first-class
> logical type would be a good idea (perhaps even more so than UUID!).
> Compute engines would be able to implement kernels that provide
> manipulations of JSON data similar to what you can do with jq or
> GraphQL.
>
> On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
>  wrote:
> >
> > Hi Team!
> >
> > I filed ARROW-17255 to support the JSON logical type in Arrow. Initially
> > I'm only interested in C++ support that wraps a string. I imagine that as
> > Arrow and Parquet get more sophisticated, we might want to do more
> > interesting things (shredding?) with the JSON.
> >
> > David mentioned that there have been discussions around other "common"
> > extensions like UUID. Is this something that the community would be
> > interested in? My goal at the moment is to be able to export data from
> > BigQuery to Parquet with the correct LogicalType set in the exported
> files.
> >
> > Thanks!
> > Pradeep
>


Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
This seems like a common-enough data type that having a first-class
logical type would be a good idea (perhaps even more so than UUID!).
Compute engines would be able to implement kernels that provide
manipulations of JSON data similar to what you can do with jq or
GraphQL.

On Fri, Jul 29, 2022 at 1:43 PM Pradeep Gollakota
 wrote:
>
> Hi Team!
>
> I filed ARROW-17255 to support the JSON logical type in Arrow. Initially
> I'm only interested in C++ support that wraps a string. I imagine that as
> Arrow and Parquet get more sophisticated, we might want to do more
> interesting things (shredding?) with the JSON.
>
> David mentioned that there have been discussions around other "common"
> extensions like UUID. Is this something that the community would be
> interested in? My goal at the moment is to be able to export data from
> BigQuery to Parquet with the correct LogicalType set in the exported files.
>
> Thanks!
> Pradeep


Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
(Nvm the libre2 error, It was my mistake)

On Fri, Jul 29, 2022 at 4:49 PM Li Jin  wrote:

> Also, if it is the google re2, is there a minimum version required?
> Currently my system has re2 from 20201101.
>
> On Fri, Jul 29, 2022 at 4:45 PM Li Jin  wrote:
>
>> Thanks David!
>>
>> I used the code in the sql flight Cmakelist. Unfortunately I hit another
>> error, I wonder if you happen to know a quick fix for this? (I don't
>> know about libre2, is it https://github.com/google/re2 or sth else?)
>>
>> "libre2.so: cannot open shared object file: No such file or directory"
>>
>> On Fri, Jul 29, 2022 at 4:09 PM David Li  wrote:
>>
>>> You'll also need to link to arrow_flight (and ditto for other libraries
>>> you may want to use).
>>>
>>> Note that due to ARROW-12175 you may need a bit of finagling if you're
>>> using CMake as your build system [1]. You can see a small workaround at [2].
>>>
>>> [1]: https://issues.apache.org/jira/browse/ARROW-12175
>>> [2]:
>>> https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31
>>>
>>> -David
>>>
>>> On Fri, Jul 29, 2022, at 15:53, Li Jin wrote:
>>> > (This is with Arrow 7.0.0)
>>> >
>>> > On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:
>>> >
>>> >> Hi!
>>> >>
>>> >> I saw this error when linking my code against arrow flight and
>>> suspect I
>>> >> didn't write my cmake correctly:
>>> >>
>>> >> "error: undefined reference to arrow::flight::Location::Location()"
>>> >>
>>> >> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake
>>> and
>>> >> linked my executable with arrow_shared. Is that enough to link arrow
>>> flight
>>> >> or do I need to do sth else?
>>> >>
>>> >> Li
>>> >>
>>>
>>


Re: CMake dependencies for arrow flight

2022-07-29 Thread David Li
Not sure what specifically is causing that error in your case, sorry. RE2 is 
indeed the regex engine. Arrow appears to select a newer version by default [1] 
but I'm not sure if this is *required*. However, the error indicates that the 
library just isn't there, or at least can't be found at runtime - so you may 
want to play with your LD_LIBRARY_PATH/ensure your Conda environment is 
active/etc. as appropriate.

[1]: https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt#L80

-David

On Fri, Jul 29, 2022, at 16:49, Li Jin wrote:
> Also, if it is the google re2, is there a minimum version required?
> Currently my system has re2 from 20201101.
>
> On Fri, Jul 29, 2022 at 4:45 PM Li Jin  wrote:
>
>> Thanks David!
>>
>> I used the code in the sql flight Cmakelist. Unfortunately I hit another
>> error, I wonder if you happen to know a quick fix for this? (I don't
>> know about libre2, is it https://github.com/google/re2 or sth else?)
>>
>> "libre2.so: cannot open shared object file: No such file or directory"
>>
>> On Fri, Jul 29, 2022 at 4:09 PM David Li  wrote:
>>
>>> You'll also need to link to arrow_flight (and ditto for other libraries
>>> you may want to use).
>>>
>>> Note that due to ARROW-12175 you may need a bit of finagling if you're
>>> using CMake as your build system [1]. You can see a small workaround at [2].
>>>
>>> [1]: https://issues.apache.org/jira/browse/ARROW-12175
>>> [2]:
>>> https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31
>>>
>>> -David
>>>
>>> On Fri, Jul 29, 2022, at 15:53, Li Jin wrote:
>>> > (This is with Arrow 7.0.0)
>>> >
>>> > On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:
>>> >
>>> >> Hi!
>>> >>
>>> >> I saw this error when linking my code against arrow flight and suspect
>>> I
>>> >> didn't write my cmake correctly:
>>> >>
>>> >> "error: undefined reference to arrow::flight::Location::Location()"
>>> >>
>>> >> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake
>>> and
>>> >> linked my executable with arrow_shared. Is that enough to link arrow
>>> flight
>>> >> or do I need to do sth else?
>>> >>
>>> >> Li
>>> >>
>>>
>>


Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
Also, if it is the google re2, is there a minimum version required?
Currently my system has re2 from 20201101.

On Fri, Jul 29, 2022 at 4:45 PM Li Jin  wrote:

> Thanks David!
>
> I used the code in the sql flight Cmakelist. Unfortunately I hit another
> error, I wonder if you happen to know a quick fix for this? (I don't
> know about libre2, is it https://github.com/google/re2 or sth else?)
>
> "libre2.so: cannot open shared object file: No such file or directory"
>
> On Fri, Jul 29, 2022 at 4:09 PM David Li  wrote:
>
>> You'll also need to link to arrow_flight (and ditto for other libraries
>> you may want to use).
>>
>> Note that due to ARROW-12175 you may need a bit of finagling if you're
>> using CMake as your build system [1]. You can see a small workaround at [2].
>>
>> [1]: https://issues.apache.org/jira/browse/ARROW-12175
>> [2]:
>> https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31
>>
>> -David
>>
>> On Fri, Jul 29, 2022, at 15:53, Li Jin wrote:
>> > (This is with Arrow 7.0.0)
>> >
>> > On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:
>> >
>> >> Hi!
>> >>
>> >> I saw this error when linking my code against arrow flight and suspect
>> I
>> >> didn't write my cmake correctly:
>> >>
>> >> "error: undefined reference to arrow::flight::Location::Location()"
>> >>
>> >> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake
>> and
>> >> linked my executable with arrow_shared. Is that enough to link arrow
>> flight
>> >> or do I need to do sth else?
>> >>
>> >> Li
>> >>
>>
>


Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
Thanks David!

I used the code in the sql flight Cmakelist. Unfortunately I hit another
error, I wonder if you happen to know a quick fix for this? (I don't
know about libre2, is it https://github.com/google/re2 or sth else?)

"libre2.so: cannot open shared object file: No such file or directory"

On Fri, Jul 29, 2022 at 4:09 PM David Li  wrote:

> You'll also need to link to arrow_flight (and ditto for other libraries
> you may want to use).
>
> Note that due to ARROW-12175 you may need a bit of finagling if you're
> using CMake as your build system [1]. You can see a small workaround at [2].
>
> [1]: https://issues.apache.org/jira/browse/ARROW-12175
> [2]:
> https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31
>
> -David
>
> On Fri, Jul 29, 2022, at 15:53, Li Jin wrote:
> > (This is with Arrow 7.0.0)
> >
> > On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:
> >
> >> Hi!
> >>
> >> I saw this error when linking my code against arrow flight and suspect I
> >> didn't write my cmake correctly:
> >>
> >> "error: undefined reference to arrow::flight::Location::Location()"
> >>
> >> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake
> and
> >> linked my executable with arrow_shared. Is that enough to link arrow
> flight
> >> or do I need to do sth else?
> >>
> >> Li
> >>
>


Re: CMake dependencies for arrow flight

2022-07-29 Thread David Li
You'll also need to link to arrow_flight (and ditto for other libraries you may 
want to use).

Note that due to ARROW-12175 you may need a bit of finagling if you're using 
CMake as your build system [1]. You can see a small workaround at [2].

[1]: https://issues.apache.org/jira/browse/ARROW-12175
[2]: 
https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31
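
For reference, the shape of the workaround is roughly the following (a sketch
only -- exact variable and target names differ between Arrow versions, so treat
[2] above as the authoritative example; flight_example/main.cc are placeholders):

    # Locate Arrow's CMake package; this defines the arrow_shared target.
    find_package(Arrow REQUIRED)

    # ARROW-12175: Flight has no first-class CMake config here, so find
    # libarrow_flight next to the imported arrow_shared library.
    get_target_property(ARROW_SHARED_LOCATION arrow_shared LOCATION)
    get_filename_component(ARROW_LIB_DIR "${ARROW_SHARED_LOCATION}" DIRECTORY)
    find_library(ARROW_FLIGHT_LIB arrow_flight PATHS "${ARROW_LIB_DIR}" REQUIRED)

    add_executable(flight_example main.cc)
    # Linking arrow_shared alone is not enough; arrow_flight is needed too.
    target_link_libraries(flight_example PRIVATE arrow_shared "${ARROW_FLIGHT_LIB}")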

-David

On Fri, Jul 29, 2022, at 15:53, Li Jin wrote:
> (This is with Arrow 7.0.0)
>
> On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:
>
>> Hi!
>>
>> I saw this error when linking my code against arrow flight and suspect I
>> didn't write my cmake correctly:
>>
>> "error: undefined reference to arrow::flight::Location::Location()"
>>
>> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake and
>> linked my executable with arrow_shared. Is that enough to link arrow flight
>> or do I need to do sth else?
>>
>> Li
>>


Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
(This is with Arrow 7.0.0)

On Fri, Jul 29, 2022 at 3:52 PM Li Jin  wrote:

> Hi!
>
> I saw this error when linking my code against arrow flight and suspect I
> didn't write my cmake correctly:
>
> "error: undefined reference to arrow::flight::Location::Location()"
>
> I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake and
> linked my executable with arrow_shared. Is that enough to link arrow flight
> or do I need to do sth else?
>
> Li
>


CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
Hi!

I saw this error when linking my code against arrow flight and suspect I
didn't write my cmake correctly:

"error: undefined reference to arrow::flight::Location::Location()"

I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake and
linked my executable with arrow_shared. Is that enough to link arrow flight
or do I need to do sth else?

Li


Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Gavin Ray
> there are scalar api functions that can be logically used to process rows
of data, but they are executed on columnar batches of data.
> As mentioned previously it is better to have an API that applies row
level transformations than to have an intermediary row level memory format.

Another way of thinking about this maybe is that the API would be something
of a "Row-based Facade" over underlying columnar memory, right?

As an end-user, for instance, I probably don't care much about what happens
under the hood.
On the surface, I'd just like to be able to mentally work with rows and be
able to load data in the shape of "Map<>", and "Collection", etc

Given the disclaimer that it'd be more efficient not to start from the
row-based data (i.e., in your JDBC ResultSet processing, construct Arrow
results directly instead of serializing to List)
But if you already have data in this shape, maybe from an API you don't
control, then having a row-based facade over columnar APIs would be really
convenient and ergonomic.

That's my $0.02 anyways


On Fri, Jul 29, 2022 at 9:56 AM Lee, David 
wrote:

> In pyarrow.compute which is an extension of the C++ implementation there
> are scalar api functions that can be logically used to process rows of
> data, but they are executed on columnar batches of data.
>
> As mentioned previously it is better to have an API that applies row level
> transformations than to have an intermediary row level memory format.
>
> Sent from my iPad
>
> > On Jul 29, 2022, at 3:43 AM, Andrew Lamb  wrote:
> >
> > External Email: Use caution with links and attachments
> >
> >
> > I am +0 on a standard API -- in the Rust arrow-rs implementation we tend
> to
> > borrow inspiration from the C++ / Java interfaces and then create
> > appropriate Rust APIs.
> >
> > There is also a row based format in DataFusion [1] (Rust) and it is used
> to
> > implement certain GroupBy and Sorts (similarly to what Sasha Krassovsky
> > describes for Acero).
> >
> > I think row based formats are common in vectorized query engines for
> > operations that can't be easily vectorized (sorts, groups and joins),
> > though I am not sure how reusable those formats would be
> >
> > There are at least three uses that require slightly different layouts
> > 1. Comparing row formatted data for equality (where space efficiency is
> > important)
> > 2. Comparing row formatted data for comparisons (where collation is
> > important)
> > 3. Using row formatted data to hold intermediate aggregates (where word
> > alignment is important)
> >
> > So in other words, I am not sure how easy it would be to define a common
> > in-memory layout for rows.
> >
> > Andrew
> >
> > [1]
> >
> https://urldefense.com/v3/__https://github.com/apache/arrow-datafusion/blob/3cd62e9/datafusion/row/src/layout.rs*L29-L75__;Iw!!KSjYCgUGsB4!eacNf7LBCm3exjzmw63baxsIs0UpuyAHVbpiOU59jYjalL_GyR3HdMRD1O6zYKLe_omitJ2GZSb1q1tHhSXS$
> >
> >
> >
> >> On Fri, Jul 29, 2022 at 2:06 AM Laurent Quérel <
> laurent.que...@gmail.com>
> >> wrote:
> >>
> >> Hi Sasha,
> >> Thank you very much for this informative comment. It's interesting to
> see
> >> another use of a row-based API in the context of a query engine. I think
> >> that there is some thought to be given to whether or not it is possible
> to
> >> converge these two use cases into a single public row-based API.
> >>
> >> As a first reaction I would say that it is not necessarily easy to
> >> reconcile because the constraints and the goals to be optimized are
> >> relatively disjoint. If you see a way to do it I'm extremely interested.
> >>
> >> If I understand correctly, in your case, you want to optimize the
> >> conversion from column to row representation and vice versa (a kind of
> >> bidirectional projection). Having a SIMD implementation of these
> >> conversions is just fantastic. However it seems that in your case there
> is
> >> no support for nested types yet and I feel like there is no public API
> to
> >> build rows in a simple and ergonomic way outside this bridge with the
> >> column-based representation.
> >>
> >> In the use case I'm trying to solve, the criteria to optimize are 1)
> expose
> >> a row-based API that offers the least amount of friction in the process
> of
> >> converting any row-based source to Arrow, which implies an easy-to-use
> API
> >> and support for nested types, 2) make it easy to create an efficient
> Arrow
> >> schema by automating dictionary creation and multi-column sorting in a
> way
> >> that makes Arrow easy to use for the casual user.
> >>
> >> The criteria to be optimized seem relatively disjointed to me but again
> I
> >> would be willing to dig with you a solution that offers a good
> compromise
> >> for these two use cases.
> >>
> >> Best,
> >> Laurent
> >>
> >>
> >>
> >> On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <
> >> krassovskysa...@gmail.com>
> >> wrote:
> >>
> >>> Hi everyone,
> >>> I just wanted to chime in that we already do have a form of
> row-oriented
> >>>

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Lee, David
In pyarrow.compute, which is an extension of the C++ implementation, there are 
scalar API functions that can logically be used to process rows of data, but 
they are executed on columnar batches of data.

As mentioned previously it is better to have an API that applies row level 
transformations than to have an intermediary row level memory format.
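
For illustration, the C++ analogue of what those pyarrow.compute scalar
functions wrap (a minimal sketch; the data and values are made up):

    #include <arrow/api.h>
    #include <arrow/compute/api.h>
    #include <iostream>
    #include <vector>

    // Scalar kernels are row-wise in their semantics (each output element
    // depends only on the corresponding input elements) but they execute
    // over whole columns at once.
    arrow::Status RunExample() {
      arrow::Int64Builder builder;
      ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
      std::shared_ptr<arrow::Array> values;
      ARROW_RETURN_NOT_OK(builder.Finish(&values));

      // Logically "add 10 to each row", physically one vectorized pass.
      std::vector<arrow::Datum> args = {values, arrow::Datum(int64_t{10})};
      ARROW_ASSIGN_OR_RAISE(arrow::Datum result,
                            arrow::compute::CallFunction("add", args));
      std::cout << result.make_array()->ToString() << std::endl;
      return arrow::Status::OK();
    }

    int main() { return RunExample().ok() ? 0 : 1; }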

Sent from my iPad

> On Jul 29, 2022, at 3:43 AM, Andrew Lamb  wrote:
> 
> External Email: Use caution with links and attachments
> 
> 
> I am +0 on a standard API -- in the Rust arrow-rs implementation we tend to
> borrow inspiration from the C++ / Java interfaces and then create
> appropriate Rust APIs.
> 
> There is also a row based format in DataFusion [1] (Rust) and it is used to
> implement certain GroupBy and Sorts (similarly to what Sasha Krassovsky
> describes for Acero).
> 
> I think row based formats are common in vectorized query engines for
> operations that can't be easily vectorized (sorts, groups and joins),
> though I am not sure how reusable those formats would be
> 
> There are at least three uses that require slightly different layouts
> 1. Comparing row formatted data for equality (where space efficiency is
> important)
> 2. Comparing row formatted data for comparisons (where collation is
> important)
> 3. Using row formatted data to hold intermediate aggregates (where word
> alignment is important)
> 
> So in other words, I am not sure how easy it would be to define a common
> in-memory layout for rows.
> 
> Andrew
> 
> [1]
> https://urldefense.com/v3/__https://github.com/apache/arrow-datafusion/blob/3cd62e9/datafusion/row/src/layout.rs*L29-L75__;Iw!!KSjYCgUGsB4!eacNf7LBCm3exjzmw63baxsIs0UpuyAHVbpiOU59jYjalL_GyR3HdMRD1O6zYKLe_omitJ2GZSb1q1tHhSXS$
> 
> 
> 
>> On Fri, Jul 29, 2022 at 2:06 AM Laurent Quérel 
>> wrote:
>> 
>> Hi Sasha,
>> Thank you very much for this informative comment. It's interesting to see
>> another use of a row-based API in the context of a query engine. I think
>> that there is some thought to be given to whether or not it is possible to
>> converge these two use cases into a single public row-based API.
>> 
>> As a first reaction I would say that it is not necessarily easy to
>> reconcile because the constraints and the goals to be optimized are
>> relatively disjoint. If you see a way to do it I'm extremely interested.
>> 
>> If I understand correctly, in your case, you want to optimize the
>> conversion from column to row representation and vice versa (a kind of
>> bidirectional projection). Having a SIMD implementation of these
>> conversions is just fantastic. However it seems that in your case there is
>> no support for nested types yet and I feel like there is no public API to
>> build rows in a simple and ergonomic way outside this bridge with the
>> column-based representation.
>> 
>> In the use case I'm trying to solve, the criteria to optimize are 1) expose
>> a row-based API that offers the least amount of friction in the process of
>> converting any row-based source to Arrow, which implies an easy-to-use API
>> and support for nested types, 2) make it easy to create an efficient Arrow
>> schema by automating dictionary creation and multi-column sorting in a way
>> that makes Arrow easy to use for the casual user.
>> 
>> The criteria to be optimized seem relatively disjointed to me but again I
>> would be willing to dig with you a solution that offers a good compromise
>> for these two use cases.
>> 
>> Best,
>> Laurent
>> 
>> 
>> 
>> On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <
>> krassovskysa...@gmail.com>
>> wrote:
>> 
>>> Hi everyone,
>>> I just wanted to chime in that we already do have a form of row-oriented
>>> storage inside of `arrow/compute/row/row_internal.h`. It is used to store
>>> rows inside of GroupBy and Join within Acero. We also have utilities for
>>> converting to/from columnar storage (and AVX2 implementations of these
>>> conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
>>> useful to standardize this row-oriented format?
>>> 
>>> As far as I understand fixed-width rows would be trivially convertible
>>> into this representation (just a pointer to your array of structs), while
>>> variable-width rows would need a little bit of massaging (though not too
>>> much) to be put into this representation.
>>> 
>>> Sasha Krassovsky
>>> 
 On Jul 28, 2022, at 1:10 PM, Laurent Quérel 
>>> wrote:
 
 Thank you Micah for a very clear summary of the intent behind this
 proposal. Indeed, I think that clarifying from the beginning that this
 approach aims at facilitating experimentation more than efficiency in
>>> terms
 of performance of the transformation phase would have helped to better
 understand my objective.
 
 Regarding your question, I don't think there is a specific technical
>>> reason
 for such an integration in the core library. I was just thinking that
>> it
 would make this infrastructure easier to find for the us

Re: [Flight][Java][JDBC] IP clearance of Flight JDBC Driver

2022-07-29 Thread David Li
The vote/form are now done [1]. (There were a few points of clarification 
required.)

Up next: I will merge the PR into the branch, then merge master into the 
branch. After that we can open a final mega PR for review/merge into master.

[1]: https://lists.apache.org/thread/fjd4942rlccpcjj0cpz0obcpjqhwobtq

On Tue, Jul 12, 2022, at 12:13, James Duong wrote:
> Hi David,
>
> The Software Grant has been filled and sent to secret...@apache.org.
>
> Thanks for helping to move this along.
>
> On Tue, Jun 28, 2022 at 1:17 PM David Li  wrote:
>
>> It appears everyone has submitted an individual CLA.
>>
>> I have committed the outline of the IP clearance form. [1] I think we can
>> kick off the vote, too.
>>
>> As per the form, please remember to ensure that a Corporate CLA is
>> recorded if such is required to authorize their contributions under their
>> individual CLA.
>>
>> James, can you get the Software Grant filled? [2]
>>
>> [1]:
>> https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-flight-sql-jdbc-driver.xml
>> [2]: https://www.apache.org/licenses/contributor-agreements.html#grants
>>
>> -David
>>
>> On Thu, Apr 7, 2022, at 17:21, David Li wrote:
>> > Thanks James.
>> >
>> > For the CLAs: for Ballista at least it was deemed OK [1][2] since all
>> > remaining contributions were easily replaceable and the project was
>> > always clearly Apache licensed. Likely that applies here too. We can
>> > look over things and I can go find out who we can reach out to to
>> > confirm.
>> >
>> > Also the list of people may include too many people, since (at least
>> > spot checking) one person has only one commit and it appears to be a
>> > long-superseded Flight SQL commit, so it's possible we could prune the
>> > history
>> >
>> > [1]: https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
>> > [2]: https://lists.apache.org/thread/khk8b06t5wrsg2xmprcqglb7dl76x20r
>> >
>> > On Thu, Apr 7, 2022, at 17:02, James Duong wrote:
>> >> I have started a PR for merging to the new branch here:
>> >> https://github.com/apache/arrow/pull/12830
>> >>
>> >> Regarding individual CLAs, are these necessary for people that are no
>> >> longer at Dremio?
>> >>
>> >>
>> >> On Mon, Apr 4, 2022 at 11:52 AM Wes McKinney 
>> wrote:
>> >>
>> >>> A corporate CLA is not required. Individual CLAs are fine.
>> >>>
>> >>> Since Dremio is a US corporation and the IP for the JDBC driver is
>> >>> owned by Dremio (I assume that the contributors all have IP assignment
>> >>> agreements where their contributions are assigned to the corporation),
>> >>> it would be best to have a Software Grant. Dremio previously submitted
>> >>> a Software Grant for Gandiva.
>> >>>
>> >>> On Thu, Mar 31, 2022 at 10:05 PM Sutou Kouhei 
>> wrote:
>> >>> >
>> >>> > Hi,
>> >>> >
>> >>> > > - We submit a grant [3]. I believe James & co. do this
>> >>> > >   (this is step 3/4 in [1]) - is this correct, @Kou?
>> >>> > >   (Since you recently handled Julia.) And then we commit a
>> >>> > >   tarball in the incubator drop area (though, I don't
>> >>> > >   quite see how to do this, need to dig around)
>> >>> >
>> >>> > Oh, I didn't do this for Julia.
>> >>> > I also didn't do this when I donated GLib and Ruby:
>> >>> >
>> >>> >   *
>> https://incubator.apache.org/ip-clearance/arrow-ruby-library.html
>> >>> >   *
>> https://incubator.apache.org/ip-clearance/arrow-parquet-glib.html
>> >>> >   *
>> https://incubator.apache.org/ip-clearance/arrow-parquet-ruby.html
>> >>> >
>> >>> > I just collected individual CLAs for Julia and file an
>> >>> > individual CLA for me for GLib and Ruby.
>> >>> >
>> >>> > I also didn't commit a tarball. I just use pull requests in
>> >>> > GitHub or GitHub repository + commit ID.
>> >>> >
>> >>> >
>> >>> > Other items look OK to me.
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> > --
>> >>> > kou
>> >>> >
>> >>> > In 
>> >>> >   "Re: [Flight][Java][JDBC] IP clearance of Flight JDBC Driver" on
>> Wed,
>> >>> 30 Mar 2022 17:26:29 -0400,
>> >>> >   "David Li"  wrote:
>> >>> >
>> >>> > > So the process is at [1].
>> >>> > >
>> >>> > > I think the following needs to happen:
>> >>> > >
>> >>> > > - James, can you create a PR with the current state of the driver,
>> but
>> >>> targeted against the `flight-sql-jdbc` branch [2]? Please make sure to
>> >>> update files to have the Apache license preamble. We'll use this as the
>> >>> subject of the clearance process
>> >>> > > - All contributors need to file an individual CLA; also, if
>> necessary
>> >>> for Dremio, a Corporate CLA (I'll check for the individual CLAs soon,
>> and
>> >>> make a checklist on the PR)
>> >>> > > - I will fill out the outline form linked above (and validate
>> licenses
>> >>> of dependencies, etc)
>> >>> > > - We can hold the Arrow vote for acceptance
>> >>> > > - We submit a grant [3]. I believe James & co. do this (this is
>> step
>> >>> 3/4 in [1]) - is this correct, @Kou? (Since you recently handled
>> Julia.)
>> >>> And then we commit a tar

[DISCUSS] [RUST] object_store release planning / schedule

2022-07-29 Thread Andrew Lamb
Hi,

We have completed IP clearance, code merge, and CI for the Rust object
store implementation. One final unresolved discussion is when to release
new versions.

I would like to invite anyone with an opinion to discuss the proposal for a
new release on [1]

More details on the progress of the object_store integration can be found
in [2]

Thank you,
Andrew

[1] https://github.com/apache/arrow-rs/issues/2180
[2] https://github.com/apache/arrow-rs/issues/2030


Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
There has been a substantial amount of effort put into the arrow-rs Rust
Parquet implementation to handle the corner cases of nested structs and
lists, and all the fun of various levels of nullability.

Do let us know if you happen to try writing nested structures directly to
parquet and have issues.

Andrew

On Wed, Jul 27, 2022 at 6:56 PM Lee, David 
wrote:

> I think this has been addressed for both Parquet and Python to handle
> records including nested structures. Not sure about Rust and Go..
>
> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
>
> https://issues.apache.org/jira/browse/ARROW-1644
>
> [Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert
> list of records
>
>
> https://issues.apache.org/jira/browse/ARROW-6001?focusedCommentId=16891152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16891152
>
>
> -Original Message-
> From: Laurent Quérel 
> Sent: Tuesday, July 26, 2022 2:25 PM
> To: dev@arrow.apache.org
> Subject: [proposal] Arrow Intermediate Representation to facilitate the
> transformation of row-oriented data sources into Arrow columnar
> representation
>
> External Email: Use caution with links and attachments
>
>
> In the context of this OTEP
> <
> https://urldefense.com/v3/__https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md__;!!KSjYCgUGsB4!d5RAULeQOio5gpEATTYTuqB7l2iK_gF1tygtPHAGIvLAWTB46ILIazrANdOWeTbY_RqPH2bXNpKf1W1ZXPldz_4ga_8$
> > (OpenTelemetry Enhancement Proposal) I developed an integration layer on
> top of Apache Arrow (Go and Rust) to *facilitate the translation of
> row-oriented data streams into an arrow-based columnar representation*. In
> this particular case the goal was to translate all OpenTelemetry entities
> (metrics, logs, or traces) into Apache Arrow records. These entities can be
> quite complex and their corresponding Arrow schema must be defined on the
> fly. IMO, this approach is not specific to my specific needs but could be
> used in many other contexts where there is a need to simplify the
> integration between a row-oriented source of data and Apache Arrow. The
> trade-off is to have to perform the additional step of conversion to the
> intermediate representation, but this transformation does not require to
> understand the arcana of the Arrow format and allows to potentially benefit
> from functionalities such as the encoding of the dictionary "for free", the
> automatic generation of Arrow schemas, the batching, the multi-column
> sorting, etc.
>
>
> I know that JSON can be used as a kind of intermediate representation in
> the context of Arrow with some language specific implementation. Current
> JSON integrations are insufficient to cover the most complex scenarios and
> are not standardized; e.g. support for most of the Arrow data types, various
> optimizations (string|binary dictionaries, multi-column sorting), batching,
> integration with Arrow IPC, compression ratio optimization, ... The object
> of this proposal is to progressively cover these gaps.
>
> I am looking to see if the community would be interested in such a
> contribution. Above are some additional details on the current
> implementation. All feedback is welcome.
>
> 10K ft overview of the current implementation:
>
>1. Developers convert their row oriented stream into records based on
>the Arrow Intermediate Representation (AIR). At this stage the
> translation
>can be quite mechanical but if needed developers can decide for example
> to
>translate a map into a struct if that makes sense for them. The current
>implementation support the following arrow data types: bool, all uints,
> all
>ints, all floats, string, binary, list of any supported types, and
> struct
>of any supported types. Additional Arrow types could be added
> progressively.
>2. The row oriented record (i.e. AIR record) is then added to a
>RecordRepository. This repository will first compute a schema signature
> and
>will route the record to a RecordBatcher based on this signature.
>3. The RecordBatcher is responsible for collecting all the compatible
>AIR records and, upon request, the "batcher" is able to build an Arrow
>Record representing a batch of compatible inputs. In the current
>implementation, the batcher is able to convert string columns to
> dictionary
>based on a configuration. Another configuration allows to evaluate which
>columns should be sorted to optimize the compression ratio. The same
>optimization process could be applied to binary columns.
>4. Steps 1 through 3 can be repeated on the same RecordRepository
>instance to build new sets of arrow record batches. Subsequent
> iterations
>will be slightly faster due to different techniques used (e.g. object
>reuse, dictionary reuse and sorting, ...)
>
>
> The current Go implementation
> <
> https://urldefense.com/v3/__https://github.com/lquer

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
I am +0 on a standard API -- in the Rust arrow-rs implementation we tend to
borrow inspiration from the C++ / Java interfaces and then create
appropriate Rust APIs.

There is also a row based format in DataFusion [1] (Rust) and it is used to
implement certain GroupBy and Sorts (similarly to what Sasha Krassovsky
describes for Acero).

I think row based formats are common in vectorized query engines for
operations that can't be easily vectorized (sorts, groups and joins),
though I am not sure how reusable those formats would be

There are at least three uses that require slightly different layouts
1. Comparing row formatted data for equality (where space efficiency is
important)
2. Comparing row formatted data for ordering (where collation is
important)
3. Using row formatted data to hold intermediate aggregates (where word
alignment is important)

So in other words, I am not sure how easy it would be to define a common
in-memory layout for rows.

Andrew

[1]
https://github.com/apache/arrow-datafusion/blob/3cd62e9/datafusion/row/src/layout.rs#L29-L75



On Fri, Jul 29, 2022 at 2:06 AM Laurent Quérel 
wrote:

> Hi Sasha,
> Thank you very much for this informative comment. It's interesting to see
> another use of a row-based API in the context of a query engine. I think
> that there is some thought to be given to whether or not it is possible to
> converge these two use cases into a single public row-based API.
>
> As a first reaction I would say that it is not necessarily easy to
> reconcile because the constraints and the goals to be optimized are
> relatively disjoint. If you see a way to do it I'm extremely interested.
>
> If I understand correctly, in your case, you want to optimize the
> conversion from column to row representation and vice versa (a kind of
> bidirectional projection). Having a SIMD implementation of these
> conversions is just fantastic. However it seems that in your case there is
> no support for nested types yet and I feel like there is no public API to
> build rows in a simple and ergonomic way outside this bridge with the
> column-based representation.
>
> In the use case I'm trying to solve, the criteria to optimize are 1) expose
> a row-based API that offers the least amount of friction in the process of
> converting any row-based source to Arrow, which implies an easy-to-use API
> and support for nested types, 2) make it easy to create an efficient Arrow
> schema by automating dictionary creation and multi-column sorting in a way
> that makes Arrow easy to use for the casual user.
>
> The criteria to be optimized seem relatively disjointed to me but again I
> would be willing to dig with you a solution that offers a good compromise
> for these two use cases.
>
> Best,
> Laurent
>
>
>
> On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> wrote:
>
> > Hi everyone,
> > I just wanted to chime in that we already do have a form of row-oriented
> > storage inside of `arrow/compute/row/row_internal.h`. It is used to store
> > rows inside of GroupBy and Join within Acero. We also have utilities for
> > converting to/from columnar storage (and AVX2 implementations of these
> > conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
> > useful to standardize this row-oriented format?
> >
> > As far as I understand fixed-width rows would be trivially convertible
> > into this representation (just a pointer to your array of structs), while
> > variable-width rows would need a little bit of massaging (though not too
> > much) to be put into this representation.
> >
> > Sasha Krassovsky
> >
> > > On Jul 28, 2022, at 1:10 PM, Laurent Quérel 
> > wrote:
> > >
> > > Thank you Micah for a very clear summary of the intent behind this
> > > proposal. Indeed, I think that clarifying from the beginning that this
> > > approach aims at facilitating experimentation more than efficiency in
> > terms
> > > of performance of the transformation phase would have helped to better
> > > understand my objective.
> > >
> > > Regarding your question, I don't think there is a specific technical
> > reason
> > > for such an integration in the core library. I was just thinking that
> it
> > > would make this infrastructure easier to find for the users and that
> this
> > > topic was general enough to find its place in the standard library.
> > >
> > > Best,
> > > Laurent
> > >
> > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Laurent,
> > >> I'm retitling this thread to include the specific languages you seem
> to
> > be
> > >> targeting in the subject line to hopefully get more eyes from
> > maintainers
> > >> in those languages.
> > >>
> > >> Thanks for clarifying the goals.  If I can restate my understanding,
> the
> > >> intended use-case here is to provide easy (from the developer point of
> > >> view) adaptation of row based formats to Arrow.  The means of
> achieving
> > >> this is creating an API f

Re: [VOTE] Release Apache Arrow 9.0.0 - RC1

2022-07-29 Thread Sutou Kouhei
-1

Sorry. I found a problem in Linux packages.
I'm fixing this at
https://github.com/apache/arrow/pull/13739 .


Thanks,
-- 
kou

In 
  "[VOTE] Release Apache Arrow 9.0.0 - RC1" on Thu, 28 Jul 2022 16:47:33 +0200,
  Krisztián Szűcs  wrote:

> Hi,
> 
> I would like to propose the following release candidate (RC1) of Apache
> Arrow version 9.0.0. This is a release consisting of 501
> resolved JIRA issues[1].
> 
> This release candidate is based on commit:
> 6b59b2f498cd03e50c88d400a83cfc360fb3d1f1 [2]
> 
> The source release rc1 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 9.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 9.0.0 because...
> 
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%209.0.0
> [2]: 
> https://github.com/apache/arrow/tree/6b59b2f498cd03e50c88d400a83cfc360fb3d1f1
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-9.0.0-rc1
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/9.0.0-rc1
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/9.0.0-rc1
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/9.0.0-rc1
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]: 
> https://github.com/apache/arrow/blob/6b59b2f498cd03e50c88d400a83cfc360fb3d1f1/CHANGELOG.md
> [13]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates