Re: [VOTE][Format] Variable shape tensor canonical extension type

2023-10-06 Thread Rok Mihevc
Hey All,

We have 4 binding +1 votes, no non-binding +1 votes, and no -1 votes, so
the vote passes.

Thanks everyone for your work and participation on this!

As a follow up we will:
[ ] merge changes to the format (
https://github.com/apache/arrow/pull/37166/files)
[ ] merge C++ and Python implementation (
https://github.com/apache/arrow/pull/38008)


Rok

On Mon, Oct 2, 2023 at 4:25 PM Rok Mihevc  wrote:

> +1
> Thanks everyone for voting!
>
> I'd like to leave the vote open until Wednesday,
>
> Rok
>
> On Fri, Sep 29, 2023 at 8:58 PM Matt Topol  wrote:
>
>> +1
>>
>> Thanks for all the work here!
>>
>> On Fri, Sep 29, 2023 at 11:04 AM Dewey Dunnington
>>  wrote:
>>
>> > +1! Thank you for iterating on this with all of us!
>> >
>> > On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
>> >  wrote:
>> > >
>> > > +1
>> > > Thanks for pushing this through!
>> > >
>> > > On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc 
>> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Following the discussion [1][2] I would like to propose a vote to
>> add
>> > > > variable shape tensor canonical extension type language to
>> > > > CanonicalExtensions.rst [3] as written below.
>> > > > A draft C++ implementation and a Python wrapper can be seen here
>> [2].
>> > > >
>> > > > The vote will be open for at least 72 hours.
>> > > >
>> > > > [ ] +1 Accept this proposal
>> > > > [ ] +0
>> > > > [ ] -1 Do not accept this proposal because...
>> > > >
>> > > >
>> > > > [1]
>> https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
>> > > > [2] https://github.com/apache/arrow/pull/37166
>> > > > [3]
>> > > >
>> > > >
>> >
>> https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
>> > > >
>> > > >
>> > > > Variable shape tensor
>> > > > =====================
>> > > >
>> > > > * Extension name: `arrow.variable_shape_tensor`.
>> > > >
>> > > > * The storage type of the extension is: ``StructArray`` where struct
>> > > >   is composed of **data** and **shape** fields describing a single
>> > > >   tensor per row:
>> > > >
>> > > >   * **data** is a ``List`` holding tensor elements of a single
>> tensor.
>> > > > Data type of the list elements is uniform across the entire
>> column.
>> > > >   * **shape** is a ``FixedSizeList[ndim]`` of the tensor
>> shape
>> > > > where
>> > > > the size of the list ``ndim`` is equal to the number of
>> dimensions
>> > of
>> > > > the
>> > > > tensor.
>> > > >
>> > > > * Extension type parameters:
>> > > >
>> > > >   * **value_type** = the Arrow data type of individual tensor
>> elements.
>> > > >
>> > > >   Optional parameters describing the logical layout:
>> > > >
>> > > >   * **dim_names** = explicit names of tensor dimensions
>> > > > as an array. The length of it should be equal to the shape
>> > > > length and equal to the number of dimensions.
>> > > >
>> > > > ``dim_names`` can be used if the dimensions have well-known
>> > > > names and they map to the physical layout (row-major).
>> > > >
>> > > >   * **permutation**  = indices of the desired ordering of the
>> > > > original dimensions, defined as an array.
>> > > >
>> > > > The indices contain a permutation of the values [0, 1, .., N-1]
>> > where
>> > > > N is the number of dimensions. The permutation indicates which
>> > > > dimension of the logical layout corresponds to which dimension
>> of
>> > the
>> > > > physical tensor (the i-th dimension of the logical view
>> corresponds
>> > > > to the dimension with number ``permutation[i]`` of the physical
>> > > > tensor).
>> > > >
>> > > > Permutation can be useful in case the logical order of
>> > > > the tensor is a permutation of the physical order (row-major).
>> > > >
>> > > > When logical and physical layout are equal, the permutation will
>> > always
>> > > > be ([0, 1, .., N-1]) and can therefore be left out.
>> > > >
>> > > >   * **uniform_dimensions** = indices of dimensions whose sizes are
>> > > > guaranteed to remain constant. Indices are a subset of all
>> possible
>> > > > dimension indices ([0, 1, .., N-1]).
>> > > > The uniform dimensions must still be represented in the
>> ``shape``
>> > > > field,
>> > > > and must always be the same value for all tensors in the array
>> --
>> > this
>> > > > allows code to interpret the tensor correctly without accounting
>> > for
>> > > > uniform dimensions while still permitting optional optimizations
>> > that
>> > > > take advantage of the uniformity. ``uniform_dimensions`` can be
>> > left
>> > > > out,
>> > > > in which case it is assumed that all dimensions might be
>> variable.
>> > > >
>> > > >   * **uniform_shape** = shape of the dimensions that are guaranteed
>> to
>> > stay
>> > > > constant over all tensors in the array, with the shape of the
>> > ragged
>> > > > dimensions
>> > > > set to 0.
>> > > > An array containing a tensor with shape (2, 3, 4) and
>> > > > ``uniform_dimensions``
>> > 

Re: [VOTE][Format] Variable shape tensor canonical extension type

2023-10-06 Thread Joris Van den Bossche
Worth noting that there were some minor changes made to the spec while
the vote was active:

- The "uniform_dimensions" metadata key was removed, since this can
also be inferred from the "uniform_shape" information
- The shape of non-constant dimensions in the "uniform_shape" entry is
now represented by a "null" instead of "0"

(this is all about optional metadata)

Joris
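
The inference mentioned in the first point can be sketched in a few lines of
Python. This is only an illustration, assuming "uniform_shape" is given as a
list with None standing in for null; the helper names uniform_dimensions and
conforms are hypothetical and are not part of the spec or of any Arrow API.

```python
from typing import List, Optional, Sequence


def uniform_dimensions(uniform_shape: Sequence[Optional[int]]) -> List[int]:
    """Return the indices of dimensions whose size is fixed (not null)."""
    return [i for i, size in enumerate(uniform_shape) if size is not None]


def conforms(shape: Sequence[int], uniform_shape: Sequence[Optional[int]]) -> bool:
    """Check that one tensor's shape agrees with every non-null entry."""
    if len(shape) != len(uniform_shape):
        return False
    return all(u is None or u == s for s, u in zip(shape, uniform_shape))


# Hypothetical column: dimensions 0 and 2 are constant, dimension 1 is ragged.
uniform_shape = [400, None, 3]
shapes = [[400, 2, 3], [400, 5, 3]]

print(uniform_dimensions(uniform_shape))
print(all(conforms(s, uniform_shape) for s in shapes))
```

Since the fixed dimensions fall out of the non-null entries, a separate
"uniform_dimensions" key would carry no extra information.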

On Fri, 6 Oct 2023 at 13:07, Rok Mihevc  wrote:
>
> Hey All,
>
> We have 4 binding +1 votes, no non-binding +1 votes, and no -1 votes, so
> the vote passes.
>
> Thanks everyone for your work and participation on this!
>
> As a follow up we will:
> [ ] merge changes to the format (
> https://github.com/apache/arrow/pull/37166/files)
> [ ] merge C++ and Python implementation (
> https://github.com/apache/arrow/pull/38008)
>
>
> Rok
>
> On Mon, Oct 2, 2023 at 4:25 PM Rok Mihevc  wrote:
>
> > +1
> > Thanks everyone for voting!
> >
> > I'd like to leave the vote open until Wednesday,
> >
> > Rok
> >
> > On Fri, Sep 29, 2023 at 8:58 PM Matt Topol  wrote:
> >
> >> +1
> >>
> >> Thanks for all the work here!
> >>
> >> On Fri, Sep 29, 2023 at 11:04 AM Dewey Dunnington
> >>  wrote:
> >>
> >> > +1! Thank you for iterating on this with all of us!
> >> >
> >> > On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
> >> >  wrote:
> >> > >
> >> > > +1
> >> > > Thanks for pushing this through!
> >> > >
> >> > > On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc 
> >> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Following the discussion [1][2] I would like to propose a vote to
> >> add
> >> > > > variable shape tensor canonical extension type language to
> >> > > > CanonicalExtensions.rst [3] as written below.
> >> > > > A draft C++ implementation and a Python wrapper can be seen here
> >> [2].
> >> > > >
> >> > > > The vote will be open for at least 72 hours.
> >> > > >
> >> > > > [ ] +1 Accept this proposal
> >> > > > [ ] +0
> >> > > > [ ] -1 Do not accept this proposal because...
> >> > > >
> >> > > >
> >> > > > [1]
> >> https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
> >> > > > [2] https://github.com/apache/arrow/pull/37166
> >> > > > [3]
> >> > > >
> >> > > >
> >> >
> >> https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> >> > > >
> >> > > >
> >> > > > Variable shape tensor
> >> > > > =====================
> >> > > >
> >> > > > * Extension name: `arrow.variable_shape_tensor`.
> >> > > >
> >> > > > * The storage type of the extension is: ``StructArray`` where struct
> >> > > >   is composed of **data** and **shape** fields describing a single
> >> > > >   tensor per row:
> >> > > >
> >> > > >   * **data** is a ``List`` holding tensor elements of a single
> >> tensor.
> >> > > > Data type of the list elements is uniform across the entire
> >> column.
> >> > > >   * **shape** is a ``FixedSizeList[ndim]`` of the tensor
> >> shape
> >> > > > where
> >> > > > the size of the list ``ndim`` is equal to the number of
> >> dimensions
> >> > of
> >> > > > the
> >> > > > tensor.
> >> > > >
> >> > > > * Extension type parameters:
> >> > > >
> >> > > >   * **value_type** = the Arrow data type of individual tensor
> >> elements.
> >> > > >
> >> > > >   Optional parameters describing the logical layout:
> >> > > >
> >> > > >   * **dim_names** = explicit names of tensor dimensions
> >> > > > as an array. The length of it should be equal to the shape
> >> > > > length and equal to the number of dimensions.
> >> > > >
> >> > > > ``dim_names`` can be used if the dimensions have well-known
> >> > > > names and they map to the physical layout (row-major).
> >> > > >
> >> > > >   * **permutation**  = indices of the desired ordering of the
> >> > > > original dimensions, defined as an array.
> >> > > >
> >> > > > The indices contain a permutation of the values [0, 1, .., N-1]
> >> > where
> >> > > > N is the number of dimensions. The permutation indicates which
> >> > > > dimension of the logical layout corresponds to which dimension
> >> of
> >> > the
> >> > > > physical tensor (the i-th dimension of the logical view
> >> corresponds
> >> > > > to the dimension with number ``permutation[i]`` of the physical
> >> > > > tensor).
> >> > > >
> >> > > > Permutation can be useful in case the logical order of
> >> > > > the tensor is a permutation of the physical order (row-major).
> >> > > >
> >> > > > When logical and physical layout are equal, the permutation will
> >> > always
> >> > > > be ([0, 1, .., N-1]) and can therefore be left out.
> >> > > >
> >> > > >   * **uniform_dimensions** = indices of dimensions whose sizes are
> >> > > > guaranteed to remain constant. Indices are a subset of all
> >> possible
> >> > > > dimension indices ([0, 1, .., N-1]).
> >> > > > The uniform dimensions must still be represented in the
> >> ``shape``
> >> > > > field,
> >> > > > and must always be the same value for all tensors in the array
> >> --
> >> > this
>

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Andrew Lamb
Given I don't see any input from the DuckDB / Velox development team (this
discussion seems primarily Arrow developers) I have filed a ticket in
DuckDB requesting their consideration[1] and tried to bump the attention of
the existing ticket in Velox[2]. Perhaps their input will provide a way
forward.

[1]: https://github.com/duckdb/duckdb/discussions/9248
[2]:
https://github.com/facebookincubator/velox/discussions/4362#discussioncomment-7209755



On Tue, Oct 3, 2023 at 3:24 AM Antoine Pitrou  wrote:

>
> On 03/10/2023 at 01:36, Matt Topol wrote:
> >
> > The cost of conversion is actually significantly higher than the actual
> > overhead of simply accessing the values in either representation, leading
> > to a high potential for bottleneck. For systems like Velox and DuckDB
> where
> > it's important to be able to return results as fast as possible, if they
> > have an operation with a throughput of several hundred MB/s or even G/s,
> > this conversion cost would become a huge bottleneck to returning results
> > given several cases of converting Raw Pointer views to the offset-based
> > views go as low as ~22MB/s.
>
> I think you misread the benchmark numbers. It's 22 MItems/s, not 22 MB/s.
> Since that number is for the kLongAndSeldomInlineable case, I assume the
> MB/s would be two or three orders of magnitude higher.
>
> Regards
>
> Antoine.
>
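
A rough worked conversion of the figure Antoine mentions. The thread does not
state the string lengths used in the benchmark, so the average lengths of 100
and 1000 bytes below are assumptions chosen only to illustrate the "two or
three orders of magnitude" estimate; bytes_per_second is a hypothetical helper.

```python
def bytes_per_second(items_per_second: float, avg_item_bytes: float) -> float:
    """Convert an items/s throughput into bytes/s for a given average item size."""
    return items_per_second * avg_item_bytes


ITEMS_PER_SECOND = 22e6  # the 22 MItems/s figure quoted in the thread

for avg_len in (100, 1000):  # assumed average string lengths, in bytes
    mb_per_s = bytes_per_second(ITEMS_PER_SECOND, avg_len) / 1e6
    print(f"{avg_len} bytes/item -> {mb_per_s:.0f} MB/s")
```

Under these assumptions the conversion rate is in the GB/s range rather than
tens of MB/s.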


Re: [DISCUSS][Rust][DataFusion][HiveMetaStore] Possible Metastore integration with Data Fusion

2023-10-06 Thread Andrew Lamb
You might be able to find relevant examples in the Delta Lake Rust
implementation[1], which I believe features DataFusion catalog integration.

You might also find the "Catalogs, Schemas, and Tables" section of the
library developers guide[2] helpful.

[1] https://github.com/delta-io/delta-rs
[2] https://arrow.apache.org/datafusion/library-user-guide/catalogs.html

On Wed, Oct 4, 2023 at 1:52 PM Kothapalli, Vamsi
 wrote:

> Thanks @chao ,@raphael for the suggestions and getting the discussion
> started 🙂
>
> I have a question about the client-side implementation: what would it
> look like, and which libraries are already available to build this
> feature? Any suggestions are much appreciated!
>
> For example, client code that talks to the Hive Metastore to get the list
> of tables and then registers them in the DataFusion runtime environment:
> is there a Rust library for that kind of functionality that you know of?
>
> To build this, should I start with a repo in datafusion-contrib, or start
> working on it in my own repo and, if everything looks good, move it to
> the datafusion-contrib GitHub org?
>
> Thanks,
> Vamsi
> 
> From: Chao Sun 
> Sent: Wednesday, October 4, 2023 8:35 AM
> To: dev@arrow.apache.org 
> Subject: Re: [DISCUSS][Rust][DataFusion][HiveMetaStore] Possible Metastore
> integration with Data Fusion
>
> I think for DataFusion to Hive metastore we just need to implement the HMS
> thrift API and create a client implementation. Agree this is best suited
> under datafusion-contrib.
>
> On Wed, Oct 4, 2023 at 4:06 AM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > I think [1] might be a good place to start and handle coordination for
> > this undertaking. I suspect it would probably want to live under the
> > datafusion-contrib organisation, similar to HDFS [2]
> >
> > Kind Regards,
> >
> > Raphael
> >
> > [1]: https://github.com/apache/arrow-datafusion/issues/2209
> > [2]: https://github.com/datafusion-contrib/datafusion-objectstore-hdfs
> >
> > On 04/10/2023 00:18, Kothapalli, Vamsi wrote:
> > > Hi devs,
> > >
> > > I would like to start a discussion about a possible integration of the
> > > Hive Metastore with DataFusion.
> > >
> > > This would be similar to the Glue catalog integration in
> > > https://github.com/apache/arrow-datafusion/issues/2206. Can anyone
> > > suggest how the code should look if I or someone else would like to
> > > work on building a Hive Metastore catalog provider for DataFusion?
> > >
> > > Thanks,
> > > Vamsi
> > >
> > >
> >
>


Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Mark Raasveldt
For the index vs pointer question - DuckDB went with pointers as they are more 
flexible, and DuckDB was designed to consume data (and strings) from a wide 
variety of formats in a wide variety of languages. Pointers allow us to easily
zero-copy from e.g. Python strings, R strings, Arrow strings, etc. The flip 
side of pointers is that ownership has to be handled very carefully. Our vector 
format is an execution-only format, and never leaves the internals of the 
engine. This greatly simplifies ownership as we are in complete control of what 
happens inside the engine. For an interchange format that is intended for 
handing data between engines, I can see this being more complicated and
verification being more important.
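
To make the index vs pointer distinction concrete, here is a minimal Python
sketch of an index-based 16-byte string view. The field layout (int32 length,
then either 12 inline bytes or a 4-byte prefix, buffer index and offset) is an
assumption taken from my reading of the Arrow columnar format, not from this
thread, and encode_view / decode_view are hypothetical helpers. A raw-pointer
view would store a memory address in place of the buffer index and offset,
which a consumer cannot bounds-check without knowing who owns that memory.

```python
import struct

INLINE_LIMIT = 12  # strings up to 12 bytes live inside the 16-byte view


def encode_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Encode one string as a 16-byte, index-based view."""
    length = len(value)
    if length <= INLINE_LIMIT:
        # int32 length followed by the bytes, zero-padded to 12
        return struct.pack("<i12s", length, value)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def decode_view(view: bytes, buffers) -> bytes:
    """Resolve a view back to its bytes; every access can be bounds-checked."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_LIMIT:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return buffers[buffer_index][offset:offset + length]


buffers = [b"a string longer than twelve bytes"]
long_view = encode_view(buffers[0], buffer_index=0, offset=0)
short_view = encode_view(b"duckdb", buffer_index=0, offset=0)

print(len(long_view), len(short_view))
print(decode_view(long_view, buffers))
print(decode_view(short_view, buffers))
```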

As for the actual change:

From an interchange perspective from DuckDB's side - the proposed zero-copy 
integration would definitely speed up the conversion of DuckDB string vectors 
to Arrow string vectors. In a recent benchmark that we have performed we have 
found string conversion to Arrow vectors to be a bottleneck in certain 
workloads, although we have not sufficiently researched if this could be 
improved in other ways. It is possible this can be alleviated without requiring 
changes to Arrow.

However - in general, a new string vector format is only useful if consumers 
also support the format. If the consumer immediately converts the strings back 
into the standard Arrow string representation then there is no benefit. The 
change will only move where the conversion happens (from inside DuckDB to 
inside the consumer). As such, this change is only useful if the broader Arrow 
ecosystem moves towards supporting the new string format.

From an execution perspective from DuckDB's side - it is unlikely that we will 
switch to using Arrow as an internal format at this stage of the project. While 
this change increases Arrow's utility as an intermediate execution format, that 
is more relevant to projects that are currently using Arrow in this manner or 
are planning to use Arrow in this manner.

I feel the broader question here is what is Arrow's intended use case - 
interchange or execution - as they are opposed in this discussion. This change 
improves Arrow's utility as an execution format at the expense of more 
stability in the interchange format. From my perspective Arrow is more useful 
as an interchange format. When different tools communicate with each other a 
standard is required. An execution format is generally not exposed outside of 
the internals of the execution engine. Engines can do whatever they want here - 
and a standard is perhaps not as useful.

On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > I don't think "we have to adjust the Arrow format so that existing
> > internal representations become Arrow-compliant without any
> > (re-)implementation effort" is a reasonable design principle.
> 
> I agree with this statement from Antoine -- given the Arrow community has
> standardized an addition to the format with StringView, I think it would
> help to get some input from those at DuckDB and Velox on their perspective
> 
> Andrew
> 
> 
> 
> 
> On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
>  wrote:
> 
> > Oh I'm with you on it being a precedent we want to be very careful about
> > setting, but if there isn't a meaningful performance difference, we may
> > be able to sidestep that discussion entirely.
> >
> > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > >
> > > > Even if performance were significantly better, I don't think it's a good
> > > > enough reason to add these representations to Arrow. By construction,
> > > > a standard cannot continuously chase the performance state of the art, it
> > > has to weigh the benefits of performance improvements against the
> > > increased cost for the ecosystem (for example the cost of adapting to
> > > frequent standard changes and a growing standard size).
> > >
> > > We have extension types which could reasonably be used for
> > > non-standard data types, especially the kind that are motivated by
> > > leading-edge performance research and innovation and come with unusual
> > > constraints (such as requiring trusting and dereferencing raw pointers
> > > embedded in data buffers). There could even be an argument for making
> > > some of them canonical extension types if there's enough anteriority
> > > in favor.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 02/10/2023 at 15:00, Raphael Taylor-Davies wrote:
> > >> I think what would really help would be some concrete numbers, do we
> > >> have any numbers comparing the performance of the offset and pointer
> > >> based representations? If there isn't a significant performance
> > >> difference between them, would the systems that currently use a
> > >> pointer-based approach be willing to meet us in the middle and switch to
> > >> an offset based encoding? This to me feels like it would be the best
> > >> outcome for the ecosystem as a whole.
> > >>
> > >

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Weston Pace
> I feel the broader question here is what is Arrow's intended use case -
interchange or execution

The line between interchange and execution is not always clear.  For
example, I think we would like Arrow to be considered as a standard for UDF
libraries.

On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt  wrote:

> For the index vs pointer question - DuckDB went with pointers as they are
> more flexible, and DuckDB was designed to consume data (and strings) from a
> wide variety of formats in a wide variety of languages. Pointers allow us
> to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> etc. The flip side of pointers is that ownership has to be handled very
> carefully. Our vector format is an execution-only format, and never leaves
> the internals of the engine. This greatly simplifies ownership as we are in
> complete control of what happens inside the engine. For an interchange
> format that is intended for handing data between engines, I can see this
> being more complicated and verification being more important.
>
> As for the actual change:
>
> From an interchange perspective from DuckDB's side - the proposed
> zero-copy integration would definitely speed up the conversion of DuckDB
> string vectors to Arrow string vectors. In a recent benchmark that we have
> performed we have found string conversion to Arrow vectors to be a
> bottleneck in certain workloads, although we have not sufficiently
> researched if this could be improved in other ways. It is possible this can
> be alleviated without requiring changes to Arrow.
>
> However - in general, a new string vector format is only useful if
> consumers also support the format. If the consumer immediately converts the
> strings back into the standard Arrow string representation then there is no
> benefit. The change will only move where the conversion happens (from
> inside DuckDB to inside the consumer). As such, this change is only useful
> if the broader Arrow ecosystem moves towards supporting the new string
> format.
>
> From an execution perspective from DuckDB's side - it is unlikely that we
> will switch to using Arrow as an internal format at this stage of the
> project. While this change increases Arrow's utility as an intermediate
> execution format, that is more relevant to projects that are currently
> using Arrow in this manner or are planning to use Arrow in this manner.
>
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution - as they are opposed in this discussion. This
> change improves Arrow's utility as an execution format at the expense of
> more stability in the interchange format. From my perspective Arrow is more
> useful as an interchange format. When different tools communicate with each
> other a standard is required. An execution format is generally not exposed
> outside of the internals of the execution engine. Engines can do whatever
> they want here - and a standard is perhaps not as useful.
>
> On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > I don't think "we have to adjust the Arrow format so that existing
> > > internal representations become Arrow-compliant without any
> > > (re-)implementation effort" is a reasonable design principle.
> >
> > I agree with this statement from Antoine -- given the Arrow community has
> > standardized an addition to the format with StringView, I think it would
> > help to get some input from those at DuckDB and Velox on their
> perspective
> >
> > Andrew
> >
> >
> >
> >
> > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Oh I'm with you on it being a precedent we want to be very careful
> about
> > > setting, but if there isn't a meaningful performance difference, we may
> > > be able to sidestep that discussion entirely.
> > >
> > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > >
> > > > > Even if performance were significantly better, I don't think it's a
> > > > > good enough reason to add these representations to Arrow. By
> > > > > construction, a standard cannot continuously chase the performance
> > > > > state of the art, it
> > > > has to weigh the benefits of performance improvements against the
> > > > increased cost for the ecosystem (for example the cost of adapting to
> > > > frequent standard changes and a growing standard size).
> > > >
> > > > We have extension types which could reasonably be used for
> > > > non-standard data types, especially the kind that are motivated by
> > > > leading-edge performance research and innovation and come with
> unusual
> > > > constraints (such as requiring trusting and dereferencing raw
> pointers
> > > > embedded in data buffers). There could even be an argument for making
> > > > some of them canonical extension types if there's enough anteriority
> > > > in favor.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > > On 02/10/2023 at 15:00, Raphael Taylor-Davies wrote:
> > > >> I think what would really help would be some

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Felipe Oliveira Carvalho
Hello,

Since existing C Data Interface integrations sometimes don't parse beyond
the first `l` (or `L`) I'm going to start a new [VOTE] thread with Dewey's
suggestion:

+vl and +vL

If anyone objects to that and has a different suggestion, reply here so I
don't have to spam the list with too many new threads.
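
The parsing concern can be illustrated with a small Python sketch. It assumes
"+l" and "+L" are the existing list and large list format strings and treats
"+vl" / "+vL" as the strings proposed here; parse_lenient, parse_strict and
rejects are hypothetical helpers, not code from any Arrow implementation.

```python
STRICT_FORMATS = {
    "+l": "list",
    "+L": "large_list",
    "+vl": "list_view",        # proposed in this thread
    "+vL": "large_list_view",  # proposed in this thread
}


def parse_strict(fmt: str) -> str:
    """Accept only an exact, fully consumed format string."""
    if fmt not in STRICT_FORMATS:
        raise ValueError(f"unsupported format string: {fmt!r}")
    return STRICT_FORMATS[fmt]


def parse_lenient(fmt: str) -> str:
    """Mimic a consumer that stops reading after the first 'l' or 'L'."""
    if fmt.startswith("+l"):
        return "list"
    if fmt.startswith("+L"):
        return "large_list"
    raise ValueError(f"unsupported format string: {fmt!r}")


def rejects(parser, fmt: str) -> bool:
    """Return True if the parser refuses the format string."""
    try:
        parser(fmt)
    except ValueError:
        return True
    return False


print(parse_lenient("+lv"))           # silently treated as a plain list
print(rejects(parse_lenient, "+vl"))  # the lenient consumer fails loudly
print(parse_strict("+vl"))
```

With a distinct leading character, a consumer that only inspects the first
character after "+" fails loudly instead of silently importing a list view as
a plain list.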

--
Felipe

On Thu, Oct 5, 2023 at 6:49 PM Dewey Dunnington
 wrote:

> I won't belabour the point any more, but the difference in layout
> between a list and a list view is consequential enough to deserve its
> own top-level character in my opinion. My vote would be +1 for +vl and
> +vL.
>
> On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho
>  wrote:
> >
> > > Union format strings share enough properties that having them in the
> > > same switch case doesn't result in additional complexity...lists and
> > > list views are completely different types (for the purposes of parsing
> > > the format string).
> >
> > Dense and sparse union differ a bit more than list and list-view.
> >
> > Not starting with `+l` for list-views would be a deviation from this
> > pattern started by unions.
> >
> >
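> > +-----------------+-----------------------------------+----+
> > | ``+ud:I,J,...`` | dense union with type ids I,J...  |    |
> > +-----------------+-----------------------------------+----+
> > | ``+us:I,J,...`` | sparse union with type ids I,J... |    |
> > +-----------------+-----------------------------------+----+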
> >
> > Is sharing prefixes an issue?
> >
> > To make this more concrete, these are the parser changes for supporting
> > `+lv` and `+Lv` as I proposed in the beginning:
> >
> > @@ -1097,9 +1101,9 @@ struct SchemaImporter {
> >      RETURN_NOT_OK(f_parser_.CheckHasNext());
> >      switch (f_parser_.Next()) {
> >        case 'l':
> > -        return ProcessListLike<ListType>();
> > +        return ProcessVarLengthList<false>();
> >        case 'L':
> > -        return ProcessListLike<LargeListType>();
> > +        return ProcessVarLengthList<true>();
> >        case 'w':
> >          return ProcessFixedSizeList();
> >        case 's':
> > @@ -1195,12 +1199,30 @@ struct SchemaImporter {
> >      return CheckNoChildren(type);
> >    }
> >
> > -  template <typename ListType>
> > -  Status ProcessListLike() {
> > -    RETURN_NOT_OK(f_parser_.CheckAtEnd());
> > -    RETURN_NOT_OK(CheckNumChildren(1));
> > -    ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> > -    type_ = std::make_shared<ListType>(field);
> > +  template <bool is_large_variation>
> > +  Status ProcessVarLengthList() {
> > +    if (f_parser_.AtEnd()) {
> > +      RETURN_NOT_OK(CheckNumChildren(1));
> > +      ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> > +      if constexpr (is_large_variation) {
> > +        type_ = large_list(field);
> > +      } else {
> > +        type_ = list(field);
> > +      }
> > +    } else {
> > +      if (f_parser_.Next() == 'v') {
> > +        RETURN_NOT_OK(CheckNumChildren(1));
> > +        ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
> > +        if constexpr (is_large_variation) {
> > +          type_ = large_list_view(field);
> > +        } else {
> > +          type_ = list_view(field);
> > +        }
> > +      } else {
> > +        return f_parser_.Invalid();
> > +      }
> > +    }
> > +
> >      return Status::OK();
> >    }
> >
> > --
> > Felipe
> >
> >
> > On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > I don't think the parsing will be a problem even in C. It's not like
> you
> > > have to backtrack anyway.
> > >
> > > +1 from me on Felipe's proposal.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 05/10/2023 at 20:33, Felipe Oliveira Carvalho wrote:
> > > > > This mailing list thread is going to be the discussion.
> > > > >
> > > > > The union types also use two characters, so I didn’t think it would be a
> > > > > problem.
> > > > >
> > > > > --
> > > > Felipe
> > > >
> > > > On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington
> > > 
> > > > wrote:
> > > >
> > > >> I'm sorry for missing earlier discussion on this or a PR into the
> > > >> format where this discussion may have occurred...is there a reason
> > > >> that +lv and +Lv were chosen over a single-character version (i.e.,
> > > >> maybe +v and +V)? A single-character version is (slightly) easier to
> > > >> parse in C.
> > > >>
> > > >> On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
> > > >>  wrote:
> > > >>>
> > > >>> Hello,
> > > >>>
> > > >>> I'm writing to propose "+lv" and "+Lv" as format strings for
> list-view
> > > >> and
> > > >>> large list-view arrays passing through the Arrow C data interface
> [1].
> > > >>>
> > > >>> The vote will be open for at least 72 hours.
> > > >>>
> > > >>> [ ] +1 - I'm in favor of this new C Data Format string
> > > >>> [ ] +0
> > > >>> [ ] -1 - I'm against adding this new format string because
> > > >>>
> > > >>> Thanks everyone!
> > > >>>
> > > >>> --
> > > >>> Felipe
> > > >>>
> > >

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Antoine Pitrou




On 06/10/2023 at 17:54, Felipe Oliveira Carvalho wrote:
> Hello,
>
> Since existing C Data Interface integrations sometimes don't parse beyond
> the first `l` (or `L`) I'm going to start a new [VOTE] thread with Dewey's
> suggestion:

Regardless of which format string we choose for ListView, a bug should
certainly be reported to these implementations. A robust implementation
should ensure the imported format string is conformant, otherwise there
is a risk that the exporter actually meant something else.

Regards

Antoine.

> +vl and +vL
>
> If anyone objects to that and has a different suggestion, reply here so I
> don't have to spam the list with too many new threads.
>
> --
> Felipe

On Thu, Oct 5, 2023 at 6:49 PM Dewey Dunnington
 wrote:


I won't belabour the point any more, but the difference in layout
between a list and a list view is consequential enough to deserve its
own top-level character in my opinion. My vote would be +1 for +vl and
+vL.

On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho
 wrote:



Union format strings share enough properties that having them in the
same switch case doesn't result in additional complexity...lists and
list views are completely different types (for the purposes of parsing
the format string).


Dense and sparse union differ a bit more than list and list-view.

Not starting with `+l` for list-views would be a deviation from this
pattern started by unions.



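+-----------------+-----------------------------------+----+
| ``+ud:I,J,...`` | dense union with type ids I,J...  |    |
+-----------------+-----------------------------------+----+
| ``+us:I,J,...`` | sparse union with type ids I,J... |    |
+-----------------+-----------------------------------+----+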


Is sharing prefixes an issue?

To make this more concrete, these are the parser changes for supporting
`+lv` and `+Lv` as I proposed in the beginning:

@@ -1097,9 +1101,9 @@ struct SchemaImporter {
     RETURN_NOT_OK(f_parser_.CheckHasNext());
     switch (f_parser_.Next()) {
       case 'l':
-        return ProcessListLike<ListType>();
+        return ProcessVarLengthList<false>();
       case 'L':
-        return ProcessListLike<LargeListType>();
+        return ProcessVarLengthList<true>();
       case 'w':
         return ProcessFixedSizeList();
       case 's':
@@ -1195,12 +1199,30 @@ struct SchemaImporter {
     return CheckNoChildren(type);
   }

-  template <typename ListType>
-  Status ProcessListLike() {
-    RETURN_NOT_OK(f_parser_.CheckAtEnd());
-    RETURN_NOT_OK(CheckNumChildren(1));
-    ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
-    type_ = std::make_shared<ListType>(field);
+  template <bool is_large_variation>
+  Status ProcessVarLengthList() {
+    if (f_parser_.AtEnd()) {
+      RETURN_NOT_OK(CheckNumChildren(1));
+      ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+      if constexpr (is_large_variation) {
+        type_ = large_list(field);
+      } else {
+        type_ = list(field);
+      }
+    } else {
+      if (f_parser_.Next() == 'v') {
+        RETURN_NOT_OK(CheckNumChildren(1));
+        ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+        if constexpr (is_large_variation) {
+          type_ = large_list_view(field);
+        } else {
+          type_ = list_view(field);
+        }
+      } else {
+        return f_parser_.Invalid();
+      }
+    }
+
     return Status::OK();
   }

--
Felipe


On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou wrote:

I don't think the parsing will be a problem even in C. It's not like you
have to backtrack anyway.

+1 from me on Felipe's proposal.

Regards

Antoine.


Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit :

This mailing list thread is going to be the discussion.

The union types also use two characters, so I didn’t think it would be a
problem.

—
Felipe

On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington wrote:


I'm sorry for missing earlier discussion on this or a PR into the
format where this discussion may have occurred...is there a reason
that +lv and +Lv were chosen over a single-character version (i.e.,
maybe +v and +V)? A single-character version is (slightly) easier to
parse in C.

On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
 wrote:


Hello,

I'm writing to propose "+lv" and "+Lv" as format strings for list-view and
large list-view arrays passing through the Arrow C data interface [1].


The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html












Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Neal Richardson
Agreed, it's unfortunately not just a simple tradeoff. We have discussed
this a bit in [1] and in several other threads around this topic. If we say
that Arrow is about interchange and not execution, so we shouldn't adopt
the pointer version that DuckDB uses, that means we're also making
interchange harder because of the need to convert from your internal format
to the Arrow format at the boundary. Adding the pointer version to the
Arrow format solves that, but creates costs elsewhere.

IIUC Ben's proposal tried to solve this tension by making it possible for
two systems to agree to use the pointer version and pass data without
serialization costs. That comes with its own risks and tradeoffs.

This feels like another case where the "canonical alternative layout"
discussed in [1] could be a way to formalize this variation and allow it to
be used but not required in all implementations. One way or another, we
need to find a way to balance the desire for Arrow to be a universal
standard with the risk of diluting the standard to accommodate every
project.

Neal

[1]: https://lists.apache.org/thread/djl9dbd7qmozxtjpfzby40gg23x0o3wo
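
To make the index-versus-pointer tradeoff in this thread concrete, here is a
minimal C++ sketch. It assumes a deliberately simplified layout: the struct
and function names (PointerView, IndexView, Resolve, ToIndexViews) are
hypothetical and are not Arrow's or DuckDB's actual definitions, and real
view layouts carry more (for example, inlined short strings), which is
omitted. The sketch only shows why a pointer view is cheap to build inside
one process, and what copying at the boundary looks like when the consumer
expects buffer-relative views.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified "pointer" view: refers to bytes by raw address.
// Whoever holds it must know that the pointed-to memory is still alive.
struct PointerView {
  uint32_t size;
  const char* data;
};

// Hypothetical, simplified "index" view: refers to bytes by buffer index and
// offset, so it is only meaningful together with the buffers it travels with.
struct IndexView {
  uint32_t size;
  uint32_t buffer_index;
  uint32_t offset;
};

// Resolve an index view against the buffers that accompany the array.
inline std::string_view Resolve(const IndexView& view,
                                const std::vector<std::string>& buffers) {
  return std::string_view(buffers[view.buffer_index].data() + view.offset,
                          view.size);
}

// Convert pointer views to index views by copying every string into one new
// buffer appended to `buffers`. This copy is the conversion cost at the
// boundary that the thread discusses.
inline std::vector<IndexView> ToIndexViews(const std::vector<PointerView>& in,
                                           std::vector<std::string>* buffers) {
  const uint32_t buffer_index = static_cast<uint32_t>(buffers->size());
  buffers->emplace_back();
  std::string& buffer = buffers->back();
  std::vector<IndexView> out;
  out.reserve(in.size());
  for (const PointerView& view : in) {
    // Record the offset before appending the bytes for this view.
    out.push_back(IndexView{view.size, buffer_index,
                            static_cast<uint32_t>(buffer.size())});
    buffer.append(view.data, view.size);
  }
  return out;
}
```

In this sketch the index views stay valid wherever the buffers go, while the
pointer views are only as safe as the producer's ownership guarantees, which
is the ownership concern raised for an interchange format.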

On Fri, Oct 6, 2023 at 11:47 AM Weston Pace  wrote:

> > I feel the broader question here is what is Arrow's intended use case -
> interchange or execution
>
> The line between interchange and execution is not always clear.  For
> example, I think we would like Arrow to be considered as a standard for UDF
> libraries.
>
> On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt  wrote:
>
> > For the index vs pointer question - DuckDB went with pointers as they are
> > more flexible, and DuckDB was designed to consume data (and strings) from
> > a wide variety of formats in a wide variety of languages. Pointers allow
> > us to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> > etc. The flip side of pointers is that ownership has to be handled very
> > carefully. Our vector format is an execution-only format, and never leaves
> > the internals of the engine. This greatly simplifies ownership as we are
> > in complete control of what happens inside the engine. For an interchange
> > format that is intended for handing data between engines, I can see this
> > being more complicated and having verification being more important.
> >
> > As for the actual change:
> >
> > From an interchange perspective from DuckDB's side - the proposed
> > zero-copy integration would definitely speed up the conversion of DuckDB
> > string vectors to Arrow string vectors. In a recent benchmark that we have
> > performed we have found string conversion to Arrow vectors to be a
> > bottleneck in certain workloads, although we have not sufficiently
> > researched if this could be improved in other ways. It is possible this
> > can be alleviated without requiring changes to Arrow.
> >
> > However - in general, a new string vector format is only useful if
> > consumers also support the format. If the consumer immediately converts
> > the strings back into the standard Arrow string representation then there
> > is no benefit. The change will only move where the conversion happens
> > (from inside DuckDB to inside the consumer). As such, this change is only
> > useful if the broader Arrow ecosystem moves towards supporting the new
> > string format.
> >
> > From an execution perspective from DuckDB's side - it is unlikely that we
> > will switch to using Arrow as an internal format at this stage of the
> > project. While this change increases Arrow's utility as an intermediate
> > execution format, that is more relevant to projects that are currently
> > using Arrow in this manner or are planning to use Arrow in this manner.
> >
> > I feel the broader question here is what is Arrow's intended use case -
> > interchange or execution - as they are opposed in this discussion. This
> > change improves Arrow's utility as an execution format at the expense of
> > more stability in the interchange format. From my perspective Arrow is
> > more useful as an interchange format. When different tools communicate
> > with each other a standard is required. An execution format is generally
> > not exposed outside of the internals of the execution engine. Engines can
> > do whatever they want here - and a standard is perhaps not as useful.
> >
> > On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > > I don't think "we have to adjust the Arrow format so that existing
> > > > internal representations become Arrow-compliant without any
> > > > (re-)implementation effort" is a reasonable design principle.
> > >
> > > I agree with this statement from Antoine -- given the Arrow community
> > > has standardized an addition to the format with StringView, I think it
> > > would help to get some input from those at DuckDB and Velox on their
> > > perspective
> > >
> > > Andrew
> > >
> > >
> > >
> > >
> > > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> > >  wrote:
> > >
> > > > Oh I'm with you on 

[Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Felipe Oliveira Carvalho
Hello,

I'm writing to propose "+vl" and "+vL" as format strings for list-view and
large list-view arrays passing through the Arrow C data interface [1].

The previous proposal was considered a bad idea because existing parsers of
these format strings might be looking at only the first `l` (or `L`) after
the `+` and assuming the classic list format from that alone, so now I'm
proposing we start with a `+v` as this prefix is not shared with any other
existing type so far.

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html
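
The concern behind the new proposal can be illustrated with a small C++
sketch. Everything here is hypothetical and for illustration only: the Kind
enum and the ParseLenient / ParseStrict functions are not part of Arrow or of
any real importer. ParseLenient models an existing parser that inspects only
the first character after `+`; ParseStrict models an importer that checks the
whole string and knows the proposed "+vl" and "+vL" strings.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical type tags for illustration; not Arrow's real type ids.
enum class Kind { kList, kLargeList, kListView, kLargeListView, kInvalid };

// Models an older, lenient parser: it dispatches on the first character
// after '+' and ignores whatever follows.
inline Kind ParseLenient(std::string_view format) {
  if (format.size() < 2 || format[0] != '+') return Kind::kInvalid;
  switch (format[1]) {
    case 'l':
      return Kind::kList;
    case 'L':
      return Kind::kLargeList;
    default:
      return Kind::kInvalid;
  }
}

// Models a strict parser that matches the entire format string, including
// the proposed list-view strings.
inline Kind ParseStrict(std::string_view format) {
  if (format == "+l") return Kind::kList;
  if (format == "+L") return Kind::kLargeList;
  if (format == "+vl") return Kind::kListView;
  if (format == "+vL") return Kind::kLargeListView;
  return Kind::kInvalid;
}

// Small self-check showing the difference between the two proposals.
inline void CheckExamples() {
  // With the earlier "+lv" proposal, the lenient parser silently treats a
  // list-view as a classic list.
  assert(ParseLenient("+lv") == Kind::kList);
  // With "+vl", the same lenient parser fails loudly instead.
  assert(ParseLenient("+vl") == Kind::kInvalid);
  assert(ParseStrict("+vl") == Kind::kListView);
  assert(ParseStrict("+vL") == Kind::kLargeListView);
}
```

Under these assumptions, a parser that predates list-views rejects "+vl"
outright rather than misreading the layout, which is the property the `+v`
prefix is meant to provide.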


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Benjamin Kietzman
+1

On Fri, Oct 6, 2023, 17:27 Felipe Oliveira Carvalho 
wrote:

> Hello,
>
> I'm writing to propose "+vl" and "+vL" as format strings for list-view and
> large list-view arrays passing through the Arrow C data interface [1].
>
> The previous proposal was considered a bad idea because existing parsers of
> these format strings might be looking at only the first `l` (or `L`) after
> the `+` and assuming the classic list format from that alone, so now I'm
> proposing we start with a `+v` as this prefix is not shared with any other
> existing type so far.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --
> Felipe
>
> [1] https://arrow.apache.org/docs/format/CDataInterface.html
>


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Matt Topol
+1

On Fri, Oct 6, 2023, 6:55 PM Benjamin Kietzman  wrote:

> +1
>
> On Fri, Oct 6, 2023, 17:27 Felipe Oliveira Carvalho 
> wrote:
>
> > Hello,
> >
> > I'm writing to propose "+vl" and "+vL" as format strings for list-view
> > and large list-view arrays passing through the Arrow C data interface
> > [1].
> >
> > The previous proposal was considered a bad idea because existing parsers
> > of these format strings might be looking at only the first `l` (or `L`)
> > after the `+` and assuming the classic list format from that alone, so
> > now I'm proposing we start with a `+v` as this prefix is not shared with
> > any other existing type so far.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of this new C Data Format string
> > [ ] +0
> > [ ] -1 - I'm against adding this new format string because
> >
> > Thanks everyone!
> >
> > --
> > Felipe
> >
> > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> >
>


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Dewey Dunnington
+1!

On Fri, Oct 6, 2023, 8:03 PM Matt Topol  wrote:

> +1
>
> On Fri, Oct 6, 2023, 6:55 PM Benjamin Kietzman 
> wrote:
>
> > +1
> >
> > On Fri, Oct 6, 2023, 17:27 Felipe Oliveira Carvalho  >
> > wrote:
> >
> > > Hello,
> > >
> > > I'm writing to propose "+vl" and "+vL" as format strings for list-view
> > > and large list-view arrays passing through the Arrow C data interface
> > > [1].
> > >
> > > The previous proposal was considered a bad idea because existing
> > > parsers of these format strings might be looking at only the first `l`
> > > (or `L`) after the `+` and assuming the classic list format from that
> > > alone, so now I'm proposing we start with a `+v` as this prefix is not
> > > shared with any other existing type so far.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --
> > > Felipe
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
> >
>


Re: [Java][Discuss]: consensus for JDK 8 deprecation

2023-10-06 Thread Micah Kornfield
I think given the stability of Arrow Java, dropping support probably makes
sense.  If a bug comes up or consumers really need new features we can
always make a patch release of an older version.

On Thu, Oct 5, 2023 at 3:13 PM Dane Pitkin 
wrote:

> I also learned today that Apache Spark has dropped support for Java 8 and
> 11 for their next release (v4.0)[1]. Should we consider dropping Java 11 as
> well?
>
> [1]https://github.com/apache/spark/pull/43005
>
> -Dane
>
> On Thu, Oct 5, 2023 at 3:30 PM Dane Pitkin  wrote:
>
> > I created a GH issue[1] proposing the removal of Java 8 support. It
> > would target the Arrow v15 release (~Jan 2024).
> >
> > IMO it would be in the best interest of the project for two major
> > reasons:
> > 1. Unblock the Java Platform Module System (JPMS)[2] implementation.
> > 2. Unblock Arrow from upgrading dependencies that no longer support
> >    Java 8. (See [1] for examples)
> >
> > Since Arrow Java has been quite stable, will Java 8 users be okay with
> > pinning Arrow to the last supported release (v14) if the Arrow project
> > ultimately decides to remove Java 8 support?
> >
> >
> > [1]https://github.com/apache/arrow/issues/38051
> > [2]https://en.wikipedia.org/wiki/Java_Platform_Module_System
> >
> > -Dane
> >
> > On Fri, Sep 15, 2023 at 12:26 PM Dane Pitkin 
> wrote:
> >
> >>> - As a low level library, users have to add specific flags to use
> >>>  Java 9 and up with Arrow to resolve issues with java.nio. This has
> >>>  been annoying for our customers constantly. If this is not resolved,
> >>>  I would say we may see a lot of complaints in the future.
> >>>
> >> I filed issue 37739[1] to track this, but it sounds like this can't be
> >> changed until Java 21 or 24.
> >>
> >>> - It seems that the EOL of Java 8 from Oracle is Dec 2030 [2]. A lot of
> >>>  users will still stay on it for a long time. At least this is true for
> >>>  our customers. So I am afraid we may not upgrade to newer versions
> >>>  of Arrow if it no longer supports Java 8.
> >>>
> >> Java 8 does have a long Extended Support timeline, but a recent
> >> report shows Java 11 increasing in adoption vs Java 8. "More than 56% of
> >> applications are now using Java 11 in production (up from 48% in 2022
> >> and 11% in 2020). Java 8 is a close second with nearly 33% of
> >> applications using it in production (down from 46% in 2022)."[2]
> >> I expect the Java ecosystem will find a way to move on from Java 8 much
> >> sooner than 2030, meaning many of Arrow's dependencies could drop
> >> support for Java 8 before then. At this point, Arrow may be forced to
> >> support a higher minimum Java version.
> >>
> >> That being said, it's hard to argue against real use cases. I'd be
> >> curious to hear what Java version other users of Arrow are using (and if
> >> there is a timeline to upgrade if on Java 8).
> >>
> >>
> >> [1]https://github.com/apache/arrow/issues/37739
> >> [2] https://newrelic.com/sites/default/files/2023-04/new-relic-2023-state-of-the-java-ecosystem-2023-04-20.pdf
> >>
> >>
> >> -Dane
> >>
> >>
> >> On Thu, Sep 14, 2023 at 11:45 AM Gang Wu  wrote:
> >>
> >>> Thanks for bringing this up!
> >>>
> >>> I have two concerns about dropping Java 8 support:
> >>> - As a low level library, users have to add specific flags [1] to use
> >>>  Java 9 and up with Arrow to resolve issues with java.nio. This has
> >>>  been annoying for our customers constantly. If this is not resolved,
> >>>  I would say we may see a lot of complaints in the future.
> >>> - It seems that the EOL of Java 8 from Oracle is Dec 2030 [2]. A lot of
> >>>  users will still stay on it for a long time. At least this is true for
> >>>  our customers. So I am afraid we may not upgrade to newer versions
> >>>  of Arrow if it no longer supports Java 8.
> >>>
> >>> [1] https://arrow.apache.org/docs/java/install.html#java-compatibility
> >>> [2]
> >>> https://www.oracle.com/java/technologies/java-se-support-roadmap.html
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>>
> >>>
> >>> On Thu, Sep 14, 2023 at 11:14 PM David Dali Susanibar Arce <
> >>> davi.sar...@gmail.com> wrote:
> >>>
> >>> > Hi Arrow Java developers,
> >>> >
> >>> > I would like to propose a timeline for dropping support for Java 8:
> >>> > - Propose to drop JDK8 in Arrow v15 (2 releases from now)
> >>> > - JDK 21 support will be added before removal of JDK8
> >>> >
> >>> > Why?
> >>> > - Java 8 no longer receives Premier Support (1)
> >>> > - Some Arrow Java (test) dependencies have already started to drop
> >>> > Java 8 support, forcing us to pin to older packager versions
> >>> >
> >>> > Also note:
> >>> > - gRPC Java may drop support for a JDK version when that version is
> no
> >>> > longer receiving Premier Support from Oracle (2), more detail at Java
> >>> > 8 / Java 11 support timeline in gRPC here (3)
> >>> > - Spark plans to tentatively drop JDK 8 support in Spark 4.0 (4),
> >>> > which has a release timeline of approximately 2024-06 (5). Is it fine
> >>> > for u

Re: [Java][Discuss]: consensus for JDK 8 deprecation

2023-10-06 Thread Jacob Wujciak-Jens
From a release engineering perspective (without Java knowledge), I agree with
Micah: I'd rather make a patch release for an older version if needed but
modernize the codebase and simplify CI!


On Sat, Oct 7, 2023 at 5:27 AM Micah Kornfield 
wrote:

> I think given the stability of Arrow Java, dropping support probably makes
> sense.  If a bug comes up or consumers really need to new features we can
> always make a patch release of an older version.
>
> On Thu, Oct 5, 2023 at 3:13 PM Dane Pitkin 
> wrote:
>
> > I also learned today that Apache Spark has dropped support for Java 8 and
> > 11 for their next release (v4.0)[1]. Should we consider dropping Java 11
> > as well?
> >
> > [1]https://github.com/apache/spark/pull/43005
> >
> > -Dane
> >
> > On Thu, Oct 5, 2023 at 3:30 PM Dane Pitkin  wrote:
> >
> > > I created a GH issue[1] proposing the removal of Java 8 support. It
> > > would target the Arrow v15 release (~Jan 2024).
> > >
> > > IMO it would be in the best interest of the project for two major
> > > reasons:
> > > 1. Unblock the Java Platform Module System (JPMS)[2] implementation.
> > > 2. Unblock Arrow from upgrading dependencies that no longer support
> > >    Java 8. (See [1] for examples)
> > >
> > > Since Arrow Java has been quite stable, will Java 8 users be okay with
> > > pinning Arrow to the last supported release (v14) if the Arrow project
> > > ultimately decides to remove Java 8 support?
> > >
> > >
> > > [1]https://github.com/apache/arrow/issues/38051
> > > [2]https://en.wikipedia.org/wiki/Java_Platform_Module_System
> > >
> > > -Dane
> > >
> > > On Fri, Sep 15, 2023 at 12:26 PM Dane Pitkin 
> > wrote:
> > >
> > >>> - As a low level library, users have to add specific flags to use
> > >>>  Java 9 and up with Arrow to resolve issues with java.nio. This has
> > >>>  been annoying for our customers constantly. If this is not resolved,
> > >>>  I would say we may see a lot of complaints in the future.
> > >>>
> > >> I filed issue 37739[1] to track this, but it sounds like this can't be
> > >> changed until Java 21 or 24.
> > >>
> > >>> - It seems that the EOL of Java 8 from Oracle is Dec 2030 [2]. A lot of
> > >>>  users will still stay on it for a long time. At least this is true
> > >>>  for our customers. So I am afraid we may not upgrade to newer versions
> > >>>  of Arrow if it no longer supports Java 8.
> > >>>
> > >> Java 8 does have a long Extended Support timeline, but a recent
> > >> report shows Java 11 increasing in adoption vs Java 8. "More than 56%
> > >> of applications are now using Java 11 in production (up from 48% in
> > >> 2022 and 11% in 2020). Java 8 is a close second with nearly 33% of
> > >> applications using it in production (down from 46% in 2022)."[2]
> > >> I expect the Java ecosystem will find a way to move on from Java 8
> > >> much sooner than 2030, meaning many of Arrow's dependencies could drop
> > >> support for Java 8 before then. At this point, Arrow may be forced to
> > >> support a higher minimum Java version.
> > >>
> > >> That being said, it's hard to argue against real use cases. I'd be
> > >> curious to hear what Java version other users of Arrow are using (and
> if
> > >> there is a timeline to upgrade if on Java 8).
> > >>
> > >>
> > >> [1]https://github.com/apache/arrow/issues/37739
> > >> [2] https://newrelic.com/sites/default/files/2023-04/new-relic-2023-state-of-the-java-ecosystem-2023-04-20.pdf
> > >>
> > >>
> > >> -Dane
> > >>
> > >>
> > >> On Thu, Sep 14, 2023 at 11:45 AM Gang Wu  wrote:
> > >>
> > >>> Thanks for bringing this up!
> > >>>
> > >>> I have two concerns about dropping Java 8 support:
> > >>> - As a low level library, users have to add specific flags [1] to use
> > >>>  Java 9 and up with Arrow to resolve issues with java.nio. This has
> > >>>  been annoying for our customers constantly. If this is not resolved,
> > >>>  I would say we may see a lot of complaints in the future.
> > >>> - It seems that the EOL of Java 8 from Oracle is Dec 2030 [2]. A lot of
> > >>>  users will still stay on it for a long time. At least this is true
> > >>>  for our customers. So I am afraid we may not upgrade to newer versions
> > >>>  of Arrow if it no longer supports Java 8.
> > >>>
> > >>> [1] https://arrow.apache.org/docs/java/install.html#java-compatibility
> > >>> [2] https://www.oracle.com/java/technologies/java-se-support-roadmap.html
> > >>>
> > >>> Best,
> > >>> Gang
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Sep 14, 2023 at 11:14 PM David Dali Susanibar Arce <
> > >>> davi.sar...@gmail.com> wrote:
> > >>>
> > >>> > Hi Arrow Java developers,
> > >>> >
> > >>> > I would like to propose a timeline for dropping support for Java 8:
> > >>> > - Propose to drop JDK8 in Arrow v15 (2 releases from now)
> > >>> > - JDK 21 support will be added before removal of JDK8
> > >>> >
> > >>> > Why?
> > >>> > - Java 8 no longer receives Premier Support (1)
> > >>> > - Some Arrow Java (test) dependencies have a