Re: [DISCUSS] Drop Java 8 support

2024-05-26 Thread Gang Wu
Hi,

IMHO, Apache Parquet Java [1] cannot drop Java 8 in any 1.x release,
in order to keep maximum backward compatibility. There was a discussion
on the 2.x major release [2] and the v3 format [3]. I think the 2.x
release is a good chance to drop Java 8.

[1] https://github.com/apache/parquet-java
[2] https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o
[3] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo

Best,
Gang

On Fri, May 24, 2024 at 11:47 PM Weston Pace  wrote:

> No vote is required from an ASF perspective (this is not a release)
> No vote is required from Arrow conventions (this is not a spec change and
> does not impact more than one implementation)
>
> I will send a message to the parquet ML to solicit feedback.
>
> On Fri, May 24, 2024 at 8:22 AM Laurent Goujon wrote:
>
> > I would say so because it is akin to removing a large feature but maybe
> > some PMC can chime in?
> >
> > Laurent
> >
> > On Tue, May 21, 2024 at 12:16 PM Dane Pitkin  wrote:
> >
> > > I haven't been active in Apache Parquet, but I did not see any prior
> > > discussions on this topic in their Jira or dev mailing list.
> > >
> > > Do we think a vote is needed before officially moving forward with Java 8
> > > deprecation?
> > >
> > > On Mon, May 20, 2024 at 12:50 PM Laurent Goujon wrote:
> > >
> > > > I also mentioned Apache Parquet and haven't seen anyone mention if/when
> > > > Apache Parquet would transition.
> > > >
> > > >
> > > >
> > > > On Fri, May 17, 2024 at 9:07 AM Dane Pitkin wrote:
> > > >
> > > > > Fokko, thank you for these datapoints! It's great to see how other
> > > > > low-level Java OSS projects are approaching this.
> > > > >
> > > > > JB, I believe yes we have formal consensus to drop Java 8 in Arrow.
> > > > > There was no contention in current discussions across [GitHub issues |
> > > > > Arrow Mailing List | Community Syncs].
> > > > >
> > > > > We can save Java 11 deprecation for a future discussion. For users on
> > > > > Java 11, I do anticipate this discussion to come shortly after Java 8
> > > > > deprecation is released.
> > > > >
> > > > > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong <fo...@apache.org> wrote:
> > > > >
> > > > > > I was traveling the last few weeks, so just a follow-up from my end.
> > > > > >
> > > > > >> Fokko, can you elaborate on the discussions held in other OSS projects
> > > > > >> to drop Java <17? How did they weigh the benefits/drawbacks for dropping
> > > > > >> both Java 8 and 11 LTS versions? I'd also be curious if other projects
> > > > > >> plan to support older branches with security patches.
> > > > > >
> > > > > >
> > > > > > So, the ones that I'm involved with (including a TLDR):
> > > > > >
> > > > > > - Avro:
> > > > > >   - (April 2024: Consensus on moving to 11+, +1 for moving to 17+)
> > > > > >     https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > > > > >   - (Jan 2024: Consensus on dropping 8)
> > > > > >     https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > > > > > - Iceberg:
> > > > > >   - (Jan 2023: Concerns about Hive):
> > > > > >     https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > > > > >   - (Feb 2024: Consensus to drop Hadoop 2.x, and move to JDK11+,
> > > > > >     also +1's for moving to 17+):
> > > > > >     https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > > > > >
> > > > > > I think the most noteworthy (slow-moving in general):
> > > > > >
> > > > > >- Spark 4 supports JDK 17+
> > > > > >- Hive 4 is still on Java 8
> > > > > >
> > > > > >
> > > > > > It looks like most of the projects are looking at each other. Keep in
> > > > > > mind that projects that still support older versions of Java can
> > > > > > still use older versions of Arrow.
> > > > > >
> > > > > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > > > > (in case the image doesn't come through, that's Spiderman pointing at
> > > > > > Spiderman)
> > > > > >
> > > > > > Concerning the Java 11 support, some data:
> > > > > >
> > > > > > - Oracle 11: support until January 2032 (extended fee has been waived)
> > > > > > - Corretto 11: September 2027
> > > > > > - Adoptium 11: at least October 2027
> > > > > > - Zulu 11: January 2032
> > > > > > - OpenJDK 11: October 2024
> > > > > >
> > > > > > I think it is fair to support 11 for the time being, but at some point,
> > > > > > we also have to move on and start exploiting the new features and make
> > > > > > sure that we keep up to date. For example, Java 8 also has extended
> > > > > > support until 2030. Dependabot on the Iceberg project
> > > > > > <

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).

Good point. I think that a process that unifies schemas
should remove (or merge, if possible) statistics metadata. If
we standardize statistics, the process can do that. For
example, if we use "ARROW:statistics" for statistics, the
unification process can always remove the "ARROW:statistics"
metadata.
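
As a rough illustration (assuming pyarrow and the proposed
"ARROW:statistics" key; this is only a sketch, not an existing
API), a unification step could strip the key like this:

  import pyarrow as pa

  # Sketch: drop the proposed "ARROW:statistics" key during schema
  # unification so per-file statistics don't get attached to the
  # whole dataset.
  def strip_statistics(schema: pa.Schema) -> pa.Schema:
      fields = []
      for field in schema:
          metadata = dict(field.metadata or {})
          metadata.pop(b"ARROW:statistics", None)
          fields.append(field.with_metadata(metadata) if metadata
                        else field.remove_metadata())
      metadata = dict(schema.metadata or {})
      metadata.pop(b"ARROW:statistics", None)
      unified = pa.schema(fields)
      return unified.with_metadata(metadata) if metadata else unified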


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
15:14:49 -0300,
  Dewey Dunnington  wrote:

> Thanks Shoumyo for bringing this up!
> 
> Using a schema to transmit statistics/data-dependent values is also
> something we do in GeoParquet (whose schema also finds its way into
> pyarrow and the C data interface when reading). It is usually fine but
> occasionally ends up with schema metadata that is lying (e.g., when
> unifying schemas from multiple files in a dataset, I believe pyarrow
> will sometimes assign metadata from one file to the entire dataset
> and/or propagate it through projections/filters).
> 
> I imagine statistics would be opt-in (i.e., a consumer would have to
> explicitly request them), in which case that consumer could possibly
> be required to remove them. With the custom format string that was
> proposed I think this is unlikely to happen; however, that a consumer
> might want to know statistics over IPC too is an excellent point.
> 
>> Unless there are other ways of producing stream-level application metadata 
>> outside of the schema/field metadata
> 
> Technically there is message-level metadata in the IPC flatbuffers,
> although I don't believe it is accessible from most IPC readers. That
> mechanism isn't available from an ArrowArrayStream and so it might not
> help with the specific case at hand.
> 
>> nowhere is it mentioned that metadata must be used to determine schema 
>> equivalence
> 
> I am only familiar with a few implementations, but at least Arrow C++
> and nanoarrow have options to ignore metadata and/or nullability
> and/or possibly field names (e.g., for a list type) depending on what
> type of type/schema equivalence is required.
> 
>> use cases where you want to know the schema *before* the data is produced.
> 
> I may be understanding it incorrectly, but I think it's generally
> possible to emit a schema with metadata before emitting record
> batches. I suppose you would have already started downloading the
> stream, though.
> 
>> I think what we are slowly converging on is the need for a spec to
>> describe the encoding of Arrow array statistics as Arrow arrays.
> 
> +1 (this will be helpful however we decide to transmit statistics)
> 
> On Thu, May 23, 2024 at 1:57 PM Antoine Pitrou  wrote:
>>
>>
>> Hi Shoumyo,
>>
>> The problem with communicating data statistics through schema metadata
>> is that it's not compatible with use cases where you want to know the
>> schema *before* the data is produced.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Thu, 23 May 2024 14:28:43 -
>> "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"
>>  wrote:
>> > This is a really exciting development, thank you for putting together this 
>> > proposal!
>> >
>> > It looks like this thread and the linked GitHub issue have lots of input
>> > from folks who work with Arrow at a low level and have better familiarity 
>> > with the Arrow specifications than I do, so I'll refrain from commenting 
>> > on the technicalities of the proposal. I would, however, like to share my 
>> > perspective as an application developer that heavily uses Arrow at higher 
>> > levels for composing data systems.
>> >
>> > My main concern with the direction of this proposal is that it seems too 
>> > narrowly focused on what the integration with DuckDB will look like (how 
>> > the statistics can be fed into DuckDB). In many applications, executing 
>> > the query is often the "last mile", and it's important to consider where 
>> > the statistics will actually come from. To start, data might be sourced in 
>> > various manners:
>> >
>> > - Arrow IPC files may be mapped from shared memory
>> > - Arrow IPC streams may be received via some RPC framework (à la Flight)
>> > - The Arrow libraries may be used to read from file formats like Parquet 
>> > or CSV
>> > - ADBC drivers may be used to read from databases
>> >
>> > Note that in at least the first two cases, the system _executing the 
>> > query_ will not be able to provide statistics simply because it is not 
>> > actually the data producer. As an example, if Process A writes an Arrow 
>> > IPC file to shared memory, and Process B wants to run a query on it -- how 
>> > is Process B supposed to get the statistics for query planning? There are 
>> > a few approaches that I anticipate application 

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases

Thanks for listing it.

Regarding the first case:

Using schema metadata may be a reasonable approach because
the Arrow data will already be in the page cache, so there is
no significant read cost. We don't need to read statistics
before the Arrow data is ready.

But if the Arrow data will not be produced based on its own
statistics, a separate statistics getter API may be better.

Regarding the second case:

Schema metadata is one approach, but we can choose other
approaches for this case. For example, Flight has
FlightData::app_metadata[1] and the Arrow IPC message has
custom_metadata[2], as Dewey mentioned.

[1] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Flight.proto#L512-L515
[2] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Message.fbs#L154
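
For example, on the consumer side (a rough pyarrow.flight
sketch; the endpoint and ticket are made up, and it only shows
where per-message app_metadata would surface):

  import pyarrow.flight as flight

  # Hypothetical endpoint/ticket; only the app_metadata plumbing matters.
  client = flight.connect("grpc://localhost:8815")
  reader = client.do_get(flight.Ticket(b"example"))
  while True:
      try:
          chunk = reader.read_chunk()
      except StopIteration:
          break
      if chunk.app_metadata is not None:
          print("per-message metadata:", chunk.app_metadata.to_pybytes())
      if chunk.data is not None:
          print("batch rows:", chunk.data.num_rows)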

Regarding the third case:

Reader objects will provide statistics. For example,
parquet::ColumnChunkMetaData::statistics()
(parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics())
will provide them.
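
A short pyarrow sketch of the same idea (the file name and
indices are placeholders):

  import pyarrow.parquet as pq

  # Per-row-group, per-column statistics exposed by the Parquet reader.
  metadata = pq.ParquetFile("example.parquet").metadata
  stats = metadata.row_group(0).column(0).statistics
  if stats is not None and stats.has_min_max:
      print(stats.min, stats.max, stats.null_count)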

Regarding the fourth case:

We can use the ADBC API.


Based on this list, how about standardizing both of the
following for statistics?

1. An Apache Arrow schema for statistics that is used by a
   separate statistics getter API
2. An "ARROW:statistics" metadata format that can be used in
   Apache Arrow schema metadata

Users can use 1. and/or 2. based on their use cases.

Regarding 2.: How about the following?

This uses Field::custom_metadata[3] and
Schema::custom_metadata[4].

[3] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L528-L529
[4] 
https://github.com/apache/arrow/blob/1c9e393b73195840960dfb9eca8c0dc390be751a/format/Schema.fbs#L563-L564

"ARROW:statistics" in Field::custom_metadata represents
column-level statistics. It uses JSON like we did for
"ARROW:extension:metadata"[5]. Here is an example:

  Field {
custom_metadata: {
  "ARROW:statistics" => "{\"max\": 1, \"distinct_count\": 29}"
}
  }

(JSON may not be able to represent complex information, but
is that needed for statistics?)

"ARROW:statistics" in Schema::custom_metadata represents
table-level statistics. It uses JSON like we did for
"ARROW:extension:metadata"[5]. Here is an example:

  Schema {
custom_metadata: {
  "ARROW:statistics" => "{\"row_count\": 29}"
}
  }
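
For illustration, a pyarrow sketch that attaches both levels
(the key name and JSON payloads are only this proposal's
examples, not an existing convention):

  import json
  import pyarrow as pa

  # Column-level statistics on the field, table-level on the schema.
  field = pa.field("a", pa.int64()).with_metadata(
      {"ARROW:statistics": json.dumps({"max": 1, "distinct_count": 29})})
  schema = pa.schema([field]).with_metadata(
      {"ARROW:statistics": json.dumps({"row_count": 29})})
  print(schema.field("a").metadata)
  print(schema.metadata)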

TODO: Define the JSON content details. For example, we need
to define keys such as "distinct_count" and "row_count".


[5] 
https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types



Thanks,
-- 
kou

In <664f529b0002a8710c430...@message.bloomberg.net>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
14:28:43 -,
  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  
wrote:

> This is a really exciting development, thank you for putting together this 
> proposal!
> 
> It looks like this thread and the linked GitHub issue have lots of input from
> folks who work with Arrow at a low level and have better familiarity with the 
> Arrow specifications than I do, so I'll refrain from commenting on the 
> technicalities of the proposal. I would, however, like to share my 
> perspective as an application developer that heavily uses Arrow at higher 
> levels for composing data systems.
> 
> My main concern with the direction of this proposal is that it seems too 
> narrowly focused on what the integration with DuckDB will look like (how the 
> statistics can be fed into DuckDB). In many applications, executing the query 
> is often the "last mile", and it's important to consider where the statistics 
> will actually come from. To start, data might be sourced in various manners:
> 
> - Arrow IPC files may be mapped from shared memory
> - Arrow IPC streams may be received via some RPC framework (à la Flight)
> - The Arrow libraries may be used to read from file formats like Parquet or 
> CSV
> - ADBC drivers may be used to read from databases
> 
> Note that in at least the first two cases, the system _executing the query_ 
> will not be able to provide statistics simply because it is not actually the 
> data producer. As an example, if Process A writes an Arrow IPC file to shared 
> memory, and Process B wants to run a query on it -- how is Process B supposed 
> to get the statistics for query planning? There are a few approaches that I 
> anticipate application developers might consider:
> 
> 1. Design an out-of-band mechanism for Process B to fetch statistics from 
> Process A.
> 2. Design an encoding that is a superset of Arrow IPC and includes statistics 
> information, allowing statistics to be communicated in-band.
> 

Re: [DISCUSS] Statistics through the C data interface

2024-05-26 Thread Sutou Kouhei
Hi,

> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.

Does this mean that we will need C Data Interface +
statistics + XXX + ... for query planning and so on?

Or does this mean that an ADBC-like statistics schema will
not be able to cover use cases such as query planning?

If it means the former, can we provide an extra mechanism at
that time?

If it means the latter, how about adding a version to the
statistics schema? For example, we can add
'"ARROW:statistics:version" => "1.0.0"' metadata to the
statistics schema. We can define statistics schema 2.0.0
when we find a use case that isn't covered by statistics
schema 1.0.0. It doesn't break existing code because we can
use both statistics schema 1.0.0 and 2.0.0 at the same time.
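
For example (a pyarrow sketch assuming the column/key layout
proposed earlier in this thread; VALUE_SCHEMA is simplified to
float64 here):

  import pyarrow as pa

  # A versioned statistics schema; 2.0.0 could be added later without
  # breaking consumers that only understand 1.0.0.
  statistics_schema = pa.schema([
      pa.field("column_name", pa.utf8()),
      pa.field("statistic_key", pa.utf8(), nullable=False),
      pa.field("statistic_value", pa.float64(), nullable=False),
      pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
  ]).with_metadata({"ARROW:statistics:version": "1.0.0"})
  print(statistics_schema.metadata)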


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
11:09:07 -0300,
  Felipe Oliveira Carvalho  wrote:

> I want to +1 what Dewey is saying here and add some comments.
> 
> Sutou Kouhei wrote:
>> ADBC may be a bit larger to use only for transmitting statistics. ADBC has 
>> statistics related APIs but it has more other APIs.
> 
> It's impossible to keep the responsibility of communication protocols
> cleanly separated, but IMO, we should strive to keep the C Data
> Interface more of a Transport Protocol than an Application Protocol.
> 
> Statistics are application dependent and can complicate the
> implementation of importers/exporters which would hinder the adoption
> of the C Data Interface. Statistics also bring in security concerns
> that are application-specific. e.g. can an algorithm trust min/max
> stats and risk producing incorrect results if the statistics are
> incorrect? A question that can't really be answered at the C Data
> Interface level.
> 
> The need for more sophisticated statistics only grows with time, so
> there is no such thing as a "simple statistics schema".
> 
> Protocols that produce/consume statistics might want to use the C Data
> Interface as a primitive for passing Arrow arrays of statistics.
> 
> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.
> 
> --
> Felipe
> 
> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>  wrote:
>>
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow why
>> we are constrained to the ArrowSchema to represent them. The examples
>> given seem to be going through Python... would it be easier to request
>> statistics at a higher level of abstraction? There would already need
>> to be a separate mechanism to request an ArrowArrayStream with
>> statistics (unless the PyCapsule `requested_schema` argument would
>> suffice).
>>
>> > ADBC may be a bit larger to use only for transmitting
>> > statistics. ADBC has statistics related APIs but it has more
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> can implement an ADBC driver without defining all the methods (where
>> the producer could call AdbcConnectionGetStatistics(), although
>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>> exist). One example listed (using an Arrow Table as a source) seems a
>> bit light to wrap in an ADBC driver; however, it would not take much
>> code to do so, and the overhead of getting the reader via ADBC is
>> something like 100 microseconds (tested via the ADBC R package's
>> "monkey driver" which wraps an existing stream as a statement). In any
>> case, the bulk of the code is building the statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format for statistics is decided on, I imagine it should be
>> exactly the same as the ADBC standard? (Perhaps pushing changes
>> upstream if needed?).
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use the approach because all
>> > feedback said so. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Field Name   | Field Type| Comments |
>> > |--|---|  |
>> > | column_name  | utf8  | (1)  |
>> > | statistic_key| utf8 not null | (2)  |
>> > | statistic_value  | VALUE_SCHEMA not null |  |
>> > | statistic_is_approximate | bool not null | (3)  |
>> >
>> > 1. If null, then the statistic applies to the entire table.
>> >It's for "row_count".
>> > 2.