[DISCUSS] Statistics through the C data interface

2024-05-21 Thread Sutou Kouhei
Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
provider, because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

> The ARROW pattern is a reserved namespace for internal
> Arrow use in the custom_metadata fields. For example,
> ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
  .format = "+s",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray, because we can use only strings for metadata
values. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
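
To make the encoding above concrete, here is a minimal sketch in Python (not part of the proposal). It assumes the binary layout that the C data interface documents for ArrowSchema.metadata (an int32 pair count, then length-prefixed keys and values in native byte order). The helper names encode_metadata/decode_metadata are hypothetical, and a ctypes integer stands in for the real statistics ArrowArray.

```python
import ctypes
import struct


def encode_metadata(pairs):
    """Serialize (key, value) byte pairs: int32 pair count, then for each
    pair an int32 length plus raw bytes for the key and for the value.
    Integers use native byte order; there are no NUL terminators."""
    out = struct.pack("=i", len(pairs))
    for key, value in pairs:
        out += struct.pack("=i", len(key)) + key
        out += struct.pack("=i", len(value)) + value
    return out


def decode_metadata(buf):
    """Inverse of encode_metadata; returns a list of (key, value) bytes."""
    (count,) = struct.unpack_from("=i", buf, 0)
    pos = 4
    pairs = []
    for _ in range(count):
        fields = []
        for _ in range(2):  # first the key, then the value
            (length,) = struct.unpack_from("=i", buf, pos)
            pos += 4
            fields.append(buf[pos:pos + length])
            pos += length
        pairs.append(tuple(fields))
    return pairs


# Stand-in for the statistics ArrowArray (assumption: any object with a
# stable address is enough to illustrate the idea).
placeholder = ctypes.c_int64(0)
address = ctypes.addressof(placeholder)

# Provider side: store the address as a base 10 string.
metadata = encode_metadata(
    [(b"ARROW:statistics", str(address).encode("ascii"))]
)

# Consumer side: look up the key and parse the address back.
decoded = dict(decode_metadata(metadata))
recovered = int(decoded[b"ARROW:statistics"].decode("ascii"), 10)
```

A real consumer would cast the recovered address to a struct ArrowArray pointer and must not call its release callback, since the base ArrowSchema owns it.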


ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type              | Comments |
|----------------|-------------------------|----------|
| key            | string not null         | (1)      |
| value          | `VALUE_SCHEMA` not null |          |
| is_approximate | bool not null           | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
   "byte_width" and "distinct_count" but users can also use
   application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                                      | Comments |
|------------|-------------------------------------------------|----------|
| int64      | int64                                           |          |
| uint64     | uint64                                          |          |
| float64    | float64                                         |          |
| value      | The same type as the ArrowSchema it belongs to. | (3)      |

3. If the ArrowSchema's type is string, this type is also string.

   TODO: Is "value" a good name? If we refer to it from the
   top-level statistics schema, we need to use
   "value.value". It's a bit strange...

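As a rough illustration of the schema above, the following pure-Python sketch models the three columns and the dense union layout (a type id and an offset per row, pointing into per-member child arrays). The class names, the use of member position as type id, and the sample statistics are all assumptions made for this example. A real implementation would build an ArrowArray instead.

```python
from dataclasses import dataclass, field

# Members of VALUE_SCHEMA; their position doubles as the type id here
# (an assumption for this sketch).
MEMBERS = ("int64", "uint64", "float64", "value")


@dataclass
class DenseUnion:
    type_ids: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    children: dict = field(default_factory=lambda: {m: [] for m in MEMBERS})

    def append(self, member, value):
        child = self.children[member]
        self.type_ids.append(MEMBERS.index(member))
        self.offsets.append(len(child))  # slot the value will occupy
        child.append(value)

    def get(self, i):
        member = MEMBERS[self.type_ids[i]]
        return member, self.children[member][self.offsets[i]]


@dataclass
class Statistics:
    keys: list = field(default_factory=list)
    values: DenseUnion = field(default_factory=DenseUnion)
    is_approximate: list = field(default_factory=list)

    def append(self, key, member, value, approximate=False):
        self.keys.append(key)
        self.values.append(member, value)
        self.is_approximate.append(approximate)

    def row(self, i):
        return self.keys[i], self.values.get(i), self.is_approximate[i]


# Made-up statistics for a string column such as "column2".
stats = Statistics()
stats.append("distinct_count", "uint64", 3, approximate=True)
stats.append("min", "value", "apple")
stats.append("max", "value", "pear")
```

Here "min" and "max" go into the "value" member because they have the column's own type (string), while "distinct_count" uses the fixed uint64 member. This is the "value.value" path that the TODO above refers to.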

What do you think about this proposal? Could you share your
comments?


Thanks,
-- 
kou


Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-21 Thread David Li
[x] Close the GitHub milestone/project
[x] Add the new release to the Apache Reporter System
[x] Upload source release artifacts to Subversion
[x] Create the final GitHub release
[x] Update website
[x] Upload wheels/sdist to PyPI
[x] Publish Maven packages
[x] Update tags for Go modules
[x] Deploy APT/Yum repositories
[ ] Update R packages
[x] Upload Ruby packages to RubyGems
[x] Upload C#/.NET packages to NuGet
[x] Update conda-forge packages
[x] Announce the new release
[x] Remove old artifacts
[x] Bump versions
[IN PROGRESS] Publish release blog post [1]

@Dewey, I'd appreciate your help as always with the R packages :)

[1]: https://github.com/apache/arrow-site/pull/523

On Tue, May 21, 2024, at 09:00, Sutou Kouhei wrote:
> +1 (binding)
>
> I ran the following on Debian GNU/Linux sid:
>
>   TEST_DEFAULT=0 \
>     TEST_SOURCE=1 \
>     LANG=C \
>     TZ=UTC \
>     JAVA_HOME=/usr/lib/jvm/default-java \
>     dev/release/verify-release-candidate.sh 12 4
>
>   TEST_DEFAULT=0 \
>     TEST_APT=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 12 4
>
>   TEST_DEFAULT=0 \
>     TEST_BINARY=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 12 4
>
>   TEST_DEFAULT=0 \
>     TEST_JARS=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 12 4
>
>   TEST_DEFAULT=0 \
>     TEST_WHEELS=1 \
>     TEST_PYTHON_VERSIONS=3.11 \
>     LANG=C \
>     TZ=UTC \
>     dev/release/verify-release-candidate.sh 12 4
>
>   TEST_DEFAULT=0 \
>     TEST_YUM=1 \
>     LANG=C \
>     dev/release/verify-release-candidate.sh 12 4
>
> with:
>
>   * g++ (Debian 13.2.0-23) 13.2.0
>   * go version go1.22.2 linux/amd64
>   * openjdk version "17.0.11" 2024-04-16
>   * Python 3.11.9
>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>   * R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
>   * Apache Arrow 17.0.0-SNAPSHOT
>
> Note:
>
> I needed to install arrow-glib-devel explicitly to verify
> Yum repository:
>
> 
> diff --git a/dev/release/verify-yum.sh b/dev/release/verify-yum.sh
> index f7f023611..ff30176f1 100755
> --- a/dev/release/verify-yum.sh
> +++ b/dev/release/verify-yum.sh
> @@ -170,6 +170,7 @@ echo "::endgroup::"
> 
>  echo "::group::Test ADBC Arrow GLib"
> 
> +${install_command} --enablerepo=epel arrow-glib-devel
>  ${install_command} --enablerepo=epel adbc-arrow-glib-devel-${package_version}
>  ${install_command} --enablerepo=epel adbc-arrow-glib-doc-${package_version}
> 
> 
>
> adbc-arrow-glib-devel depends on "pkgconfig(arrow-glib)" and
> libarrow-glib-devel provided by EPEL also provides it:
>
> $ sudo dnf repoquery --deplist adbc-arrow-glib-devel-12
> Last metadata expiration check: 2:01:21 ago on Mon May 20 21:17:44 2024.
> package: adbc-arrow-glib-devel-12-1.el9.x86_64
> ...
>   dependency: pkgconfig(arrow-glib)
>    provider: arrow-glib-devel-16.1.0-1.el9.x86_64
>    provider: libarrow-glib-devel-9.0.0-11.el9.x86_64
> ...
>
>
> If I don't install arrow-glib-devel explicitly,
> libarrow-glib-devel may be installed. We may need to add
> "Conflicts: libarrow-glib-devel" to Apache Arrow's
> arrow-glib-devel to resolve this case automatically. Anyway,
> this is not an ADBC problem. So it's not a blocker.
>
>
>
> Thanks,
> -- 
> kou
>
>
> In
>   "[VOTE] Release Apache Arrow ADBC 12 - RC4" on Wed, 15 May 2024 14:00:33 +0900,
>   "David Li" wrote:
>
>> Hello,
>> 
>> I would like to propose the following release candidate (RC4) of Apache 
>> Arrow ADBC version 12. This is a release consisting of 56 resolved GitHub 
>> issues [1].
>> 
>> Please note that the versioning scheme has changed.  This is the 12th 
>> release of ADBC, and so is called version "12".  The subcomponents, however, 
>> are versioned independently:
>> 
>> - C/C++/GLib/Go/Python/Ruby: 1.0.0
>> - C#: 0.12.0
>> - Java: 0.12.0
>> - R: 0.12.0
>> - Rust: 0.12.0
>> 
>> These are the versions you will see in the source and in actual packages.  
>> The next release will be "13", and the subcomponents will increment their 
>> versions independently (to either 1.1.0, 0.13.0, or 1.0.0).  At this point, 
>> there is no plan to release subcomponents independently from the project as 
>> a whole. 
>> 
>> Please note that there is a known issue when using the Flight SQL and 
>> Snowflake drivers at the same time on x86_64 macOS [12].
>> 
>> This release candidate is based on commit: 
>> 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
>> 
>> The source release rc4 is hosted at [3].
>> The binary artifacts are hosted at [4][5][6][7][8].
>> The changelog is located at [9].
>> 
>> Please download, verify checksums and signatures, run the unit tests, and 
>> vote on the release. See [10] for how to validate a release candidate.
>> 
>> See also a verification result on GitHub Actions [11].
>> 
>> The vote will be open for at least 72 hours.
>> 
>> [ ] +1 Release this as Apache Arrow ADBC 12
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow ADBC 12 because...
>> 
>> Note: to verify APT/YUM packages on 

Arrow community meeting May 22 at 16:00 UTC

2024-05-21 Thread Ian Cook
Our next biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00
EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the document to
add the topics that you would like to discuss.

Thanks,
Ian


Re: [DISCUSS] Drop Java 8 support

2024-05-21 Thread Dane Pitkin
I haven't been active in Apache Parquet, but I did not see any prior
discussions on this topic in their Jira or dev mailing list.

Do we think a vote is needed before officially moving forward with Java 8
deprecation?

On Mon, May 20, 2024 at 12:50 PM Laurent Goujon 
wrote:

> I also mentioned Apache Parquet and haven't seen anyone mention if/when
> Apache Parquet would transition.
>
>
>
> On Fri, May 17, 2024 at 9:07 AM Dane Pitkin  wrote:
>
> > Fokko, thank you for these datapoints! It's great to see how other low
> > level Java OSS projects are approaching this.
> >
> > JB, I believe yes we have formal consensus to drop Java 8 in Arrow. There
> > was no contention in current discussions across [GitHub issues | Arrow
> > Mailing List | Community Syncs].
> >
> > We can save Java 11 deprecation for a future discussion. For users on Java
> > 11, I do anticipate this discussion to come shortly after Java 8
> > deprecation is released.
> >
> > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong 
> > wrote:
> >
> > > I was traveling the last few weeks, so just a follow-up from my end.
> > >
> > >> Fokko, can you elaborate on the discussions held in other OSS projects to
> > >> drop Java <17? How did they weigh the benefits/drawbacks for dropping both
> > >> Java 8 and 11 LTS versions? I'd also be curious if other projects plan to
> > >> support older branches with security patches.
> > >
> > >
> > > So, the ones that I'm involved with (including a TLDR):
> > >
> > >    - Avro:
> > >       - (April 2024: Consensus on moving to 11+, +1 for moving to 17+)
> > >       https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > >       - (Jan 2024: Consensus on dropping 8)
> > >       https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > >    - Iceberg:
> > >       - (Jan 2023: Concerns about Hive):
> > >       https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > >       - (Feb 2024: Consensus to drop Hadoop 2.x, and move to JDK11+,
> > >       also +1's for moving to 17+):
> > >       https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > >
> > > I think the most noteworthy (slow-moving in general):
> > >
> > >- Spark 4 supports JDK 17+
> > >- Hive 4 is still on Java 8
> > >
> > >
> > > It looks like most of the projects are looking at each other. Keep in
> > > mind, that projects that still support older versions of Java, can still
> > > use older versions of Arrow.
> > >
> > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > (in case the image doesn't come through, that's Spiderman pointing at
> > > Spiderman)
> > >
> > > Concerning the Java 11 support, some data:
> > >
> > >    - Oracle 11: support until January 2032 (extended fee has been waived)
> > >    - Corretto 11: September 2027
> > >    - Adoptium 11: At least Oct 2027
> > >    - Zulu 11: Jan 2032
> > >    - OpenJDK11: October 2024
> > >
> > > I think it is fair to support 11 for the time being, but at some point, we
> > > also have to move on and start exploiting the new features and make sure
> > > that we keep up to date. For example, Java 8 also has extended support
> > > until 2030. Dependabot on the Iceberg project
> > > <https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3Adependencies>
> > > nicely shows which projects are already at JDK11+ :)
> > >
> > > Thanks Dane for driving this!
> > >
> > > Kind regards,
> > > Fokko
> > >
> > >
> > >
> > >
> > >
> > > On Fri, May 17, 2024 at 07:44, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > >> Hi Dane
> > >>
> > >> Do we have a formal consensus about the Java version with regard to the
> > >> Arrow version?
> > >> I agree with the plan but am just wondering if it's OK with everyone in
> > >> the community.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On Thu, May 16, 2024 at 18:05, Dane Pitkin wrote:
> > >>
> > >> > To wrap up this thread on Java 8 deprecation, here is my current plan of
> > >> > action:
> > >> >
> > >> > 1) Arrow v17 will be the last version supporting Java 8 and the release
> > >> > notes will warn of its impending deprecation.
> > >> > 2) Arrow v18 will be the first release supporting min version Java 11.
> > >> >
> > >> > I have updated the GH issue[1] to reflect this.
> > >> >
> > >> > [1]https://github.com/apache/arrow/issues/38051
> > >> >
> > >> > On Wed, May 8, 2024 at 5:46 PM Dane Pitkin wrote:
> > >> >
> > >> > > Thank you all for your valuable input. The consensus from my understanding
> > >> > > is that dropping Java 8 is not contentious, so we will move forward here.
> > >> > >
> > >> > > We won't drop Java 11 yet, but there's a chance it will happen sooner than
> > >> > > later. I brought up Java 8 & 11 deprecation in the community sync again
> > >> > > today. The summary is that the ASF could be enforcing