Re: Unsupported/Other Type

2024-05-24 Thread David Li
I've put up a draft PR here: https://github.com/apache/arrow/pull/41823
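
For anyone who wants a feel for the idea before reviewing, here is a rough
pyarrow sketch of an extension type with null storage. The class and extension
name below are illustrative placeholders, not necessarily what the PR defines:

    import pyarrow as pa

    class UnknownType(pa.ExtensionType):
        """Placeholder extension type for columns a producer cannot convert."""

        def __init__(self, type_name=""):
            # Remember the producer-specific type name (e.g. a PostgreSQL type
            # name) so consumers can at least report what was unsupported.
            self._type_name = type_name
            # Null storage: there is no usable value representation.
            super().__init__(pa.null(), "arrow.example.unknown")

        def __arrow_ext_serialize__(self):
            return self._type_name.encode("utf-8")

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls(serialized.decode("utf-8"))

    # A driver could then emit a field like this for an unconvertible column:
    field = pa.field("some_column", UnknownType("pg_catalog.tsvector"))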

On Wed, Apr 17, 2024, at 23:34, David Li wrote:
> Yes, this would be for an extension type. 
>
> On Wed, Apr 17, 2024, at 23:25, Weston Pace wrote:
>>> people generally find use in Arrow schemas independently of concrete data.
>>
>> This makes sense.  I think we do want to encourage use of Arrow as a "type
>> system" even if there is no data involved.  And, given that we cannot
>> easily change a field's data type property to "optional", it makes sense to
>> use a dedicated type, and so I would be in favor of such a proposal (we
>> may eventually add an "unknown type" concept in Substrait as well, it's
>> come up several times, and so we could use this in that context).
>>
>> I think that I would still prefer a canonical extension type (with storage
>> type null) over a new dedicated type.
>>
>> On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou  wrote:
>>
>>>
>>> Ah! Well, I think this could be an interesting proposal, but someone
>>> should put a more formal proposal, perhaps as a draft PR.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> Le 17/04/2024 à 11:57, David Li a écrit :
>>> > For an unsupported/other extension type.
>>> >
>>> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>>> >> What is "this proposal"?
>>> >>
>>> >>
>>> >> Le 17/04/2024 à 10:38, David Li a écrit :
>>> >>> Should I take it that this proposal is dead in the water? While we
>>> could define our own Unknown/Other type for, say, the ADBC PostgreSQL driver,
>>> it might be useful to have a singular type for consumers to latch on to.
>>> >>>
>>> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>>>  I think an "Other" extension type is slightly different than an
>>>  arbitrary extension type, though: the latter may be understood
>>>  downstream but the former represents a point at which a component
>>>  explicitly declares it does not know how to handle a field. In this
>>>  example, the PostgreSQL ADBC driver might be able to provide a
>>>  representation regardless, but a different driver (or say, the JDBC
>>>  adapter, which cannot necessarily get a bytestring for an arbitrary
>>>  JDBC type) may want an Other type to signal that it would fail if
>>> asked
>>>  to provide particular columns.
>>> 
>>>  On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>>> > Depending where your Arrow-encoded data is used, either extension
>>> > types or generic field metadata are options. We have this problem in
>>> > the ADBC Postgres driver, where we can convert *most* Postgres types
>>> > to an Arrow type but there are some others where we can't or don't
>>> > know or don't implement a conversion. Currently for these we return
>>> > opaque binary (the Postgres COPY representation of the value) but put
>>> > field metadata so that a consumer can implement a workaround for an
>>> > unsupported type. It would be arguably better to have implemented
>>> this
>>> > as an extension type; however, field metadata felt like less of a
>>> > commitment when I first worked on this.
>>> >
>>> > Cheers,
>>> >
>>> > -dewey
>>> >
>>> > On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>>> >  wrote:
>>> >>
>>> >> I was using UUID as an example. It looks like extension types
>>> cover my original request.
>>> >> 
>>> >> From: Felipe Oliveira Carvalho 
>>> >> Sent: Thursday, April 11, 2024 7:15 AM
>>> >> To: dev@arrow.apache.org 
>>> >> Subject: Re: Unsupported/Other Type
>>> >>
>>> >> The OP used UUID as an example. Would that be enough, or is the request
>>> >> for a flexible mechanism that allows the creation of one-off nominal
>>> >> types for very specific use-cases?
>>> >>
>>> >> —
>>> >> Felipe
>>> >>
>>> >> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 
>>> wrote:
>>> >>
>>> >>>
>>> >>> Yes, JSON and UUID are obvious candidates for new canonical
>>> extension
>>> >>> types. XML also comes to mind, but I'm not sure there's much of a
>>> use
>>> >>> case for it.
>>> >>>
>>> >>> Regards
>>> >>>
>>> >>> Antoine.
>>> >>>
>>> >>>
>>> >>> Le 10/04/2024 à 22:55, Wes McKinney a écrit :
>>>  In the past we have discussed adding a canonical type for UUID
>>> and JSON.
>>> >>> I
>>>  still think this is a good idea and could improve ergonomics in
>>> >>> downstream
>>>  language bindings (e.g. by exposing JSON querying functions or
>>> >>> automatically
>>>  boxing UUIDs in built-in UUID types, like the Python uuid
>>> library). Has
>>>  anyone done any work on this to anyone's knowledge?
>>> 
>>>  On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <
>>> emkornfi...@gmail.com>
>>>  wrote:
>>> 
>>> > Hi Norman,
>>> > Arrow has a concept of exten

Policy on access to ursacomputing/crossbow?

2024-05-24 Thread Jonathan Keane
Over my time with the project I've had access to the github repository
ursacomputing/crossbow to be able to manually trigger crossbow jobs. I find
it incredibly helpful when working on the extended R CI to be able to
iterate more quickly than waiting for the comment bot.

But over the time I've used it, I've been removed and then had to ask to be
re-added to the organization at least twice now.

I was recently (15 May) removed from the organization once again. One: is it
possible to be added back to the repository? And two: what is the policy around
who has access and when they get removed?

-Jon


[Javascript] Issue regarding reading multiple record batches

2024-05-24 Thread Po-Wei Huang
Hello,
Is there anyone in the community who could help review the issue below?
We found an issue that happens when reading multiple record batches from
JavaScript.
When delta dictionaries are disabled, the dictionary should be re-fetched, but
it is not for the second record batch.
We could probably contribute the bug fix back.

Thanks,
Po-wei


[JS] Dictionary encoded values repeating between record batches · Issue #41683 · apache/arrow (github.com)


Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Jacob Wujciak
Great idea, thanks Ian!

On 2024/05/24 16:40:17 Ian Cook wrote:
> > Is it somehow possible to be a "member" of this account to indicate that
> > we have PMC status, or is that not possible within the LinkedIn
> > membership/permissions model?
> 
> Yes, Arrow Committers and PMC Members may wish to add Apache Arrow to the
> Experience section of their LinkedIn profiles.
> 
> To do this you must fill in three required fields:
> 1 - position
> 2 - month of start date
> 3 - year of start date
> 
> If you wish to do this, I suggest filling the position field with
> "Committer", "PMC Member", or "PMC Chair" (whichever is appropriate) and
> filling the start date according to when your membership was announced on
> the dev mailing list. In the case of PMC Members or Chairs, you can add
> multiple entries (some with end dates) to show your different positions
> over time.
> 
> Those who add Apache Arrow to the Experience section of their LinkedIn
> profiles will appear in the list of people affiliated with Apache Arrow at
> https://www.linkedin.com/company/apache-arrow/people/.
> 
> I intend to occasionally check that list to ensure no one is misrepresenting
> themselves. LinkedIn allows company admins to report such misrepresentations
> [10].
> 
> Thank you,
> Ian
> 
> [10] https://www.linkedin.com/help/linkedin/answer/a540867
> 
> On Fri, May 24, 2024 at 12:35 PM Antoine Pitrou  wrote:
> 
> >
> > Is it somehow possible to be a "member" of this account to indicate that
> > we have PMC status, or is that not possible within the LinkedIn
> > membership/permissions model?
> >
> >
> >
> > Le 24/05/2024 à 18:04, Ian Cook a écrit :
> > > Following the discussion [1] earlier this year about the status of the
> > > Apache Arrow Twitter / X account [2], I have seen several news stories
> > > citing declines in use of X and increases in use of LinkedIn (for example
> > > [3]). Anecdotally I have seen that the types of conversations about open
> > > source data technologies that in the past would have taken place on
> > Twitter
> > > are now often taking place on LinkedIn.
> > >
> > > Many other ASF projects now have LinkedIn pages ([4][5][6][7] to name a
> > > few). LinkedIn does not provide a type of account or page specifically
> > for
> > > open source projects, but the norm among other ASF projects is to use
> > > LinkedIn company pages. It appears that a company page is the only
> > suitable
> > > type of LinkedIn page for this purpose.
> > >
> > > After reviewing the ASF best practices for project social media accounts
> > > [8], I created a LinkedIn company page for Apache Arrow [9]. I configured
> > > the page so that the hyphen in the URL is optional; the
> > > unhyphenated version redirects to the hyphenated version.
> > >
> > > Permissions for LinkedIn company pages are managed by assigning roles for
> > > the page to one or more LinkedIn user accounts. There are no credentials
> > to
> > > be shared. I assigned Raúl Cumplido's LinkedIn account as superadmin user
> > > for this page. Raúl is a member of the PMC and is the person who
> > currently
> > > manages our X account. My LinkedIn account is secure and I have verified
> > > with Raúl that his LinkedIn account is secure (strong password used, MFA
> > > enabled).
> > >
> > > As discussed in [1], this page could be used to cross-post the same
> > content
> > > that is currently being posted to X. For now this can be done manually;
> > > perhaps later we can use a syndication tool. This page also gives
> > LinkedIn
> > > users the ability to tag "@Apache Arrow" in their LinkedIn posts,
> > comments,
> > > and articles.
> > >
> > > I invite further discussion on the use of this LinkedIn page.
> > >
> > > Thank you,
> > > Ian
> > >
> > > [1] https://lists.apache.org/thread/nyoq9t114t2ko8f3jflzn9dzcsljvptq
> > > [2] https://x.com/ApacheArrow
> > > [3]
> > >
> > https://techcrunch.com/2024/03/28/linkedin-targets-users-caught-between-tiktok-and-twitter-world-games-videos/
> > > [4] https://www.linkedin.com/company/apacheiceberg/
> > > [5] https://www.linkedin.com/company/apache-hudi/
> > > [6] https://www.linkedin.com/company/apachespark/
> > > [7] https://www.linkedin.com/company/doris-apache/
> > > [8] https://www.apache.org/foundation/marks/socialmedia
> > > [9] https://www.linkedin.com/company/apache-arrow
> > >
> >
> 


Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Ian Cook
> Is it somehow possible to be a "member" of this account to indicate that
> we have PMC status, or is that not possible within the LinkedIn
> membership/permissions model?

Yes, Arrow Committers and PMC Members may wish to add Apache Arrow to the
Experience section of their LinkedIn profiles.

To do this you must fill in three required fields:
1 - position
2 - month of start date
3 - year of start date

If you wish to do this, I suggest filling the position field with
"Committer", "PMC Member", or "PMC Chair" (whichever is appropriate) and
filling the start date according to when your membership was announced on
the dev mailing list. In the case of PMC Members or Chairs, you can add
multiple entries (some with end dates) to show your different positions
over time.

Those who add Apache Arrow to the Experience section of their LinkedIn
profiles will appear in the list of people affiliated with Apache Arrow at
https://www.linkedin.com/company/apache-arrow/people/.

I intend to occasionally check that list to ensure no one is misrepresenting
themselves. LinkedIn allows company admins to report such misrepresentations
[10].

Thank you,
Ian

[10] https://www.linkedin.com/help/linkedin/answer/a540867

On Fri, May 24, 2024 at 12:35 PM Antoine Pitrou  wrote:

>
> Is it somehow possible to be a "member" of this account to indicate that
> we have PMC status, or is that not possible within the LinkedIn
> membership/permissions model?
>
>
>
> Le 24/05/2024 à 18:04, Ian Cook a écrit :
> > Following the discussion [1] earlier this year about the status of the
> > Apache Arrow Twitter / X account [2], I have seen several news stories
> > citing declines in use of X and increases in use of LinkedIn (for example
> > [3]). Anecdotally I have seen that the types of conversations about open
> > source data technologies that in the past would have taken place on
> Twitter
> > are now often taking place on LinkedIn.
> >
> > Many other ASF projects now have LinkedIn pages ([4][5][6][7] to name a
> > few). LinkedIn does not provide a type of account or page specifically
> for
> > open source projects, but the norm among other ASF projects is to use
> > LinkedIn company pages. It appears that a company page is the only
> suitable
> > type of LinkedIn page for this purpose.
> >
> > After reviewing the ASF best practices for project social media accounts
> > [8], I created a LinkedIn company page for Apache Arrow [9]. I configured
> > the page so that the hyphen in the URL is optional; the
> > unhyphenated version redirects to the hyphenated version.
> >
> > Permissions for LinkedIn company pages are managed by assigning roles for
> > the page to one or more LinkedIn user accounts. There are no credentials
> to
> > be shared. I assigned Raúl Cumplido's LinkedIn account as superadmin user
> > for this page. Raúl is a member of the PMC and is the person who
> currently
> > manages our X account. My LinkedIn account is secure and I have verified
> > with Raúl that his LinkedIn account is secure (strong password used, MFA
> > enabled).
> >
> > As discussed in [1], this page could be used to cross-post the same
> content
> > that is currently being posted to X. For now this can be done manually;
> > perhaps later we can use a syndication tool. This page also gives
> LinkedIn
> > users the ability to tag "@Apache Arrow" in their LinkedIn posts,
> comments,
> > and articles.
> >
> > I invite further discussion on the use of this LinkedIn page.
> >
> > Thank you,
> > Ian
> >
> > [1] https://lists.apache.org/thread/nyoq9t114t2ko8f3jflzn9dzcsljvptq
> > [2] https://x.com/ApacheArrow
> > [3]
> >
> https://techcrunch.com/2024/03/28/linkedin-targets-users-caught-between-tiktok-and-twitter-world-games-videos/
> > [4] https://www.linkedin.com/company/apacheiceberg/
> > [5] https://www.linkedin.com/company/apache-hudi/
> > [6] https://www.linkedin.com/company/apachespark/
> > [7] https://www.linkedin.com/company/doris-apache/
> > [8] https://www.apache.org/foundation/marks/socialmedia
> > [9] https://www.linkedin.com/company/apache-arrow
> >
>


Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Antoine Pitrou



Is it somehow possible to be a "member" of this account to indicate that 
we have PMC status, or is that not possible within the LinkedIn 
membership/permissions model?




Le 24/05/2024 à 18:04, Ian Cook a écrit :

Following the discussion [1] earlier this year about the status of the
Apache Arrow Twitter / X account [2], I have seen several news stories
citing declines in use of X and increases in use of LinkedIn (for example
[3]). Anecdotally I have seen that the types of conversations about open
source data technologies that in the past would have taken place on Twitter
are now often taking place on LinkedIn.

Many other ASF projects now have LinkedIn pages ([4][5][6][7] to name a
few). LinkedIn does not provide a type of account or page specifically for
open source projects, but the norm among other ASF projects is to use
LinkedIn company pages. It appears that a company page is the only suitable
type of LinkedIn page for this purpose.

After reviewing the ASF best practices for project social media accounts
[8], I created a LinkedIn company page for Apache Arrow [9]. I configured
the page so that the hyphen in the URL is optional; the
unhyphenated version redirects to the hyphenated version.

Permissions for LinkedIn company pages are managed by assigning roles for
the page to one or more LinkedIn user accounts. There are no credentials to
be shared. I assigned Raúl Cumplido's LinkedIn account as superadmin user
for this page. Raúl is a member of the PMC and is the person who currently
manages our X account. My LinkedIn account is secure and I have verified
with Raúl that his LinkedIn account is secure (strong password used, MFA
enabled).

As discussed in [1], this page could be used to cross-post the same content
that is currently being posted to X. For now this can be done manually;
perhaps later we can use a syndication tool. This page also gives LinkedIn
users the ability to tag "@Apache Arrow" in their LinkedIn posts, comments,
and articles.

I invite further discussion on the use of this LinkedIn page.

Thank you,
Ian

[1] https://lists.apache.org/thread/nyoq9t114t2ko8f3jflzn9dzcsljvptq
[2] https://x.com/ApacheArrow
[3]
https://techcrunch.com/2024/03/28/linkedin-targets-users-caught-between-tiktok-and-twitter-world-games-videos/
[4] https://www.linkedin.com/company/apacheiceberg/
[5] https://www.linkedin.com/company/apache-hudi/
[6] https://www.linkedin.com/company/apachespark/
[7] https://www.linkedin.com/company/doris-apache/
[8] https://www.apache.org/foundation/marks/socialmedia
[9] https://www.linkedin.com/company/apache-arrow



Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Andy Grove
Thanks, Ian. I think this is a great idea.

On Fri, May 24, 2024 at 10:05 AM Ian Cook  wrote:

> Following the discussion [1] earlier this year about the status of the
> Apache Arrow Twitter / X account [2], I have seen several news stories
> citing declines in use of X and increases in use of LinkedIn (for example
> [3]). Anecdotally I have seen that the types of conversations about open
> source data technologies that in the past would have taken place on Twitter
> are now often taking place on LinkedIn.
>
> Many other ASF projects now have LinkedIn pages ([4][5][6][7] to name a
> few). LinkedIn does not provide a type of account or page specifically for
> open source projects, but the norm among other ASF projects is to use
> LinkedIn company pages. It appears that a company page is the only suitable
> type of LinkedIn page for this purpose.
>
> After reviewing the ASF best practices for project social media accounts
> [8], I created a LinkedIn company page for Apache Arrow [9]. I configured
> the page so that the hyphen in the URL is optional; the
> unhyphenated version redirects to the hyphenated version.
>
> Permissions for LinkedIn company pages are managed by assigning roles for
> the page to one or more LinkedIn user accounts. There are no credentials to
> be shared. I assigned Raúl Cumplido's LinkedIn account as superadmin user
> for this page. Raúl is a member of the PMC and is the person who currently
> manages our X account. My LinkedIn account is secure and I have verified
> with Raúl that his LinkedIn account is secure (strong password used, MFA
> enabled).
>
> As discussed in [1], this page could be used to cross-post the same content
> that is currently being posted to X. For now this can be done manually;
> perhaps later we can use a syndication tool. This page also gives LinkedIn
> users the ability to tag "@Apache Arrow" in their LinkedIn posts, comments,
> and articles.
>
> I invite further discussion on the use of this LinkedIn page.
>
> Thank you,
> Ian
>
> [1] https://lists.apache.org/thread/nyoq9t114t2ko8f3jflzn9dzcsljvptq
> [2] https://x.com/ApacheArrow
> [3]
>
> https://techcrunch.com/2024/03/28/linkedin-targets-users-caught-between-tiktok-and-twitter-world-games-videos/
> [4] https://www.linkedin.com/company/apacheiceberg/
> [5] https://www.linkedin.com/company/apache-hudi/
> [6] https://www.linkedin.com/company/apachespark/
> [7] https://www.linkedin.com/company/doris-apache/
> [8] https://www.apache.org/foundation/marks/socialmedia
> [9] https://www.linkedin.com/company/apache-arrow
>


[DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Ian Cook
Following the discussion [1] earlier this year about the status of the
Apache Arrow Twitter / X account [2], I have seen several news stories
citing declines in use of X and increases in use of LinkedIn (for example
[3]). Anecdotally I have seen that the types of conversations about open
source data technologies that in the past would have taken place on Twitter
are now often taking place on LinkedIn.

Many other ASF projects now have LinkedIn pages ([4][5][6][7] to name a
few). LinkedIn does not provide a type of account or page specifically for
open source projects, but the norm among other ASF projects is to use
LinkedIn company pages. It appears that a company page is the only suitable
type of LinkedIn page for this purpose.

After reviewing the ASF best practices for project social media accounts
[8], I created a LinkedIn company page for Apache Arrow [9]. I configured
the page so that the hyphen in the URL is optional; the
unhyphenated version redirects to the hyphenated version.

Permissions for LinkedIn company pages are managed by assigning roles for
the page to one or more LinkedIn user accounts. There are no credentials to
be shared. I assigned Raúl Cumplido's LinkedIn account as superadmin user
for this page. Raúl is a member of the PMC and is the person who currently
manages our X account. My LinkedIn account is secure and I have verified
with Raúl that his LinkedIn account is secure (strong password used, MFA
enabled).

As discussed in [1], this page could be used to cross-post the same content
that is currently being posted to X. For now this can be done manually;
perhaps later we can use a syndication tool. This page also gives LinkedIn
users the ability to tag "@Apache Arrow" in their LinkedIn posts, comments,
and articles.

I invite further discussion on the use of this LinkedIn page.

Thank you,
Ian

[1] https://lists.apache.org/thread/nyoq9t114t2ko8f3jflzn9dzcsljvptq
[2] https://x.com/ApacheArrow
[3]
https://techcrunch.com/2024/03/28/linkedin-targets-users-caught-between-tiktok-and-twitter-world-games-videos/
[4] https://www.linkedin.com/company/apacheiceberg/
[5] https://www.linkedin.com/company/apache-hudi/
[6] https://www.linkedin.com/company/apachespark/
[7] https://www.linkedin.com/company/doris-apache/
[8] https://www.apache.org/foundation/marks/socialmedia
[9] https://www.linkedin.com/company/apache-arrow


Re: [DISCUSS] Drop Java 8 support

2024-05-24 Thread Weston Pace
No vote is required from an ASF perspective (this is not a release).
No vote is required from Arrow conventions (this is not a spec change and
does not impact more than one implementation).

I will send a message to the parquet ML to solicit feedback.

On Fri, May 24, 2024 at 8:22 AM Laurent Goujon 
wrote:

> I would say so because it is akin to removing a large feature, but maybe
> someone from the PMC can chime in?
>
> Laurent
>
> On Tue, May 21, 2024 at 12:16 PM Dane Pitkin  wrote:
>
> > I haven't been active in Apache Parquet, but I did not see any prior
> > discussions on this topic in their Jira or dev mailing list.
> >
> > Do we think a vote is needed before officially moving forward with Java 8
> > deprecation?
> >
> > On Mon, May 20, 2024 at 12:50 PM Laurent Goujon
>  > >
> > wrote:
> >
> > > I also mentioned Apache Parquet and haven't seen anyone mention if/when
> > > Apache Parquet would transition.
> > >
> > >
> > >
> > > On Fri, May 17, 2024 at 9:07 AM Dane Pitkin 
> wrote:
> > >
> > > > Fokko, thank you for these datapoints! It's great to see how other
> low
> > > > level Java OSS projects are approaching this.
> > > >
> > > > JB, I believe yes we have formal consensus to drop Java 8 in Arrow.
> > There
> > > > was no contention in current discussions across [GitHub issues |
> Arrow
> > > > Mailing List | Community Syncs].
> > > >
> > > > We can save Java 11 deprecation for a future discussion. For users on
> > > Java
> > > > 11, I do anticipate this discussion to come shortly after Java 8
> > > > deprecation is released.
> > > >
> > > > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong 
> > > > wrote:
> > > >
> > > > > I was traveling the last few weeks, so just a follow-up from my
> end.
> > > > >
> > > > > Fokko, can you elaborate on the discussions held in other OSS
> > projects
> > > to
> > > > >> drop Java <17? How did they weigh the benefits/drawbacks for
> > dropping
> > > > both
> > > > >> Java 8 and 11 LTS versions? I'd also be curious if other projects
> > plan
> > > > to
> > > > >> support older branches with security patches.
> > > > >
> > > > >
> > > > > So, the ones that I'm involved with (including a TLDR):
> > > > >
> > > > >- Avro:
> > > > >   - (April 2024: Consensus on moving to 11+, +1 for moving to
> > 17+)
> > > > >
> > https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > > > >   - (Jan 2024: Consensus on dropping 8)
> > > > >
> > https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > > > >   - Iceberg:
> > > > >   - (Jan 2023: Concerns about Hive):
> > > > >
> > https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > > > >   - (Feb 2024: Consensus to drop Hadoop 2.x, and move to
> JDK11+,
> > > > >   also +1's for moving to 17+):
> > > > >
> > https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > > > >
> > > > > I think the most noteworthy (slow-moving in general):
> > > > >
> > > > >- Spark 4 supports JDK 17+
> > > > >- Hive 4 is still on Java 8
> > > > >
> > > > >
> > > > > It looks like most of the projects are looking at each other. Keep
> in
> > > > > mind, that projects that still support older versions of Java, can
> > > still
> > > > > use older versions of Arrow.
> > > > >
> > > > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > > > (in case the image doesn't come through, that's Spiderman pointing
> at
> > > > > Spiderman)
> > > > >
> > > > > Concerning the Java 11 support, some data:
> > > > >
> > > > >- Oracle 11: support until January 2032 (extended fee has been
> > > waived)
> > > > >- Corretto 11: September 2027
> > > > >- Adoptium 11: At least Oct 2027
> > > > >- Zulu 11: Jan 2032
> > > > >- OpenJDK11: October 2024
> > > > >
> > > > > I think it is fair to support 11 for the time being, but at some
> > point,
> > > > we
> > > > > also have to move on and start exploiting the new features and make
> > > sure
> > > > > that we keep up to date. For example, Java 8 also has extended
> > support
> > > > > until 2030. Dependabot on the Iceberg project
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3Adependencies
> > > > >
> > > > > nicely shows which projects are already at JDK11+ :)
> > > > >
> > > > > Thanks Dane for driving this!
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Op vr 17 mei 2024 om 07:44 schreef Jean-Baptiste Onofré <
> > > j...@nanthrax.net
> > > > >:
> > > > >
> > > > >> Hi Dane
> > > > >>
> > > > >> Do we have a formal consensus about Java version in regards of
> arrow
> > > > >> version ?
> > > > >> I agree with the plan but just wondering if it’s ok from everyone
> > with
> > > > the
> > > > >> community.
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> Le jeu. 16 mai 2024 à 18:05, Dane Pitkin  a
> > > écrit :
> > > > >>
> > > > >> > To wra

Re: [DISCUSS] Drop Java 8 support

2024-05-24 Thread Laurent Goujon
I would say so because it is akin to removing a large feature, but maybe
someone from the PMC can chime in?

Laurent

On Tue, May 21, 2024 at 12:16 PM Dane Pitkin  wrote:

> I haven't been active in Apache Parquet, but I did not see any prior
> discussions on this topic in their Jira or dev mailing list.
>
> Do we think a vote is needed before officially moving forward with Java 8
> deprecation?
>
> On Mon, May 20, 2024 at 12:50 PM Laurent Goujon  >
> wrote:
>
> > I also mentioned Apache Parquet and haven't seen anyone mention if/when
> > Apache Parquet would transition.
> >
> >
> >
> > On Fri, May 17, 2024 at 9:07 AM Dane Pitkin  wrote:
> >
> > > Fokko, thank you for these datapoints! It's great to see how other low
> > > level Java OSS projects are approaching this.
> > >
> > > JB, I believe yes we have formal consensus to drop Java 8 in Arrow.
> There
> > > was no contention in current discussions across [GitHub issues | Arrow
> > > Mailing List | Community Syncs].
> > >
> > > We can save Java 11 deprecation for a future discussion. For users on
> > Java
> > > 11, I do anticipate this discussion to come shortly after Java 8
> > > deprecation is released.
> > >
> > > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong 
> > > wrote:
> > >
> > > > I was traveling the last few weeks, so just a follow-up from my end.
> > > >
> > > > Fokko, can you elaborate on the discussions held in other OSS
> projects
> > to
> > > >> drop Java <17? How did they weigh the benefits/drawbacks for
> dropping
> > > both
> > > >> Java 8 and 11 LTS versions? I'd also be curious if other projects
> plan
> > > to
> > > >> support older branches with security patches.
> > > >
> > > >
> > > > So, the ones that I'm involved with (including a TLDR):
> > > >
> > > >- Avro:
> > > >   - (April 2024: Consensus on moving to 11+, +1 for moving to
> 17+)
> > > >
> https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > > >   - (Jan 2024: Consensus on dropping 8)
> > > >
> https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > > >   - Iceberg:
> > > >   - (Jan 2023: Concerns about Hive):
> > > >
> https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > > >   - (Feb 2024: Consensus to drop Hadoop 2.x, and move to JDK11+,
> > > >   also +1's for moving to 17+):
> > > >
> https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > > >
> > > > I think the most noteworthy (slow-moving in general):
> > > >
> > > >- Spark 4 supports JDK 17+
> > > >- Hive 4 is still on Java 8
> > > >
> > > >
> > > > It looks like most of the projects are looking at each other. Keep in
> > > > mind, that projects that still support older versions of Java, can
> > still
> > > > use older versions of Arrow.
> > > >
> > > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > > (in case the image doesn't come through, that's Spiderman pointing at
> > > > Spiderman)
> > > >
> > > > Concerning the Java 11 support, some data:
> > > >
> > > >- Oracle 11: support until January 2032 (extended fee has been
> > waived)
> > > >- Corretto 11: September 2027
> > > >- Adoptium 11: At least Oct 2027
> > > >- Zulu 11: Jan 2032
> > > >- OpenJDK11: October 2024
> > > >
> > > > I think it is fair to support 11 for the time being, but at some
> point,
> > > we
> > > > also have to move on and start exploiting the new features and make
> > sure
> > > > that we keep up to date. For example, Java 8 also has extended
> support
> > > > until 2030. Dependabot on the Iceberg project
> > > > <
> > >
> >
> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3Adependencies
> > > >
> > > > nicely shows which projects are already at JDK11+ :)
> > > >
> > > > Thanks Dane for driving this!
> > > >
> > > > Kind regards,
> > > > Fokko
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Op vr 17 mei 2024 om 07:44 schreef Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >:
> > > >
> > > >> Hi Dane
> > > >>
> > > >> Do we have a formal consensus about Java version in regards of arrow
> > > >> version ?
> > > >> I agree with the plan but just wondering if it’s ok from everyone
> with
> > > the
> > > >> community.
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >> Le jeu. 16 mai 2024 à 18:05, Dane Pitkin  a
> > écrit :
> > > >>
> > > >> > To wrap up this thread on Java 8 deprecation, here is my current
> > plan
> > > of
> > > >> > action:
> > > >> >
> > > >> > 1) Arrow v17 will be the last version supporting Java 8 and the
> > > release
> > > >> > notes will warn of its impending deprecation.
> > > >> > 2) Arrow v18 will be the first release supporting min version Java
> > 11.
> > > >> >
> > > >> > I have updated the GH issue[1] to reflect this.
> > > >> >
> > > >> > [1]https://github.com/apache/arrow/issues/38051
> > > >> >
> > > >> > On Wed, May 8, 2024 at 5:46 PM Dane Pitkin
> > >  > > >> >
> > > >> > wrote:
> > > >> >
> > > >> > > Thank 

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Weston Pace
> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

This has been something that has always been desired for the Arrow IPC
format too.

My preference would be (apologies if this has been mentioned before):

- Agree on how statistics should be encoded into an array (this is not
hard, we just have to agree on the field order and the data type for
null_count)
- If you need statistics in the schema, then simply encode the 1-row batch
into an IPC buffer (using the streaming format), or maybe just an IPC
RecordBatch message since the schema is fixed, and store those bytes in the
schema
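
As a rough pyarrow sketch of that second step (the statistics field names and
the metadata key below are placeholders, not an agreed spec):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Assumed layout for a 1-row statistics batch (illustrative only).
    stats_schema = pa.schema([
        pa.field("column_name", pa.utf8()),
        pa.field("null_count", pa.int64()),
        pa.field("min", pa.int64()),
        pa.field("max", pa.int64()),
    ])
    stats_batch = pa.RecordBatch.from_arrays(
        [pa.array(["x"]),
         pa.array([0], pa.int64()),
         pa.array([1], pa.int64()),
         pa.array([42], pa.int64())],
        schema=stats_schema)

    # Encode the 1-row batch with the IPC streaming format...
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, stats_schema) as writer:
        writer.write_batch(stats_batch)
    stats_bytes = sink.getvalue().to_pybytes()

    # ...and store those bytes in the data schema's metadata under an
    # illustrative (not standardized) key.
    data_schema = pa.schema([pa.field("x", pa.int32())]).with_metadata(
        {b"example:statistics": stats_bytes})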



On Fri, May 24, 2024 at 1:20 AM Sutou Kouhei  wrote:

> Hi,
>
> Could you explain more about your idea? Does it propose that
> we add more callbacks to ArrowArrayStream such as
> ArrowArrayStream::get_statistics()? Or does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May
> 2024 06:55:40 -0700,
>   Curt Hagenlocher  wrote:
>
> >>  would it be easier to request statistics at a higher level of
> > abstraction?
> >
> > What if there were a "single table provider" level of abstraction between
> > ADBC and ArrowArrayStream as a C API; something that can report
> statistics
> > and apply simple predicates?
> >
> > On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> >  wrote:
> >
> >> Thank you for the background! I understand that these statistics are
> >> important for query planning; however, I am not sure that I follow why
> >> we are constrained to the ArrowSchema to represent them. The examples
> >> given seem to be going through Python... would it be easier to request
> >> statistics at a higher level of abstraction? There would already need
> >> to be a separate mechanism to request an ArrowArrayStream with
> >> statistics (unless the PyCapsule `requested_schema` argument would
> >> suffice).
> >>
> >> > ADBC may be a bit larger to use only for transmitting
> >> > statistics. ADBC has statistics related APIs but it has more
> >> > other APIs.
> >>
> >> Some examples of producers given in the linked threads (Delta Lake,
> >> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> >> can implement an ADBC driver without defining all the methods (where
> >> the producer could call AdbcConnectionGetStatistics(), although
> >> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >> exist). One example listed (using an Arrow Table as a source) seems a
> >> bit light to wrap in an ADBC driver; however, it would not take much
> >> code to do so, and the overhead of getting the reader via ADBC is
> >> something like 100 microseconds (tested via the ADBC R package's
> >> "monkey driver" which wraps an existing stream as a statement). In any
> >> case, the bulk of the code is building the statistics array.
> >>
> >> > How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >>
> >> Whatever format for statistics is decided on, I imagine it should be
> >> exactly the same as the ADBC standard? (Perhaps pushing changes
> >> upstream if needed?).
> >>
> >> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice
> >> >
> >> > It seems that we should use the approach because all
> >> > feedback said so. How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >> >
> >> > | Field Name   | Field Type| Comments |
> >> > |--|---|  |
> >> > | column_name  | utf8  | (1)  |
> >> > | statistic_key| utf8 not null | (2)  |
> >> > | statistic_value  | VALUE_SCHEMA not null |  |
> >> > | statistic_is_approximate | bool not null | (3)  |
> >> >
> >> > 1. If null, then the statistic applies to the entire table.
> >> >It's for "row_count".
> >> > 2. We'll provide pre-defined keys such as "max", "min",
> >> >"byte_width" and "distinct_count" but users can also use
> >> >application specific keys.
> >> > 3. If true, then the value is approximate or best-effort.
> >> >
> >> > VALUE_SCHEMA is a dense union with members:
> >> >
> >> > | Field Name | Field Type |
> >> > |||
> >> > | int64  | int64  |
> >> > | uint64 | uint64 |
> >> > | float64| float64|
> >> > | binary | binary |
> >> >
> >> > If a column is an int32 column, it uses int64 for
> >> > "max"/"min". We don't provide all types here. Users should
> >> > use a compatible type (int64 for a int32 column) in

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
Hi,

Could you explain more about your idea? Does it propose that
we add more callbacks to ArrowArrayStream such as
ArrowArrayStream::get_statistics()? Or does it propose that
we define one more Arrow C XXX interface that wraps
ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?

ArrowDeviceArray:
https://arrow.apache.org/docs/format/CDeviceDataInterface.html


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
06:55:40 -0700,
  Curt Hagenlocher  wrote:

>>  would it be easier to request statistics at a higher level of
> abstraction?
> 
> What if there were a "single table provider" level of abstraction between
> ADBC and ArrowArrayStream as a C API; something that can report statistics
> and apply simple predicates?
> 
> On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
>  wrote:
> 
>> Thank you for the background! I understand that these statistics are
>> important for query planning; however, I am not sure that I follow why
>> we are constrained to the ArrowSchema to represent them. The examples
>> given seem to be going through Python... would it be easier to request
>> statistics at a higher level of abstraction? There would already need
>> to be a separate mechanism to request an ArrowArrayStream with
>> statistics (unless the PyCapsule `requested_schema` argument would
>> suffice).
>>
>> > ADBC may be a bit larger to use only for transmitting
>> > statistics. ADBC has statistics related APIs but it has more
>> > other APIs.
>>
>> Some examples of producers given in the linked threads (Delta Lake,
>> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
>> can implement an ADBC driver without defining all the methods (where
>> the producer could call AdbcConnectionGetStatistics(), although
>> AdbcStatementGetStatistics() might be more relevant here and doesn't
>> exist). One example listed (using an Arrow Table as a source) seems a
>> bit light to wrap in an ADBC driver; however, it would not take much
>> code to do so, and the overhead of getting the reader via ADBC is
>> something like 100 microseconds (tested via the ADBC R package's
>> "monkey driver" which wraps an existing stream as a statement). In any
>> case, the bulk of the code is building the statistics array.
>>
>> > How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>>
>> Whatever format for statistics is decided on, I imagine it should be
>> exactly the same as the ADBC standard? (Perhaps pushing changes
>> upstream if needed?).
>>
>> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>> >
>> > Hi,
>> >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice
>> >
>> > It seems that we should use the approach because all
>> > feedback said so. How about the following schema for the
>> > statistics ArrowArray? It's based on ADBC.
>> >
>> > | Field Name   | Field Type| Comments |
>> > |--|---|  |
>> > | column_name  | utf8  | (1)  |
>> > | statistic_key| utf8 not null | (2)  |
>> > | statistic_value  | VALUE_SCHEMA not null |  |
>> > | statistic_is_approximate | bool not null | (3)  |
>> >
>> > 1. If null, then the statistic applies to the entire table.
>> >It's for "row_count".
>> > 2. We'll provide pre-defined keys such as "max", "min",
>> >"byte_width" and "distinct_count" but users can also use
>> >application specific keys.
>> > 3. If true, then the value is approximate or best-effort.
>> >
>> > VALUE_SCHEMA is a dense union with members:
>> >
>> > | Field Name | Field Type |
>> > |||
>> > | int64  | int64  |
>> > | uint64 | uint64 |
>> > | float64| float64|
>> > | binary | binary |
>> >
>> > If a column is an int32 column, it uses int64 for
>> > "max"/"min". We don't provide all types here. Users should
>> > use a compatible type (int64 for a int32 column) instead.
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In 
>> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
>> 2024 17:04:57 +0200,
>> >   Antoine Pitrou  wrote:
>> >
>> > >
>> > > Hi Kou,
>> > >
>> > > I agree that Dewey that this is overstretching the capabilities of the
>> > > C Data Interface. In particular, stuffing a pointer as metadata value
>> > > and decreeing it immortal doesn't sound like a good design decision.
>> > >
>> > > Why not simply pass the statistics ArrowArray separately in your
>> > > producer API of choice (Dewey mentioned ADBC but it is of course just
>> > > a possible API among others)?
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> > >
>> > > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
>> > >> Hi,
>> > >> We're discussing how to provide statistics through the C
>> > >> data interface at:
>> > >> https://github.com/apache/arrow/issues/38837

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Sutou Kouhei
Hi,

>I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them.

Ah, sorry. Using ArrowSchema isn't required; it's just one
idea. We can choose another approach, such as just defining a
schema for a statistics ArrowArray, as I proposed.

> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).

I think that we can use a simpler one for this. For example,
ADBC uses a dictionary-encoding-like approach for statistics
keys. That requires an additional ID-to-name mapping for
application-specific statistics keys. We can use just the name
for it.
See also the related discussion on the issue:
https://github.com/apache/arrow/issues/38837#issuecomment-2108895904
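
For illustration, here is a rough pyarrow sketch of building two rows of such a
statistics array, with plain string keys and the proposed dense-union value
type; nothing here is finalized:

    import pyarrow as pa

    # Two example statistics: a table-level "row_count" (column_name is null)
    # and a per-column "max" for column "x". Both values happen to live in the
    # int64 child of the dense union.
    type_ids = pa.array([0, 0], type=pa.int8())   # which union child each row uses
    offsets = pa.array([0, 1], type=pa.int32())   # index into that child
    children = [
        pa.array([1000, 42], type=pa.int64()),    # int64
        pa.array([], type=pa.uint64()),           # uint64
        pa.array([], type=pa.float64()),          # float64
        pa.array([], type=pa.binary()),           # binary
    ]
    statistic_value = pa.UnionArray.from_dense(
        type_ids, offsets, children,
        field_names=["int64", "uint64", "float64", "binary"])

    statistics = pa.RecordBatch.from_arrays(
        [pa.array([None, "x"], type=pa.utf8()),           # column_name
         pa.array(["row_count", "max"], type=pa.utf8()),  # statistic_key
         statistic_value,                                  # statistic_value
         pa.array([False, False], type=pa.bool_())],       # statistic_is_approximate
        names=["column_name", "statistic_key",
               "statistic_value", "statistic_is_approximate"])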


Thanks,
-- 
kou

In 
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May 2024 
09:57:05 -0300,
  Dewey Dunnington  wrote:

> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python... would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
> 
>> ADBC may be a bit larger to use only for transmitting
>> statistics. ADBC has statistics related APIs but it has more
>> other APIs.
> 
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so, and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
> 
>> How about the following schema for the
>> statistics ArrowArray? It's based on ADBC.
> 
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
> 
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
>>
>> Hi,
>>
>> > Why not simply pass the statistics ArrowArray separately in your
>> > producer API of choice
>>
>> It seems that we should use the approach because all
>> feedback said so. How about the following schema for the
>> statistics ArrowArray? It's based on ADBC.
>>
>> | Field Name   | Field Type| Comments |
>> |--|---|  |
>> | column_name  | utf8  | (1)  |
>> | statistic_key| utf8 not null | (2)  |
>> | statistic_value  | VALUE_SCHEMA not null |  |
>> | statistic_is_approximate | bool not null | (3)  |
>>
>> 1. If null, then the statistic applies to the entire table.
>>It's for "row_count".
>> 2. We'll provide pre-defined keys such as "max", "min",
>>"byte_width" and "distinct_count" but users can also use
>>application specific keys.
>> 3. If true, then the value is approximate or best-effort.
>>
>> VALUE_SCHEMA is a dense union with members:
>>
>> | Field Name | Field Type |
>> |||
>> | int64  | int64  |
>> | uint64 | uint64 |
>> | float64| float64|
>> | binary | binary |
>>
>> If a column is an int32 column, it uses int64 for
>> "max"/"min". We don't provide all types here. Users should
>> use a compatible type (int64 for a int32 column) instead.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May 
>> 2024 17:04:57 +0200,
>>   Antoine Pitrou  wrote:
>>
>> >
>> > Hi Kou,
>> >
>> > I agree that Dewey that this is overstretching the capabilities of the
>> > C Data Interface. In particular, stuffing a pointer as metadata value
>> > and decreeing it immortal doesn't sound like a good design decision.
>> >
>> > Why not simply pass the statistics ArrowArray separately in your
>> > producer API of choice (Dewey mentioned ADBC but it is of course just
>> > a possible API among others)?
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :
>> >> Hi,
>> >> We're discussing how to provide statistics through the C
>> >> data interface at:
>> >> https://github.com/apache/arrow/issues/38837
>> >> If you're interested in this feature, could