Re: Community Over Code NA next week - Data Engineering track (with Security twist)

2024-10-04 Thread Antoine Pitrou
I see that there's a European variant of that event which seems more adapted for at least some of the Arrow development community: https://eu.communityovercode.org/ Le 04/10/2024 à 10:50, Raúl Cumplido a écrit : Hi Jarek, It seems really interesting, I won't be able to attend. Do you know

[Discuss][C++] Deprecate precompiled headers option?

2024-10-02 Thread Antoine Pitrou
Hello, Long ago, we added a ARROW_USE_PRECOMPILED_HEADERS to the Arrow C++ CMake options in the hope of speeding up builds by reducing C++ header parsing time. However, we later started to use a concurrent (*) solution added in CMake itself: CMAKE_UNITY_BUILD, which merges batches of sourc

Re: [ANNOUNCE] New Arrow committer: Will Ayd

2024-10-01 Thread Antoine Pitrou
Hello Will, and thanks a lot for your involvement! Le 01/10/2024 à 18:55, Dewey Dunnington a écrit : On behalf of the Arrow PMC, I'm happy to announce that Will Ayd has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions! -dewey

Re: [DISCUSS][C++] Can we use "0E+1" not "0.E+1" for decimal for broader compatibility?

2024-10-01 Thread Antoine Pitrou
Hi Kou, That sounds fine to me. Regards Antoine. Le 01/10/2024 à 03:55, Sutou Kouhei a écrit : Hi, The current decimal implementation omits the fractional part if the fractional part is 0. For example: "0.E+1" not "0.0E+1" Most environments such as Python, Node.js, PostgreSQL and MySQL a
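
As an illustration of the two spellings discussed in this thread, here is a minimal Python sketch that uses only the standard `decimal` module. The helper `normalize_zero_fraction` is hypothetical and is not part of Arrow; it only shows what dropping the bare dot before the exponent looks like.

```python
import re
from decimal import Decimal


def normalize_zero_fraction(text: str) -> str:
    """Drop a '.' that sits directly between a digit and the exponent marker.

    Hypothetical helper for illustration: '0.E+1' becomes '0E+1',
    strings with a real fractional part are left unchanged.
    """
    return re.sub(r"(\d)\.(?=[eE])", r"\1", text)


# Python's decimal module renders a zero with exponent 1 without a dot.
zero_with_exponent = Decimal((0, (0,), 1))  # sign=0, digits=(0,), exponent=1
print(str(zero_with_exponent))

print(normalize_zero_fraction("0.E+1"))
print(normalize_zero_fraction("1.5E+3"))
```

The regular expression only touches a dot that has no fractional digits after it, so ordinary values such as `1.5E+3` pass through untouched.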

Re: [CROWDSOURCING] Arrow board report due October 9

2024-09-30 Thread Antoine Pitrou
*they receive Le 30/09/2024 à 11:57, Antoine Pitrou a écrit : There might be a misunderstanding, but this is a report for the Apache Software Foundation (they recent reports from hundreds of projects). It's not really useful to copy our release notes there. Regards Antoine. Le

Re: [CROWDSOURCING] Arrow board report due October 9

2024-09-30 Thread Antoine Pitrou
There might be a misunderstanding, but this is a report for the Apache Software Foundation (they receive reports from hundreds of projects). It's not really useful to copy our release notes there. Regards Antoine. Le 30/09/2024 à 11:46, Vibhatha Abeykoon a écrit : Hi Andy, Thanks for sha

Re: [DISCUSS][Flight] Improved Arrow Flight as alternative to Iceberg for DB--engine interop

2024-09-13 Thread Antoine Pitrou
se's (Databend, Doris, Druid, DeepLake, Firebolt, Lance, Oxla, Pinot, QuestDB, SingleStore, etc.) native at-rest partition file formats. On Fri, 13 Sept 2024 at 16:43, Antoine Pitrou wrote: Hello, I'm perplexed by this discussion. If you want to send highly-compressed files over

Re: [DISCUSS][Flight] Improved Arrow Flight as alternative to Iceberg for DB--engine interop

2024-09-13 Thread Antoine Pitrou
Hello, I'm perplexed by this discussion. If you want to send highly-compressed files over the network that is already possible: just send Parquet files via HTTP(S) (or another protocol of choice). Arrow Flight is simply a *streaming* protocol that allows sending/requesting the Arrow format over

Re: [DISCUSS][C++] Should we disallow storage account key in Azure file system URL?

2024-09-12 Thread Antoine Pitrou
Hi, I sympathize with the security argument. If no other library allows for embedding the Azure password directly in the URL, then I would be ok for deprecating it. Regards Antoine. Le 10/09/2024 à 03:24, Sutou Kouhei a écrit : Hi, The current Azure file system URI accepts account key

Re: [DISCUSS] Monorepo GitHub workflow: allow one issue with multiple PRs

2024-09-12 Thread Antoine Pitrou
Hi, I don't have a specific opinion on this, but as a data point, this already happens from time to time (though rarely). Regards Antoine. Le 11/09/2024 à 17:32, Joris Van den Bossche a écrit : Hi all, This is a discussion specifically for the GitHub development workflow we use in the m

Re: [VOTE] Allow Decimal32 and Decimal64 bitwidths in Arrow Format

2024-09-05 Thread Antoine Pitrou
+1 (binding). Can you open a PR with the spec updates? Regards Antoine. Le 04/09/2024 à 23:17, Matt Topol a écrit : Based on various discussions among the ecosystem and to continue expanding the zero-copy interoperability for Arrow to be used with different libraries and databases (such as

Re: [DISCUSS][C++] Indent #if (preprocessor directives)

2024-08-28 Thread Antoine Pitrou
Is there a way to ensure this is done automatically? Regards Antoine. On Wed, 28 Aug 2024 10:05:45 +0900 (JST) Sutou Kouhei wrote: Hi, How about indenting preprocessor directives for readability? Issue: https://github.com/apache/arrow/issues/43796 PR: https://github.com/apache

Re: [VOTE] Split Go release process

2024-08-26 Thread Antoine Pitrou
+1 (binding) Le 26/08/2024 à 04:37, Sutou Kouhei a écrit : Hi, I would like to propose splitting Go release process. Motivation: * We want to reduce needless major releases because major releases require users' change Approach: 1. Extract go/ in apache/arrow to apache/arrow-go like a

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Le 22/08/2024 à 17:08, Curt Hagenlocher a écrit : (I also happen to want a canonical Arrow representation for variant data, as this type occurs in many databases but doesn't have a great representation today in ADBC results. That's why I filed [Format] Consider adding an official variant type

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
u, Aug 22, 2024 at 3:51 PM Antoine Pitrou wrote: Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to unde

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about. Regards Antoine. Le 22/08/2

Re: [VOTE][Format] Bool8 Canonical Extension Type

2024-08-05 Thread Antoine Pitrou
Binding +1 (but posted one minor comment on the format PR). Thank you Joel! Regards Antoine. Le 05/08/2024 à 14:59, Joel Lubinitsky a écrit : Hello Devs, I would like to propose a new canonical extension type: Bool8 The prior mailing list discussion thread can be found at [1]. The format

Re: [DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-05 Thread Antoine Pitrou
I don't have any concrete data to test this against, but using 64-bit offsets sounds like an obvious improvement to me. Regards Antoine. Le 01/08/2024 à 13:05, Ruoxi Sun a écrit : Hello everyone, We've identified an issue with Acero's hash join/aggregation, which is currently limited to
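
To make the offset-width argument concrete, here is a small arithmetic sketch in Python. It assumes the existing row offsets are unsigned 32-bit byte offsets (the quoted message is truncated, so this is an assumption), and the function names are made up for illustration.

```python
import struct


def max_rows(offset_bits: int, row_bytes: int) -> int:
    """Largest row count whose end offset still fits in an unsigned offset."""
    return (2**offset_bits - 1) // row_bytes


def fits_uint32(offset: int) -> bool:
    """Return True if the byte offset can be stored in an unsigned 32-bit field."""
    try:
        struct.pack("<I", offset)
        return True
    except struct.error:
        return False


# With 64-byte rows, 32-bit offsets run out after roughly 67 million rows.
print(max_rows(32, 64))
print(max_rows(64, 64))
print(fits_uint32(2**32 - 1), fits_uint32(2**32))
```

Wider rows lower the ceiling proportionally, which is why a row table holding large variable-length values reaches the limit much sooner than the row count alone would suggest.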

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Antoine Pitrou
Le 22/07/2024 à 21:25, Joel Lubinitsky a écrit : If Canonical Extensions had existed at the time, I think there's a chance we may have ended up with int32 Date as a first class type and int64 MillisecondDate as a Canonical Extension type. Agreed. Are there any lessons we've learned from im

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
I can't find now that new types should be implemented as extension types if possible for these (and perhaps other) reasons. On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou wrote: Agreed with Felipe. This is meant for communicating with no

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
out any provisions on the specification that might make this impossible. -dewey [1] https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < anto...@python.org>

Re: [DISCUSS] Split Go release process

2024-07-18 Thread Antoine Pitrou
Hi Kou, Le 18/07/2024 à 11:33, Sutou Kouhei a écrit : Here is my idea how to proceed this: 1. Extract go/ in apache/arrow to apache/arrow-go like apache/arrow-rs * Filter go/ related commits from apache/arrow and create apache/arrow-go with them like we did for apache/arrow-rs

Re: [Discuss][C++] Switch to mimalloc by default?

2024-07-16 Thread Antoine Pitrou
Hello, Thanks all for this discussion. Given that there was no strong argument against doing this, I decided to move forward and the change was made in https://github.com/apache/arrow/pull/40875 Regards Antoine. On Wed, 5 Jun 2024 17:18:36 +0200 Antoine Pitrou wrote: > Hello, > >

Re: Understanding possible synergies between arrow & zarr communities?

2024-07-16 Thread Antoine Pitrou
Hi Carl, Le 08/07/2024 à 18:43, Carl Boettiger a écrit : As an observer to both communities, I'm interested in if there is or might be more communication between the Pangeo community's focus on Zarr serialization with what the Arrow team has done with Parquet. I recognize that these are diff

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Antoine Pitrou
Hi Joel, This looks good to me in principle. Can you split the spec and the implementation(s) into separate PRs? Regards Antoine. Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : Hi Arrow devs, I'm working on adding an extension type for 8-bit booleans, and wanted to start a discuss

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-16 Thread Antoine Pitrou
[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Monday, July 15th, 2024 at 07:59, Antoine Pitrou wrote: No, because these marke

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-15 Thread Antoine Pitrou
No, because these markers also communicate the information to other implementations of S3 abstractions. An example of this is: https://docs.cyberduck.io/protocols/s3/#folders Regards Antoine. Le 13/07/2024 à 07:15, Aldrin a écrit : ...then I still expect the directory /foo to exist Rig

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Antoine Pitrou
Hi, Le 12/07/2024 à 12:21, Hyunseok Seo a écrit : *### Why Maintain Empty Directory Markers?* From what I understand, object stores like S3 do not have a concept of directories. The motivation behind maintaining these markers could be to manage the object store as if it were a traditional fi
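
The role of an empty directory marker can be sketched with a toy flat key/value store in Python. This is an illustration only, not the S3FileSystem implementation: the class and method names are hypothetical, and the marker convention (a zero-byte object whose key ends in `/`) is assumed here for the sake of the example.

```python
class FlatObjectStore:
    """Toy object store with keys only and no native directories."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)

    def create_dir(self, path):
        # A zero-byte object whose key ends in '/' stands in for the directory.
        self.put(path.rstrip("/") + "/")

    def dir_exists(self, path):
        # A "directory" exists if any key (marker or file) lives under its prefix.
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.objects)


store = FlatObjectStore()
store.put("foo/a.txt", b"x")
store.delete("foo/a.txt")
print(store.dir_exists("foo"))  # no marker: the directory vanished with its last file

store.create_dir("foo")
print(store.dir_exists("foo"))  # marker present: the empty directory is still visible
```

Without the marker, deleting the last object under a prefix makes the directory disappear for every client that lists the bucket, which is the behaviour the markers are meant to avoid.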

Re: [DISCUSS] Statistics through the C data interface

2024-07-01 Thread Antoine Pitrou
Hmmm, I strive to understand why a `(int32, utf8)` tuple for statistic keys would be any simpler to implement than either `int32` *or* `utf8` *or* `dictionary(int32, utf8)`. Let's keep in mind that we would like to keep things simple for consumers and producers of statistics. We should al

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project. OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO. Regards Antoine. Le 28/06/2024 à 21:52, Andrew

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
I'll note that PyArrow also allows defining user-defined functions and they are vectorized (the function arguments can be PyArrow arrays or scalars, depending on the context in which a function is being executed): https://arrow.apache.org/docs/python/compute.html#user-defined-functions My vo

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-12 Thread Antoine Pitrou
Le 12/06/2024 à 04:45, Sutou Kouhei a écrit : It seems that we need to disable MI_OVERRIDE explicitly to not define malloc() in libmimalloc.so: https://github.com/microsoft/mimalloc/blob/03020fbf81541651e24289d2f7033a772a50f480/CMakeLists.txt#L10 Yes, that's what we do when building the bund

Re: Unsupported/Other Type

2024-06-11 Thread Antoine Pitrou
Sorry, I had forgotten to comment on this. I think this is generally a good idea, but it would obviously need more eyes on it :-) Can other people go and take a look at David's PR below? Le 25/05/2024 à 04:47, David Li a écrit : I've put up a draft PR here: https://github.com/apache/arrow/

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
Le 11/06/2024 à 10:35, Sutou Kouhei a écrit : Hi, In <2a32f61c-dd22-4f3f-bc98-822dcb6b0...@python.org> "Re: [Discuss][C++] Switch to mimalloc by default?" on Tue, 11 Jun 2024 10:21:12 +0200, Antoine Pitrou wrote: I was thinking about find_package(). Good to know

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
Le 11/06/2024 à 10:01, Sutou Kouhei a écrit : 2. Is it OK that we add support for system mimalloc? Hmm... that sounds legitimate, but with the caveat that a system mimalloc can override the standard malloc/free functions. Would that affect an application using Arrow C++? Are you saying th

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-10 Thread Antoine Pitrou
Hi Kou, Le 09/06/2024 à 09:16, Sutou Kouhei a écrit : Questions: 1. Do we need to keep jemalloc support? Compatibility? Can we drop support for jemalloc to decrease maintenance cost? I'm not sure there's much maintenance cost. I expect some people might prefer jemalloc, and perhaps it

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
Le 09/06/2024 à 08:33, Sutou Kouhei a écrit : Fields:

| Name   | Type          | Comments |
|--------|---------------|----------|
| column | utf8          | (2)      |
| key    | utf8 not null | (3)      |

1. Should the key be

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
Le 09/06/2024 à 09:01, Sutou Kouhei a écrit : Hi, One thing that a plain integer makes more difficult is representing non-standard statistics. For example some engine might want to expose elaborate quantile-based statistics even if it not officially defined here. With a `utf8` or `dictionary(
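
The trade-off between plain integer keys and string or dictionary-encoded keys can be illustrated with a few lines of Python. This is a sketch of dictionary encoding in general, not of any Arrow API; `dictionary_encode` is a hypothetical helper, and `"acme:quantile_95"` is an invented example of an application-specific key.

```python
def dictionary_encode(keys):
    """Encode string keys as (indices, dictionary), first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for key in keys:
        if key not in positions:
            positions[key] = len(dictionary)
            dictionary.append(key)
        indices.append(positions[key])
    return indices, dictionary


# Standard keys and a non-standard one coexist without a central registry of integers.
keys = ["max", "min", "max", "acme:quantile_95", "min"]
indices, dictionary = dictionary_encode(keys)
print(indices)
print(dictionary)
```

The indices are as compact as plain integers, while the dictionary keeps each key self-describing, so a consumer that does not recognize a key can still report or skip it by name.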

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou
Le 07/06/2024 à 18:30, Felipe Oliveira Carvalho a écrit : On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote: Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit : I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by

Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Antoine Pitrou
Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit : I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or n

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou
Hi Kou, Thanks for pushing for this! Le 06/06/2024 à 11:27, Sutou Kouhei a écrit : 4. Standardize Apache Arrow schema for statistics and transmit statistics via separated API call that uses the C data interface [...] I think that 4. is the best approach in these candidates. I agr

[Discuss][C++] Switch to mimalloc by default?

2024-06-05 Thread Antoine Pitrou
Hello, Arrow C++ features a MemoryPool abstraction that allows using different allocators interchangeably. Several MemoryPool implementations are provided with Arrow C++ (though one can also build their own): - a jemalloc-based implementation, currently the default on Linux - a mimalloc-bas

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou
(Gang Wu, Antoine Pitrou, Wes McKinney) 9x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko Driesprong, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen Zhang, Rok Mihevc) Arrow: 6x +1 binding (Micah Kornfield, Antoine Pitrou, Andy Grove, Raúl Cumplido, Wes McKinney

Re: [C++] Thread deadlock in ObjectOutputStream

2024-05-29 Thread Antoine Pitrou
Hi Li! Sorry for the delay. It seems the problem lies here: https://github.com/apache/arrow/blob/9f5899019d23b2b1eae2fedb9f6be8827885d843/cpp/src/arrow/filesystem/s3fs.cc#L1858 The Future is marked finished with the ObjectOutputStream's mutex taken, and the Future's callback then triggers a c
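
The general hazard described here, running completion callbacks while still holding a non-reentrant mutex, can be sketched in Python. The `Stream` class below is hypothetical and is not Arrow's ObjectOutputStream; it only shows the usual remedy of collecting the callbacks under the lock and invoking them after the lock is released.

```python
import threading


class Stream:
    """Toy stream whose close() notifies callbacks outside the lock."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain mutex
        self._callbacks = []
        self._closed = False

    def on_close(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def is_closed(self):
        with self._lock:
            return self._closed

    def close(self):
        with self._lock:
            self._closed = True
            # Take ownership of the callbacks while the state is protected.
            callbacks, self._callbacks = self._callbacks, []
        # The lock is released here, so a callback may call back into the stream.
        for callback in callbacks:
            callback(self)


stream = Stream()
observed = []
stream.on_close(lambda s: observed.append(s.is_closed()))
stream.close()
print(observed)
```

Had `close()` invoked the callbacks inside the `with self._lock:` block, the callback's call to `is_closed()` would block forever on the same lock.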

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Antoine Pitrou
+1 (binding). Thanks for taking this up, Rok! Regards Antoine. Le 29/05/2024 à 16:14, Rok Mihevc a écrit : # sending this to both dev@arrow and dev@parquet Hi all, Following the ML discussion [1] I would like to propose a vote for parquet-cpp issues to be moved from Parquet Jira [2] to Arr

Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Antoine Pitrou
Is it somehow possible to be a "member" of this account to indicate that we have PMC status, or is that not possible within the LinkedIn membership/permissions model? Le 24/05/2024 à 18:04, Ian Cook a écrit : Following the discussion [1] earlier this year about the status of the Apache Ar

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
2. We'll provide pre-defined keys such as "max", "min", "byte_width" and "distinct_count" but users can also use application specific keys. 3. If true, then the value is approximate or best-effort.

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit : Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to desc

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou
Hi Kou, I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer as metadata value and decreeing it immortal doesn't sound like a good design decision. Why not simply pass the statistics ArrowArray separately in your produce

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-05-14 Thread Antoine Pitrou
I think these flags should be advisory and consumers should be free to ignore them. However, some consumers apparently would benefit from them to more faithfully represent the producer's intention. For example, in Arrow C++, we could perhaps have a ImportDatum function whose actual return t

Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) Le 19/04/2024 à 22:22, Rok Mihevc a écrit : Hi all, Following initial requests [1][2] and recent tangential ML discussion [3] I would like to propose a vote to add language for UUID canonical extension type to CanonicalExtensions.rst as in PR [4] and written below. A draft C++ and

Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259 requirement and the 3 current String types allowed. Regards Antoine. Le 30/04/2024 à 19:26, Rok Mihevc a écrit : Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC-8259 requiremen

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
mes, and so we could use this in that context). I think that I would still prefer a canonical extension type (with storage type null) over a new dedicated type. On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou wrote: Ah! Well, I think this could be an interesting proposal, but someone should

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
Ah! Well, I think this could be an interesting proposal, but someone should put a more formal proposal, perhaps as a draft PR. Regards Antoine. Le 17/04/2024 à 11:57, David Li a écrit : For an unsupported/other extension type. On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote: What

Re: AW: Personal feedback on your last release on Apache Arrow ADBC 0.11.0

2024-04-17 Thread Antoine Pitrou
Out of curiosity, did you notice this by chance or do you have some kind of script that processes ASF mailing-list archives for possible voting irregularities? Regards Antoine. Le 17/04/2024 à 10:44, Christofer Dutz a écrit : When looking at whimsy, I can’t see any person named Sutou Kou

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
eation of one-off nominal types for very specific use-cases? — Felipe On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards A

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a canonical type for UUID and JSON.

Re: [RFC] Enabling data frames in disaggregated shared memory

2024-04-10 Thread Antoine Pitrou
Hello John, Arrow IPC files can be backed quite naturally by shared memory, simply by memory-mapping them for reading. So if you have some pieces of shared memory containing Arrow IPC files, and they are reachable using a filesystem mount point, you're pretty much done. You can see an exam
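
The underlying mechanism, reading a file through a read-only memory mapping instead of copying it with `read()`, can be sketched with Python's standard `mmap` module. This is not the Arrow API; the file content and function names are placeholders. With PyArrow installed, the corresponding calls would presumably be `pyarrow.memory_map` followed by `pyarrow.ipc.open_file`.

```python
import mmap
import os
import tempfile


def write_blob(path, payload: bytes):
    """Write some bytes to a file that will later be memory-mapped."""
    with open(path, "wb") as f:
        f.write(payload)


def read_mapped(path, start, length) -> bytes:
    """Map the file read-only and return one slice of it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages backing this slice need to be resident in memory.
            return bytes(mapped[start:start + length])


path = os.path.join(tempfile.mkdtemp(), "blob.bin")
write_blob(path, b"header|payload")
print(read_mapped(path, 7, 7))
```

If the path points into a shared-memory mount, several processes mapping the same file see the same physical pages, which is the property the message relies on.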

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-09 Thread Antoine Pitrou
It seems that perhaps this discussion should be rebooted for each individual component, one at a time? Let's start with something simple and obvious, with some frequent contribution activity, such as perhaps Go? Le 09/04/2024 à 14:27, Joris Van den Bossche a écrit : I am also in favor o

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-07 Thread Antoine Pitrou
Le 28/03/2024 à 21:42, Jacob Wujciak a écrit : For Arrow C++ bindings like Arrow R and PyArrow having distinct versions would require additional work to both enable the use of different versions and ensure version compatibility is monitored and potentially updated if needed. We could simply

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Thanks. The Arrow spec does support multiple union members with the same type, but not all implementations do. The C++ implementation should support it, though to my surprise we do not seem to have any tests for it. If the Java implementation doesn't, then you can probably open an issue for
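
A union with two members of the same value type can be sketched in plain Python. This models the dense union idea (a per-slot member id plus a per-slot offset into that member) under a simplifying assumption: the type id is used directly as the child index. The class and field names are illustrative, not an Arrow API.

```python
from dataclasses import dataclass


@dataclass
class DenseUnion:
    field_names: list  # one name per union member
    children: list     # one value list per union member
    type_ids: list     # per slot: which member holds the value
    offsets: list      # per slot: position inside that member's values

    def get(self, i):
        member = self.type_ids[i]
        return self.field_names[member], self.children[member][self.offsets[i]]


# Two members with the same value type (float), distinguished only by name.
temperatures = DenseUnion(
    field_names=["celsius", "fahrenheit"],
    children=[[21.5, 19.0], [70.7]],
    type_ids=[0, 1, 0],
    offsets=[0, 0, 1],
)

for slot in range(3):
    print(temperatures.get(slot))
```

Because the member is selected by id rather than by value type, nothing in the layout prevents two members from sharing a type; an implementation that looks members up by type alone is what breaks in this case.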

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Can you explain what ADT means ? Le 02/04/2024 à 11:31, Finn Völkel a écrit : Hi, my question primarily concerns the union layout described at https://arrow.apache.org/docs/format/Columnar.html#union-layout There are two ways to use unions: - polymorphic vectors (world 1) - ADT st

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Antoine Pitrou
Regardless of whether they have different compression ratios, it doesn't explain why you would want a different compression *algorithm* altogether. The choice of a compression algorithm should basically be driven by two concerns: the acceptable space/time tradeoff (do you want to minimize d

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Antoine Pitrou
Hello Andrei, Le 23/03/2024 à 13:23, Andrei Lazăr a écrit : At this very moment, specifying different compression algorithms per column is supported and in my use case it is extremely helpful, as I have some columns (mostly containing floats), for which a compression algorithm like Snappy (or

Re: ADBC - OS-level driver manager

2024-03-20 Thread Antoine Pitrou
Also, with ADBC driver implementations currently in flux (none of them has reached the "stable" status in https://arrow.apache.org/adbc/main/driver/status.html), it might be a disservice to users to implicitly fetch drivers from potentially outdated DLLs on the current system. Regards Ant

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Antoine Pitrou
Congratulations Bryce, and keep up the good work! Regards Antoine. Le 18/03/2024 à 03:21, Nic Crane a écrit : On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions! N

Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-04 Thread Antoine Pitrou
I didn't run the release script but I'm +1 on this (binding). Regards Antoine. Le 04/03/2024 à 10:05, Raúl Cumplido a écrit : Hi, I would like to propose the following release candidate (RC0) of Apache Arrow version 15.0.1. This is a release consisting of 37 resolved GitHub issues[1]. Thi

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
et me know as I want as many parties in the community as possible to be part of this. Thanks everyone. --Matt On Tue, Feb 27, 2024 at 12:48 PM Antoine Pitrou wrote: Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adop

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adopted as an Arrow spec. Regards Antoine. Le 27/02/2024 à 18:35, Matt Topol a écrit : Hey all, I'd like to propose a vote for us to officially adopt the protocol described

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-14 Thread Antoine Pitrou
agenda for today's bi-weekly call. Thanks, Raúl El mar, 13 feb 2024 a las 23:20, Antoine Pitrou escribió: Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. Le 13/02/2024 à 21:10, Dane Pitkin a écrit : Hi all, Arrow

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-13 Thread Antoine Pitrou
Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. Le 13/02/2024 à 21:10, Dane Pitkin a écrit : Hi all, Arrow Java identified an issue[1] in the 15.0.0 release. There is an undefined symbol in the dataset module that cau

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-13 Thread Antoine Pitrou
l service as a fallback. Are these the intended semantics? If so, is there a way to include the original service in the list of locations without the implied precedence? Thanks, Joel On Mon, Feb 12, 2024 at 11:52 James Duong wrote: This seems like a good idea, and also improves consist

Re: [ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-12 Thread Antoine Pitrou
Hi Dewey, Le 12/02/2024 à 15:01, Dewey Dunnington a écrit : Apache Arrow nanoarrow is a small C library for building and interpreting Arrow C Data interface structures with bindings for users of the R programming language. Do you want to reconsider this sentence? It seems nanoarrow is starti

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-12 Thread Antoine Pitrou
Hello, This looks fine to me. Regards Antoine. Le 12/02/2024 à 14:46, David Li a écrit : Hello, I'd like to propose a slight update to Flight RPC to make Flight SQL work better in different deployment scenarios. Comments on the doc would be appreciated: https://docs.google.com/documen

Re: [DISCUSS] Proposal to expand Arrow Communications

2024-02-07 Thread Antoine Pitrou
I think we should find a proper descriptive name for the "high-performance protocol", because "high-performance" is vague and context-dependent, and also spreads unnecessary confusion about existing alternatives such as regular Arrow IPC. I would for example propose "Dissociated Arrow IPC"

Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-27 Thread Antoine Pitrou
My 2 cents : I don't understand what an open source project gains by publishing on a microblogging platform. As for Twitter specifically, its recent governance changes would be good reason for terminating the @ApacheArrow account, IMHO. Regards Antoine. Le 27/01/2024 à 23:06, Bryce Mecu

Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-25 Thread Antoine Pitrou
Hello, My own answers: 1) isDelta should be true only when a delta is being transmitted (to be appended to the existing dictionary with the same id); it should be false when a full dictionary is being transmitted (to replace the existing dictionary with the same id, if any) 2) yes, it coul
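
Answer 1) can be sketched as a small Python function that a reader might apply to each incoming dictionary batch. The function name and the in-memory representation (a dict of lists keyed by dictionary id) are hypothetical; rejecting a delta for an unknown id is an assumption made here, not something stated in the message.

```python
def apply_dictionary_batch(dictionaries, dict_id, values, is_delta):
    """Update the reader-side dictionary state for one dictionary batch."""
    if is_delta:
        if dict_id not in dictionaries:
            # Assumption: a delta needs an existing dictionary to append to.
            raise KeyError(f"delta for unknown dictionary id {dict_id}")
        # Delta: append to the existing dictionary with the same id.
        dictionaries[dict_id] = dictionaries[dict_id] + list(values)
    else:
        # Full dictionary: replace the existing one with the same id, if any.
        dictionaries[dict_id] = list(values)
    return dictionaries[dict_id]


state = {}
print(apply_dictionary_batch(state, 0, ["a", "b"], is_delta=False))
print(apply_dictionary_batch(state, 0, ["c"], is_delta=True))
print(apply_dictionary_batch(state, 0, ["z"], is_delta=False))
```

Indices already decoded against the old dictionary stay valid after a delta, since existing entries keep their positions; a replacement gives no such guarantee.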

Re: [DataFusion] New Blog Post -- DataFusion 34.0

2024-01-23 Thread Antoine Pitrou
Impressive, thank you! Le 23/01/2024 à 14:06, Andrew Lamb a écrit : If anyone is interested, here is a new blog post about the last 6 months in DataFusion[1] and where we are heading this year. Andrew [1]: https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/

Re: [DISC] Improve Arrow Release verification process

2024-01-19 Thread Antoine Pitrou
Well, if the main objective is to just follow the ASF Release guidelines, then our verification process can be simplified drastically. The ASF indeed just requires: """ Every ASF release MUST contain one or more source packages, which MUST be sufficient for a user to build and test the relea

Re: [VOTE] Release Apache Arrow 15.0.0 - RC1

2024-01-17 Thread Antoine Pitrou
Go verification fails on Ubuntu 22.04: ``` # google.golang.org/grpc ../../gopath/pkg/mod/google.golang.org/grpc@v1.58.3/server.go:2096:14: undefined: atomic.Int64 note: module requires Go 1.19 # github.com/apache/arrow/go/v15/arrow/avro arrow/avro/reader_types.go:594:16: undefined: fmt.Append

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Antoine Pitrou
Hi, For now, I would suggest that each implementation decides on their own strategy, because we don't have a clear idea of which is better (and extension types are probably not getting a lot of use yet). Regards Antoine. Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit : The main proble

Re: Java, dictionary ids and schema equality

2023-12-09 Thread Antoine Pitrou
Hi Curt, Yes, it's a problem in the Java implementation of these tests. Ideally this should be fixed, but doing so would require some amount of scaffolding. Regards Antoine. Le 09/12/2023 à 21:47, Curt Hagenlocher a écrit : I've (mostly) fixed the C# implementation of dictionary IPC but

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Antoine Pitrou
+1 (binding) Le 08/12/2023 à 20:42, David Li a écrit : Let's start a formal vote just so we're on the same page now that we've discussed a few things. I would like to propose we remove 'experimental' from Flight SQL and make it stable: - Remove the 'experimental' option from the Protobuf de

Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2023-12-06 Thread Antoine Pitrou
Hi, While this looks like a nice start, I would expect more precise recommendations for writing non-trivial services. Especially, one question is how to send both an application-specific POST request and an Arrow stream, or an application-specific GET response and an Arrow stream. This migh

Re: [Discussion][Gandiva] Migration JIT engine from MCJIT to ORC v2

2023-12-06 Thread Antoine Pitrou
Given that MCJIT is deprecated and there doesn't seem to be a downside to the new APIs, migrating to ORC v2 sounds fine to me. Just a question: does it raise the minimum supported LLVM version? Regards Antoine. Le 05/12/2023 à 03:35, Yue Ni a écrit : Hi there, I'd like to initiate a dis

Re: CIDR 2024

2023-12-06 Thread Antoine Pitrou
For the sake of clarity, this seems to be about the Conference on Innovative Data Systems Research: https://www.cidrdb.org/cidr2024/ Regards Antoine. On 06/12/2023 at 01:15, Wes McKinney wrote: I will also be there. On Mon, Dec 4, 2023 at 12:58 PM Tony Wang wrote: I am Get

Re: Documentation of Breaking Changes

2023-11-21 Thread Antoine Pitrou
Hello, On 21/11/2023 at 22:59, Chris Thomas wrote: I apologize if this is not the appropriate venue for this request; if that's the case, please let me know where I should be asking: Earlier this month Dependabot flagged a security vulnerability with PyArrow which prompted us to do an upgr

Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-20 Thread Antoine Pitrou
I also agree that an informal spec "how to efficiently transfer Arrow data over HTTP" makes sense. Probably with several aspects: - one-shot GET data - streaming GET - one-shot PUT or POST - streaming POST - non-Arrow prologue and epilogue (for example JSON-based metadata) - conventions for w

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Antoine Pitrou
Welcome Raúl, we're glad to have you! Regards Antoine. On 13/11/2023 at 20:27, Andrew Lamb wrote: The Project Management Committee (PMC) for Apache Arrow has invited Raúl Cumplido to become a PMC member and we are pleased to announce that Raúl Cumplido has accepted. Please join me in c

Re: decimal64

2023-11-09 Thread Antoine Pitrou
ormat/CanonicalExtensions.html On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. On 09/11/2023 at 17:54, Curt Hagenlocher wrote: If Arrow had a deci

Re: decimal64

2023-11-09 Thread Antoine Pitrou
Nov 9, 2023, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. On 09/11/2023 at 17:54, Curt Hagenlocher wrote: If Arrow had a decimal64 type, someone could ch
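The int64 suggestion in the message above can be illustrated with a minimal sketch. It assumes a fixed scale of 4 fractional digits and uses only the Python standard library; the helper names `to_scaled_int` and `from_scaled_int` are hypothetical and are not part of Arrow or PyArrow.

```python
from decimal import Decimal

SCALE = 4  # assumed fixed number of fractional digits for the money column
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def to_scaled_int(value: Decimal, scale: int = SCALE) -> int:
    """Convert a decimal amount to an integer count of 10**-scale units."""
    scaled = value.scaleb(scale)  # multiply by 10**scale without float rounding
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    result = int(scaled)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in int64 at scale {scale}")
    return result


def from_scaled_int(raw: int, scale: int = SCALE) -> Decimal:
    """Convert a stored integer back to a decimal amount."""
    return Decimal(raw).scaleb(-scale)


# Addition and subtraction work directly on the stored integers.
total = to_scaled_int(Decimal("19.99")) + to_scaled_int(Decimal("0.01"))
```

Because every value shares the same scale, sums and differences need no rescaling. Multiplying two stored values would yield a result at scale 8, which is why the approach suits columns that are only added or compared.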

Re: decimal64

2023-11-09 Thread Antoine Pitrou
money column knowing that there are edge cases where they may get an undesired result. On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou wrote: On 09/11/2023 at 17:23, Curt Hagenlocher wrote: Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent

Re: decimal64

2023-11-09 Thread Antoine Pitrou
On 09/11/2023 at 17:23, Curt Hagenlocher wrote: Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent it from being stored in one so that you can describe the column as "decimal(18, 4)"? That's what we do for other decimal types, see PyArrow below: ```
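The PyArrow snippet referred to in this message is cut off in the archive listing and is not reproduced here. As a rough stand-in, the sketch below uses only the Python standard library to show the arithmetic behind the question: whether a value fits a declared decimal(precision, scale), and separately whether its scaled integer fits in 64 bits. The function `fits_decimal` is a hypothetical helper, not a PyArrow API.

```python
from decimal import Decimal

INT64_MAX = 2**63 - 1


def fits_decimal(value: Decimal, precision: int, scale: int) -> bool:
    """Return True if `value` is exactly representable as decimal(precision, scale)."""
    if not value.is_finite():
        return False
    scaled = value.scaleb(scale)  # value * 10**scale
    if scaled != scaled.to_integral_value():
        return False  # more fractional digits than `scale` allows
    # The unscaled integer may have at most `precision` decimal digits.
    return abs(int(scaled)) < 10**precision


value = Decimal("111111111111111")  # the 15-digit value from the message
unscaled = int(value.scaleb(4))     # integer that would be stored at scale 4

fits_in_64_bits = abs(unscaled) <= INT64_MAX
fits_declared_type = fits_decimal(value, precision=18, scale=4)
```

At scale 4 the unscaled integer has 19 digits, so it is within the int64 range but exceeds a declared precision of 18, which is the situation the message asks about.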

Re: [VOTE][FORMAT] Bulk ingestion support for Flight SQL

2023-11-09 Thread Antoine Pitrou
For the record, the correct PR link seems to be https://github.com/apache/arrow/pull/38385 On 08/11/2023 at 21:49, David Li wrote: Hello, Joel Lubi has proposed adding bulk ingestion support to Arrow Flight SQL [1]. This provides a path for uploading an Arrow dataset to a Flight SQL ser

CVE-2023-47248: PyArrow, PyArrow: Arbitrary code execution when loading a malicious data file

2023-11-08 Thread Antoine Pitrou
Severity: critical Affected versions: - PyArrow 0.14.0 through 14.0.0 Description: Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 allows arbitrary code execution. An application is vulnerable if it reads Arrow

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 20:02, Benjamin Kietzman wrote: Is this buffer lengths buffer only present if the array type is Utf8View? IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the specia

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 18:59, Dewey Dunnington wrote: That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote: The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). By the wa

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote: > A potential alternative might be to allow any ArrowArray to declare > its buffer sizes in array->buffers[array->n_buffers], perhaps with a > new flag in schema->flags to advertise that capability. That sounds a bit hackish to me. I'd rather l
