Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Antoine Pitrou
On 22/07/2024 at 21:25, Joel Lubinitsky wrote: If Canonical Extensions had existed at the time, I think there's a chance we may have ended up with int32 Date as a first class type and int64 MillisecondDate as a Canonical Extension type. Agreed. Are there any lessons we've learned from

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
> find now that new types should be implemented as extension types if > possible for these (and perhaps other) reasons. > > > On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou wrote: > > > > > > Agreed with Felipe. This is meant for communicating with non-Arrow type &g

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
provisions on the specification that might make this impossible. -dewey [1] https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < anto...@python.org> wrote: Hi J

Re: [DISCUSS] Split Go release process

2024-07-18 Thread Antoine Pitrou
Hi Kou, On 18/07/2024 at 11:33, Sutou Kouhei wrote: Here is my idea of how to proceed with this: 1. Extract go/ in apache/arrow to apache/arrow-go like apache/arrow-rs * Filter go/ related commits from apache/arrow and create apache/arrow-go with them like we did for apache/arrow-rs

Re: [Discuss][C++] Switch to mimalloc by default?

2024-07-16 Thread Antoine Pitrou
Hello, Thanks all for this discussion. Given that there was no strong argument against doing this, I decided to move forward and the change was made in https://github.com/apache/arrow/pull/40875 Regards Antoine. On Wed, 5 Jun 2024 17:18:36 +0200 Antoine Pitrou wrote: > Hello, > >

Re: Understanding possible synergies between arrow & zarr communities?

2024-07-16 Thread Antoine Pitrou
Hi Carl, On 08/07/2024 at 18:43, Carl Boettiger wrote: As an observer to both communities, I'm interested in whether there is or might be more communication between the Pangeo community's focus on Zarr serialization with what the Arrow team has done with Parquet. I recognize that these are

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Antoine Pitrou
Hi Joel, This looks good to me in principle. Can you split the spec and the implementation(s) into separate PRs? Regards Antoine. On 16/07/2024 at 13:18, Joel Lubinitsky wrote: Hi Arrow devs, I'm working on adding an extension type for 8-bit booleans, and wanted to start a

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-16 Thread Antoine Pitrou
/docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Monday, July 15th, 2024 at 07:59, Antoine Pitrou wrote: No, because these markers also communi

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-15 Thread Antoine Pitrou
No, because these markers also communicate the information to other implementations of S3 abstractions. An example of this is: https://docs.cyberduck.io/protocols/s3/#folders Regards Antoine. On 13/07/2024 at 07:15, Aldrin wrote: ...then I still expect the directory /foo to exist

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Antoine Pitrou
Hi, On 12/07/2024 at 12:21, Hyunseok Seo wrote: *### Why Maintain Empty Directory Markers?* From what I understand, object stores like S3 do not have a concept of directories. The motivation behind maintaining these markers could be to manage the object store as if it were a traditional
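To make the marker discussion in this thread concrete, here is a minimal sketch of why an explicit marker changes what a directory-style abstraction reports once the last file is deleted. It is not Arrow's S3FileSystem code: the `store` dict, `make_dir` and `dir_exists` are hypothetical stand-ins, and the marker convention assumed here (a zero-byte object whose key ends in `/`) is the one described in the AWS folder documentation linked in the thread.

```python
def make_dir(store, path):
    """Create an explicit directory marker: a zero-byte object whose key ends in '/'."""
    store[path.rstrip("/") + "/"] = b""


def dir_exists(store, path):
    """A directory 'exists' if it has a marker or if any object key lies under it."""
    prefix = path.rstrip("/") + "/"
    return any(key.startswith(prefix) for key in store)


# Hypothetical in-memory stand-in for a bucket: key -> object bytes.
store = {}
make_dir(store, "foo")
store["foo/data.parquet"] = b"..."
del store["foo/data.parquet"]          # delete the only file under foo/
exists_with_marker = dir_exists(store, "foo")

# Same sequence without a marker: the directory vanishes with its last file.
bare = {"bar/data.parquet": b"..."}
del bare["bar/data.parquet"]
exists_without_marker = dir_exists(bare, "bar")

print(exists_with_marker, exists_without_marker)
```

Because the marker is an ordinary object in the bucket, any other tool that follows the same convention would see the empty directory as well, which is the interoperability point raised in the replies.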

Re: [DISCUSS] Statistics through the C data interface

2024-07-01 Thread Antoine Pitrou
Hmmm, I struggle to understand why a `(int32, utf8)` tuple for statistic keys would be any simpler to implement than either `int32` *or* `utf8` *or* `dictionary(int32, utf8)`. Let's keep in mind that we would like to keep things simple for consumers and producers of statistics. We should

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project. OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO. Regards Antoine. On 28/06/2024 at 21:52,

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
I'll note that PyArrow also allows defining user-defined functions and they are vectorized (the function arguments can be PyArrow arrays or scalars, depending on the context in which a function is being executed): https://arrow.apache.org/docs/python/compute.html#user-defined-functions My

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-12 Thread Antoine Pitrou
On 12/06/2024 at 04:45, Sutou Kouhei wrote: It seems that we need to disable MI_OVERRIDE explicitly to not define malloc() in libmimalloc.so: https://github.com/microsoft/mimalloc/blob/03020fbf81541651e24289d2f7033a772a50f480/CMakeLists.txt#L10 Yes, that's what we do when building the

Re: Unsupported/Other Type

2024-06-11 Thread Antoine Pitrou
Sorry, I had forgotten to comment on this. I think this is generally a good idea, but it would obviously need more eyes on it :-) Can other people go and take a look at David's PR below? On 25/05/2024 at 04:47, David Li wrote: I've put up a draft PR here:

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
On 11/06/2024 at 10:35, Sutou Kouhei wrote: Hi, In <2a32f61c-dd22-4f3f-bc98-822dcb6b0...@python.org> "Re: [Discuss][C++] Switch to mimalloc by default?" on Tue, 11 Jun 2024 10:21:12 +0200, Antoine Pitrou wrote: I was thinking about find_package(). Good to know

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
On 11/06/2024 at 10:01, Sutou Kouhei wrote: 2. Is it OK that we add support for system mimalloc? Hmm... that sounds legitimate, but with the caveat that a system mimalloc can override the standard malloc/free functions. Would that affect an application using Arrow C++? Are you saying

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-10 Thread Antoine Pitrou
Hi Kou, On 09/06/2024 at 09:16, Sutou Kouhei wrote: Questions: 1. Do we need to keep jemalloc support? Compatibility? Can we drop support for jemalloc to decrease maintenance cost? I'm not sure there's much maintenance cost. I expect some people might prefer jemalloc, and perhaps

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
On 09/06/2024 at 08:33, Sutou Kouhei wrote: Fields:
| Name   | Type          | Comments |
|--------|---------------|----------|
| column | utf8          | (2)      |
| key    | utf8 not null | (3)      |
1. Should the key be

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
On 09/06/2024 at 09:01, Sutou Kouhei wrote: Hi, One thing that a plain integer makes more difficult is representing non-standard statistics. For example some engine might want to expose elaborate quantile-based statistics even if it is not officially defined here. With a `utf8` or

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou
On 07/06/2024 at 18:30, Felipe Oliveira Carvalho wrote: On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote: On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote: I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both

Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Antoine Pitrou
On 07/06/2024 at 04:27, Felipe Oliveira Carvalho wrote: I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou
Hi Kou, Thanks for pushing for this! On 06/06/2024 at 11:27, Sutou Kouhei wrote: 4. Standardize Apache Arrow schema for statistics and transmit statistics via separated API call that uses the C data interface [...] I think that 4. is the best approach in these candidates. I

[Discuss][C++] Switch to mimalloc by default?

2024-06-05 Thread Antoine Pitrou
Hello, Arrow C++ features a MemoryPool abstraction that allows using different allocators interchangeably. Several MemoryPool implementations are provided with Arrow C++ (though one can also build their own): - a jemalloc-based implementation, currently the default on Linux - a

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou
(Gang Wu, Antoine Pitrou, Wes McKinney) 9x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko Driesprong, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen Zhang, Rok Mihevc) Arrow: 6x +1 binding (Micah Kornfield, Antoine Pitrou, Andy Grove, Raúl Cumplido, Wes McKinney

Re: [C++] Thread deadlock in ObjectOutputStream

2024-05-29 Thread Antoine Pitrou
Hi Li! Sorry for the delay. It seems the problem lies here: https://github.com/apache/arrow/blob/9f5899019d23b2b1eae2fedb9f6be8827885d843/cpp/src/arrow/filesystem/s3fs.cc#L1858 The Future is marked finished with the ObjectOutputStream's mutex taken, and the Future's callback then triggers a
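The pattern described in this message (finishing a future while still holding a mutex) can be sketched in a few lines. This is a toy model, not the Arrow C++ code: `Stream`, `touch`, `finish_holding_lock` and `finish_after_unlock` are hypothetical names, and since the snippet is cut off before saying what the callback triggers, the sketch assumes the callback re-enters a method that needs the same non-reentrant mutex. A non-blocking acquire is used so the sketch reports the problem instead of hanging.

```python
import threading


class Stream:
    """Toy stand-in for an output stream guarded by a non-reentrant mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = []
        self.closed = False
        self.would_deadlock = False

    def on_finished(self, callback):
        self._callbacks.append(callback)

    def touch(self):
        # Any method that needs the mutex. Real code would block here; the
        # sketch uses a non-blocking acquire so it can record the problem.
        if not self._lock.acquire(blocking=False):
            self.would_deadlock = True
            return
        try:
            self.closed = True
        finally:
            self._lock.release()

    def finish_holding_lock(self):
        # Problematic pattern: callbacks run while the mutex is still held.
        with self._lock:
            for cb in self._callbacks:
                cb()

    def finish_after_unlock(self):
        # Safer pattern: snapshot state under the lock, run callbacks outside.
        with self._lock:
            callbacks = list(self._callbacks)
        for cb in callbacks:
            cb()


bad = Stream()
bad.on_finished(bad.touch)
bad.finish_holding_lock()

good = Stream()
good.on_finished(good.touch)
good.finish_after_unlock()

print(bad.would_deadlock, good.would_deadlock)
```

The usual remedy, under the assumption above, is the second method: collect whatever the callbacks need while the lock is held, release it, and only then complete the future.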

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Antoine Pitrou
+1 (binding). Thanks for taking this up, Rok! Regards Antoine. On 29/05/2024 at 16:14, Rok Mihevc wrote: # sending this to both dev@arrow and dev@parquet Hi all, Following the ML discussion [1] I would like to propose a vote for parquet-cpp issues to be moved from Parquet Jira [2] to

Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Antoine Pitrou
Is it somehow possible to be a "member" of this account to indicate that we have PMC status, or is that not possible within the LinkedIn membership/permissions model? On 24/05/2024 at 18:04, Ian Cook wrote: Following the discussion [1] earlier this year about the status of the Apache

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
, "min", > > >"byte_width" and "distinct_count" but users can also use > > >application specific keys. > > > 3. If true, then the value is approximate or best-effort. > > > > > > VALUE_SCHEMA is a dense union with

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
On 23/05/2024 at 16:09, Felipe Oliveira Carvalho wrote: Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou
Hi Kou, I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer as metadata value and decreeing it immortal doesn't sound like a good design decision. Why not simply pass the statistics ArrowArray separately in your

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-05-14 Thread Antoine Pitrou
I think these flags should be advisory and consumers should be free to ignore them. However, some consumers apparently would benefit from them to more faithfully represent the producer's intention. For example, in Arrow C++, we could perhaps have an ImportDatum function whose actual return

Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) On 19/04/2024 at 22:22, Rok Mihevc wrote: Hi all, Following initial requests [1][2] and recent tangential ML discussion [3] I would like to propose a vote to add language for UUID canonical extension type to CanonicalExtensions.rst as in PR [4] and written below. A draft C++

Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259 requirement and the 3 current String types allowed. Regards Antoine. On 30/04/2024 at 19:26, Rok Mihevc wrote: Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC-8259

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
o we could use this in that context). I think that I would still prefer a canonical extension type (with storage type null) over a new dedicated type. On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou wrote: Ah! Well, I think this could be an interesting proposal, but someone should put a mor

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
Ah! Well, I think this could be an interesting proposal, but someone should put a more formal proposal, perhaps as a draft PR. Regards Antoine. On 17/04/2024 at 11:57, David Li wrote: For an unsupported/other extension type. On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote: What

Re: AW: Personal feedback on your last release on Apache Arrow ADBC 0.11.0

2024-04-17 Thread Antoine Pitrou
Out of curiosity, did you notice this by chance or do you have some kind of script that processes ASF mailing-list archives for possible voting irregularities? Regards Antoine. On 17/04/2024 at 10:44, Christofer Dutz wrote: When looking at whimsy, I can’t see any person named Sutou

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
ne-off nominal types for very specific use-cases? — Felipe On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. On 10/04/2024 at

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. On 10/04/2024 at 22:55, Wes McKinney wrote: In the past we have discussed adding a canonical

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. On 10/04/2024 at 22:55, Wes McKinney wrote: In the past we have discussed adding a canonical type for UUID and

Re: [RFC] Enabling data frames in disaggregated shared memory

2024-04-10 Thread Antoine Pitrou
Hello John, Arrow IPC files can be backed quite naturally by shared memory, simply by memory-mapping them for reading. So if you have some pieces of shared memory containing Arrow IPC files, and they are reachable using a filesystem mount point, you're pretty much done. You can see an

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-09 Thread Antoine Pitrou
It seems that perhaps this discussion should be rebooted for each individual component, one at a time? Let's start with something simple and obvious, with some frequent contribution activity, such as perhaps Go? On 09/04/2024 at 14:27, Joris Van den Bossche wrote: I am also in favor

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-07 Thread Antoine Pitrou
On 28/03/2024 at 21:42, Jacob Wujciak wrote: For Arrow C++ bindings like Arrow R and PyArrow having distinct versions would require additional work to both enable the use of different versions and ensure version compatibility is monitored and potentially updated if needed. We could simply

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Thanks. The Arrow spec does support multiple union members with the same type, but not all implementations do. The C++ implementation should support it, though to my surprise we do not seem to have any tests for it. If the Java implementation doesn't, then you can probably open an issue

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Can you explain what ADT means? On 02/04/2024 at 11:31, Finn Völkel wrote: Hi, my question primarily concerns the union layout described at https://arrow.apache.org/docs/format/Columnar.html#union-layout There are two ways to use unions: - polymorphic vectors (world 1) - ADT

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Antoine Pitrou
Regardless of whether they have different compression ratios, it doesn't explain why you would want a different compression *algorithm* altogether. The choice of a compression algorithm should basically be driven by two concerns: the acceptable space/time tradeoff (do you want to minimize

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Antoine Pitrou
Hello Andrei, On 23/03/2024 at 13:23, Andrei Lazăr wrote: At this very moment, specifying different compression algorithms per column is supported and in my use case it is extremely helpful, as I have some columns (mostly containing floats), for which a compression algorithm like Snappy

Re: ADBC - OS-level driver manager

2024-03-20 Thread Antoine Pitrou
Also, with ADBC driver implementations currently in flux (none of them has reached the "stable" status in https://arrow.apache.org/adbc/main/driver/status.html), it might be a disservice to users to implicitly fetch drivers from potentially outdated DLLs on the current system. Regards

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Antoine Pitrou
Congratulations Bryce, and keep up the good work! Regards Antoine. On 18/03/2024 at 03:21, Nic Crane wrote: On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions!

Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-04 Thread Antoine Pitrou
I didn't run the release script but I'm +1 on this (binding). Regards Antoine. On 04/03/2024 at 10:05, Raúl Cumplido wrote: Hi, I would like to propose the following release candidate (RC0) of Apache Arrow version 15.0.1. This is a release consisting of 37 resolved GitHub issues[1].

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
want as many parties in the community as possible to be part of this. Thanks everyone. --Matt On Tue, Feb 27, 2024 at 12:48 PM Antoine Pitrou wrote: Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adopted as an Arrow spec

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adopted as an Arrow spec. Regards Antoine. On 27/02/2024 at 18:35, Matt Topol wrote: Hey all, I'd like to propose a vote for us to officially adopt the protocol described

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-14 Thread Antoine Pitrou
for today's bi-weekly call. Thanks, Raúl On Tue, 13 Feb 2024 at 23:20, Antoine Pitrou wrote: Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. On 13/02/2024 at 21:10, Dane Pitkin wrote: Hi all, Arrow Java identified

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-13 Thread Antoine Pitrou
Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. On 13/02/2024 at 21:10, Dane Pitkin wrote: Hi all, Arrow Java identified an issue[1] in the 15.0.0 release. There is an undefined symbol in the dataset module that

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-13 Thread Antoine Pitrou
ed semantics? If so, is there a way to include the original service in the list of locations without the implied precedence? Thanks, Joel On Mon, Feb 12, 2024 at 11:52 James Duong wrote: This seems like a good idea, and also improves consistency with clients that erroneously assumed that th

Re: [ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-12 Thread Antoine Pitrou
Hi Dewey, On 12/02/2024 at 15:01, Dewey Dunnington wrote: Apache Arrow nanoarrow is a small C library for building and interpreting Arrow C Data interface structures with bindings for users of the R programming language. Do you want to reconsider this sentence? It seems nanoarrow is

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-12 Thread Antoine Pitrou
Hello, This looks fine to me. Regards Antoine. On 12/02/2024 at 14:46, David Li wrote: Hello, I'd like to propose a slight update to Flight RPC to make Flight SQL work better in different deployment scenarios. Comments on the doc would be appreciated:

Re: [DISCUSS] Proposal to expand Arrow Communications

2024-02-07 Thread Antoine Pitrou
I think we should find a proper descriptive name for the "high-performance protocol", because "high-performance" is vague and context-dependent, and also spreads unnecessary confusion about existing alternatives such as regular Arrow IPC. I would for example propose "Dissociated Arrow IPC"

Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-27 Thread Antoine Pitrou
My 2 cents: I don't understand what an open source project gains by publishing on a microblogging platform. As for Twitter specifically, its recent governance changes would be a good reason for terminating the @ApacheArrow account, IMHO. Regards Antoine. On 27/01/2024 at 23:06, Bryce

Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-25 Thread Antoine Pitrou
Hello, My own answers: 1) isDelta should be true only when a delta is being transmitted (to be appended to the existing dictionary with the same id); it should be false when a full dictionary is being transmitted (to replace the existing dictionary with the same id, if any) 2) yes, it
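The replace-versus-append rule in answer 1) can be sketched as reader-side state handling. This is an illustration only, not code from any Arrow implementation: `apply_dictionary_batch` and the plain `dict` of Python lists are hypothetical, and raising an error for a delta that targets an unknown id is an assumption of this sketch rather than something stated in the message.

```python
def apply_dictionary_batch(dictionaries, dict_id, values, is_delta):
    """Update the reader-side dictionary state for one dictionary batch."""
    if is_delta:
        # Delta: append the new values to the existing dictionary with this id.
        if dict_id not in dictionaries:
            # Assumption of this sketch: a delta needs something to extend.
            raise KeyError(f"delta for unknown dictionary id {dict_id}")
        dictionaries[dict_id] = dictionaries[dict_id] + list(values)
    else:
        # Full dictionary: replace the existing dictionary with this id, if any.
        dictionaries[dict_id] = list(values)
    return dictionaries


state = {}
apply_dictionary_batch(state, 0, ["a", "b"], is_delta=False)   # initial dictionary
apply_dictionary_batch(state, 0, ["c"], is_delta=True)         # delta is appended
after_delta = list(state[0])
apply_dictionary_batch(state, 0, ["x"], is_delta=False)        # full replacement
after_replace = list(state[0])

print(after_delta, after_replace)
```

Indices already written against the first two entries stay valid after the delta, because appending never renumbers existing values; a replacement gives no such guarantee.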

Re: [DataFusion] New Blog Post -- DataFusion 34.0

2024-01-23 Thread Antoine Pitrou
Impressive, thank you! On 23/01/2024 at 14:06, Andrew Lamb wrote: If anyone is interested, here is a new blog post about the last 6 months in DataFusion[1] and where we are heading this year. Andrew [1]: https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/

Re: [DISC] Improve Arrow Release verification process

2024-01-19 Thread Antoine Pitrou
Well, if the main objective is to just follow the ASF Release guidelines, then our verification process can be simplified drastically. The ASF indeed just requires: """ Every ASF release MUST contain one or more source packages, which MUST be sufficient for a user to build and test the

Re: [VOTE] Release Apache Arrow 15.0.0 - RC1

2024-01-17 Thread Antoine Pitrou
Go verification fails on Ubuntu 22.04: ``` # google.golang.org/grpc ../../gopath/pkg/mod/google.golang.org/grpc@v1.58.3/server.go:2096:14: undefined: atomic.Int64 note: module requires Go 1.19 # github.com/apache/arrow/go/v15/arrow/avro arrow/avro/reader_types.go:594:16: undefined:

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Antoine Pitrou
Hi, For now, I would suggest that each implementation decides on their own strategy, because we don't have a clear idea of which is better (and extension types are probably not getting a lot of use yet). Regards Antoine. On 13/12/2023 at 17:39, Benjamin Kietzman wrote: The main

Re: Java, dictionary ids and schema equality

2023-12-09 Thread Antoine Pitrou
Hi Curt, Yes, it's a problem in the Java implementation of these tests. Ideally this should be fixed, but doing so would require some amount of scaffolding. Regards Antoine. On 09/12/2023 at 21:47, Curt Hagenlocher wrote: I've (mostly) fixed the C# implementation of dictionary IPC but

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Antoine Pitrou
+1 (binding) On 08/12/2023 at 20:42, David Li wrote: Let's start a formal vote just so we're on the same page now that we've discussed a few things. I would like to propose we remove 'experimental' from Flight SQL and make it stable: - Remove the 'experimental' option from the Protobuf

Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2023-12-06 Thread Antoine Pitrou
Hi, While this looks like a nice start, I would expect more precise recommendations for writing non-trivial services. Especially, one question is how to send both an application-specific POST request and an Arrow stream, or an application-specific GET response and an Arrow stream. This

Re: [Discussion][Gandiva] Migration JIT engine from MCJIT to ORC v2

2023-12-06 Thread Antoine Pitrou
Given that MCJIT is deprecated and there doesn't seem to be a downside to the new APIs, migrating to ORC v2 sounds fine to me. Just a question: does it raise the minimum supported LLVM version? Regards Antoine. On 05/12/2023 at 03:35, Yue Ni wrote: Hi there, I'd like to initiate a

Re: CIDR 2024

2023-12-06 Thread Antoine Pitrou
For the sake of clarity, it seems this is talking about the Conference on Innovative Data Systems Research: https://www.cidrdb.org/cidr2024/ Regards Antoine. On 06/12/2023 at 01:15, Wes McKinney wrote: I will also be there. On Mon, Dec 4, 2023 at 12:58 PM Tony Wang wrote: I am Get

Re: Documentation of Breaking Changes

2023-11-21 Thread Antoine Pitrou
Hello, On 21/11/2023 at 22:59, Chris Thomas wrote: I apologize if this is not the appropriate venue for this request; if that's the case, please let me know where I should be asking: Earlier this month Dependabot flagged a security vulnerability with PyArrow which prompted us to do an

Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-20 Thread Antoine Pitrou
I also agree that an informal spec "how to efficiently transfer Arrow data over HTTP" makes sense. Probably with several aspects: - one-shot GET data - streaming GET - one-shot PUT or POST - streaming POST - non-Arrow prologue and epilogue (for example JSON-based metadata) - conventions for

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Antoine Pitrou
Welcome Raul, we're glad to have you! Regards Antoine. On 13/11/2023 at 20:27, Andrew Lamb wrote: The Project Management Committee (PMC) for Apache Arrow has invited Raúl Cumplido to become a PMC member and we are pleased to announce that Raúl Cumplido has accepted. Please join me in

Re: decimal64

2023-11-09 Thread Antoine Pitrou
/CanonicalExtensions.html On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. On 09/11/2023 at 17:54, Curt Hagenlocher wrote: If Arrow had a decimal64 type

Re: decimal64

2023-11-09 Thread Antoine Pitrou
, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. On 09/11/2023 at 17:54, Curt Hagenlocher wrote: If Arrow had a decimal64 type, someone could choose to use

Re: decimal64

2023-11-09 Thread Antoine Pitrou
column knowing that there are edge cases where they may get an undesired result. On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou wrote: On 09/11/2023 at 17:23, Curt Hagenlocher wrote: Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent it

Re: decimal64

2023-11-09 Thread Antoine Pitrou
On 09/11/2023 at 17:23, Curt Hagenlocher wrote: Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent it from being stored in one so that you can describe the column as "decimal(18, 4)"? That's what we do for other decimal types, see PyArrow below: ```
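The PyArrow session referred to above is cut off in this archive. As an independent illustration of the rule being discussed, the sketch below uses only Python's standard `decimal` module (not PyArrow); `fits_decimal` is a hypothetical helper. It checks whether a value can be stored exactly as decimal(precision, scale): the quoted value has 15 integer digits, so with a scale of 4 it needs 19 digits in total and does not fit a precision of 18.

```python
from decimal import Decimal


def fits_decimal(value, precision, scale):
    """Return True if `value` can be stored exactly as decimal(precision, scale)."""
    # Express the value as a count of 10**-scale units.
    scaled = value.scaleb(scale)
    if scaled != scaled.to_integral_value():
        return False  # more fractional digits than `scale` allows
    # The unscaled integer must have at most `precision` decimal digits.
    return abs(int(scaled)) < 10 ** precision


big = Decimal("111111111111111")        # 15 integer digits, as in the quoted example
fits_18_4 = fits_decimal(big, 18, 4)    # needs 15 + 4 = 19 digits
fits_19_4 = fits_decimal(big, 19, 4)
too_fine = fits_decimal(Decimal("1.23456"), 18, 4)  # 5 fractional digits

print(fits_18_4, fits_19_4, too_fine)
```

The unscaled integer in this example is still far below the int64 limit, which is the tension in the thread: the storage could hold the number, but the declared precision rules it out.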

Re: [VOTE][FORMAT] Bulk ingestion support for Flight SQL

2023-11-09 Thread Antoine Pitrou
For the record, the correct PR link seems to be https://github.com/apache/arrow/pull/38385 On 08/11/2023 at 21:49, David Li wrote: Hello, Joel Lubi has proposed adding bulk ingestion support to Arrow Flight SQL [1]. This provides a path for uploading an Arrow dataset to a Flight SQL

CVE-2023-47248: PyArrow: Arbitrary code execution when loading a malicious data file

2023-11-08 Thread Antoine Pitrou
Severity: critical Affected versions: - PyArrow 0.14.0 through 14.0.0 Description: Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 allows arbitrary code execution. An application is vulnerable if it reads Arrow

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 20:02, Benjamin Kietzman wrote: Is this buffer lengths buffer only present if the array type is Utf8View? IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 18:59, Dewey Dunnington wrote: That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote: The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). By the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote: > A potential alternative might be to allow any ArrowArray to declare > its buffer sizes in array->buffers[array->n_buffers], perhaps with a > new flag in schema->flags to advertise that capability. That sounds a bit hackish to me. I'd rather

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Antoine Pitrou
Hello, We might want to keep the variadic buffers at the end and instead export the buffer sizes as buffer #2? Though that's mostly stylistic... Regards Antoine. On 25/10/2023 at 18:36, Benjamin Kietzman wrote: Hello all, The C ABI does not store buffer lengths explicitly, which

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Antoine Pitrou
Welcome Xuwei! On 23/10/2023 at 05:28, Sutou Kouhei wrote: On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions!

Re: [Format] C Data Interface integration testing

2023-10-19 Thread Antoine Pitrou
active the community is being, I'm reasonably confident that they'll come to it soon :) Regards Antoine. On 26/09/2023 at 14:46, Antoine Pitrou wrote: Hello, We have added some infrastructure for integration testing of the C Data Interface between Arrow implementations. We are now testing

Re: Apache Arrow file format

2023-10-18 Thread Antoine Pitrou
The fact that they describe Arrow and Feather as distinct formats (they're not!) with different characteristics is a bit of a bummer. On 18/10/2023 at 22:20, Andrew Lamb wrote: If you are looking for a more formal discussion and empirical analysis of the differences, I suggest reading "A

Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Antoine Pitrou
+1 On 18/10/2023 at 19:02, Benjamin Kietzman wrote: Hello all, I propose "vu" and "vz" as format strings for the Utf8View and BinaryView types in the Arrow C data interface [1]. The vote will be open for at least 72 hours. [ ] +1 - I'm in favor of these new C data format strings [ ] +0 [ ]

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-14 Thread Antoine Pitrou
Welcome to the PMC, Jon! On 14/10/2023 at 19:42, David Li wrote: Congrats Jon! On Sat, Oct 14, 2023, at 13:25, Ian Cook wrote: Congratulations Jonathan! On Sat, Oct 14, 2023 at 13:24 Andrew Lamb wrote: The Project Management Committee (PMC) for Apache Arrow has invited Jonathan Keane

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-11 Thread Antoine Pitrou
 PM Antoine Pitrou wrote: Hi Alva, I'll let others give their opinions on the repo. Regards Antoine. On 10/10/2023 at 19:25, Alva Bandy wrote: Hi Antoine, Thanks for the reply. It would be great to get the Swift implementation added to the integration test. I have a task for adding

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou
not looked into Julia’s implementation. Thank you, Alva Bandy On 2023/10/10 08:54:30 Antoine Pitrou wrote: Hello Alva, This is a reasonable request, but it might come with its own drawbacks as well. One significant drawback is that adding the Swift implementation to the cross-implementation integration

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou
Hello Alva, This is a reasonable request, but it might come with its own drawbacks as well. One significant drawback is that adding the Swift implementation to the cross-implementation integration tests will be slightly more complicated. It is very important that all Arrow implementations

Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Antoine Pitrou
+1 from me. But I also reiterate my plea that these existing parsers get fixed so as to entirely validate the format string instead of stopping early. Regards Antoine. On 06/10/2023 at 23:26, Felipe Oliveira Carvalho wrote: Hello, I'm writing to propose "+vl" and "+vL" as format
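To make the plea about validating the whole format string concrete, here is a minimal Python sketch, not taken from any Arrow implementation. The table holds only the format strings named in these threads ("vu", "vz", "+vl", "+vL") plus the existing "+l" and "+L" list formats; the function names, the type names on the right-hand side, and the test input "+lX" are made up for illustration.

```python
# Format strings discussed in these threads, plus the existing list formats.
FORMATS = {
    "vu": "utf8_view",
    "vz": "binary_view",
    "+l": "list",
    "+L": "large_list",
    "+vl": "list_view",
    "+vL": "large_list_view",
}

# Try longer prefixes first so that the match does not depend on dict order.
_PREFIXES = sorted(FORMATS, key=len, reverse=True)


def parse_lenient(fmt):
    """Return a type name as soon as a known prefix matches (stops early)."""
    for prefix in _PREFIXES:
        if fmt.startswith(prefix):
            return FORMATS[prefix]  # anything after the prefix is ignored
    raise ValueError(f"unknown format string {fmt!r}")


def parse_strict(fmt):
    """Return a type name only if the whole format string is consumed."""
    for prefix in _PREFIXES:
        if fmt.startswith(prefix):
            rest = fmt[len(prefix):]
            if rest:
                raise ValueError(
                    f"trailing characters {rest!r} in format string {fmt!r}"
                )
            return FORMATS[prefix]
    raise ValueError(f"unknown format string {fmt!r}")
```

The lenient variant reports the made-up input "+lX" as a plain list, while the strict variant raises ValueError for it. A parser that stops early can therefore silently misread a format string it does not actually understand, which is the risk the message above asks implementers to close.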

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Antoine Pitrou
);
+} else {
+ type_ = list_view(field);
+}
+ } else {
+return f_parser_.Invalid();
+ }
+}
+ return Status::OK();
}
-- Felipe
On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou wrote: I don't think the parsing will be a problem even in C. It's not like

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Antoine Pitrou
I don't think the parsing will be a problem even in C. It's not like you have to backtrack anyway. +1 from me on Felipe's proposal. Regards Antoine. On 05/10/2023 at 20:33, Felipe Oliveira Carvalho wrote: This mailing list thread is going to be the discussion. The union types also

Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Antoine Pitrou
+1 from me. It might be worth spelling out whether any relationship is expected between the `app_metadata` for a FlightInfo and any of the corresponding `FlightEndpoint`s and `FlightData` chunks. On 12/09/2023 at 17:48, Matt Topol wrote: Hey all, I would like to propose adding a new

Re: [DISCUSS][C++] Raw pointer string views

2023-10-03 Thread Antoine Pitrou
On 03/10/2023 at 01:36, Matt Topol wrote: The cost of conversion is actually significantly higher than the actual overhead of simply accessing the values in either representation, leading to a high potential for bottleneck. For systems like Velox and DuckDB where it's important to be able

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
approach be willing to meet us in the middle and switch to an offset based encoding? This to me feels like it would be the best outcome for the ecosystem as a whole. Kind Regards, Raphael On 02/10/2023 13:50, Antoine Pitrou wrote: On 01/10/2023 at 16:21, Micah Kornfield wrote: I would also

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Antoine Pitrou
Hello, +1 and thanks for working on this! There'll probably be some minor comments to the format PR, but those don't deter from accepting these new layouts into the standard. Regards Antoine. On 29/09/2023 at 14:09, Felipe Oliveira Carvalho wrote: Hello, I'd like to propose adding

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
On 01/10/2023 at 16:21, Micah Kornfield wrote: I would also assert that another way to reduce this risk is to add some prose to the relevant sections of the columnar format specification doc to clearly explain that a raw pointers variant of the layout, while not part of the official spec,

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou
be clearly flagged as being non-Arrow compliant. It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`). But they could also be provided by a distinct library. Regards Antoine. On 28/09/2023 at 09:01, Antoine
