Re: [ANNOUNCE] New Arrow PMC chair: Neal Richardson

2024-10-30 Thread Weston Pace
Congrats Neal!

On Wed, Oct 30, 2024, 4:46 AM Raúl Cumplido  wrote:

> Thanks Andy for your work during last year and thanks Neal for stepping up!
>
> Raúl
>
> On Wed, 30 Oct 2024 at 12:28, Andrew Lamb ()
> wrote:
> >
> > I am pleased to announce that the Arrow Project has a new PMC chair and
> VP
> > as per our tradition of rotating the chair once a year. Andy Grove has
> > resigned and
> > Neal Richardson was duly elected by the PMC and approved unanimously by
> the
> > board.
> >
> > Please join me in congratulating Neal Richardson!
> >
> > Thanks,
> > Andrew
>


[ANNOUNCE] New Arrow committer: Rossi Sun

2024-10-22 Thread Weston Pace
On behalf of the Arrow PMC, I'm happy to announce that Rossi Sun has
accepted an invitation to become a committer on Apache Arrow. Welcome,
and thank you for your contributions!


Re: Query of Arrow Flight SQL with S3 as a storage for parquet files

2024-10-16 Thread Weston Pace
> Do you folks believe Duckdb and Datafusion (latter being similar to spark
sql) will be an overkill?

No, I don't believe it would be overkill.

I also wouldn't compare either one to Spark SQL.  Spark SQL is meant to be
a distributed query engine that typically requires a cluster of some sort
to operate at full performance.  A distributed query engine would probably
be overkill for your situation.

Both DuckDB and DataFusion are meant to be lightweight, embeddable,
single-node (i.e. not distributed) query engine libraries.  These are
probably a good fit for your use case.

-Weston
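
For illustration only (not from the original thread): a minimal sketch of the
embedded, single-node approach described above, using DuckDB's Python API to
read Parquet straight from S3. The endpoint, credentials, bucket, and column
names are placeholders; a DataFusion session would look very similar.

```python
import duckdb

con = duckdb.connect()              # in-process database, no cluster needed
con.execute("INSTALL httpfs")       # enables s3:// paths
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='cloudian.example.com'")   # placeholder endpoint
con.execute("SET s3_access_key_id='...'")               # placeholder credentials
con.execute("SET s3_secret_access_key='...'")

# Run the client's SQL directly against the Parquet files and fetch the
# result as an Arrow table (placeholder bucket and column names).
table = con.execute(
    "SELECT purpose, SUM(flow_usd) AS total "
    "FROM read_parquet('s3://my-bucket/data/*.parquet') "
    "GROUP BY purpose"
).arrow()
print(table)
```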

On Wed, Oct 16, 2024 at 8:17 AM Susmit Sarkar 
wrote:

> Thanks David and Felipe for your help, I will definitely try out and keep
> you folks updated.
>
> Do you folks believe Duckdb and Datafusion (latter being similar to spark
> sql) will be an overkill?
>
> Thanks,
> Susmit
>
> On Wed, Oct 16, 2024 at 8:25 PM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > Hi Susmit,
> >
> > For an example of what David Li is proposing, you can take a look at this
> > project (https://github.com/voltrondata/sqlflite). It's a Flight SQL
> > server
> > (in C++ though) that can forward queries to either SQLite or DuckDB.
> >
> > --
> > Felipe
> >
> > On Wed, Oct 16, 2024 at 10:22 AM David Li  wrote:
> >
> > > If your clients are sending full SQL queries to be executed, and you
> need
> > > to execute them against S3 on the server, why not consider something
> like
> > > Apache DataFusion or DuckDB to implement that part instead of building
> > the
> > > query parser/engine yourself? (There are probably already examples of
> > > wrapping both these projects in Flight SQL floating around.)
> > >
> > > On Wed, Oct 16, 2024, at 21:38, Susmit Sarkar wrote:
> > > > Hi Community Members
> > > >
> > > >
> > > > We are planning to build an Arrow flight server on top of data lying
> in
> > > s3.
> > > >
> > > >
> > > > *Detailed Use Case:*
> > > >
> > > >
> > > > The requirement is that we need to sync data from HDFS to a short-term
> > > > storage, S3 in our case. Basically a DataSync service between cloud
> > > > storages.
> > > >
> > > >
> > > > I have already built the service using Apache Pekko / Akka HDFS & S3
> > > > connectors, and data is in sync with HDFS & S3.
> > > >
> > > >
> > > > Now comes the data reading part for end users. The data is stored in
> > > > Cloudian S3 (Cloudian-managed S3, not AWS) short-term storage in Parquet.
> > > > We want to build a Data-as-a-Service layer on top of the data lying in S3
> > > > and expose API endpoints for clients to query. The data will be short
> > > > term; it may be kept for weeks or months (max 3 months), and use cases
> > > > vary from team to team.
> > > >
> > > >
> > > > So we felt an Apache Arrow Flight SQL server would be best suited for
> > > > our use case, with the client sending a FlightDescriptor object wrapped
> > > > with the SQL query.
> > > >
> > > >
> > > > We would parse the query, query S3 using the AWS S3 SDKs, and return
> > > > the data, but the issue is we will end up building our own query parser,
> > > > which is a bigger task.
> > > >
> > > > Is there any other approach we can try out ?
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Susmit
> > >
> >
>


Re: [C++][ACERO][DATASET] Ordering in ExecPlan

2024-10-04 Thread Weston Pace
> Currently, it is not
> possible to assign the implicit ordering in the scan node. Such an option has
> been added in other nodes [0]. This problem is mentioned here [1]. I
> have started to work on it [2] but I am unsure how to move forward
> because I did not find any clear roadmap about ordering in general.

The original plans for ordering are quite old at this point (from 2021 it
looks like):
https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit?usp=sharing

The goal was to support multi-threaded ordered execution by tagging batches
with a sequential index, allowing operators to introduce jitter, and
resequencing as needed with small serialized critical sections.
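
For illustration only (not Acero source code): a toy sketch of that
resequencing idea — worker threads may deliver batches out of order, and a
small critical section releases them downstream in sequence-index order.

```python
import heapq

class Resequencer:
    """Releases (index, batch) pairs in index order, however they arrive."""

    def __init__(self):
        self._next = 0
        self._pending = []  # min-heap of (sequence index, batch)

    def insert(self, index, batch):
        # In a multi-threaded engine this body would be the small
        # serialized critical section; everything else runs in parallel.
        heapq.heappush(self._pending, (index, batch))
        ready = []
        while self._pending and self._pending[0][0] == self._next:
            ready.append(heapq.heappop(self._pending)[1])
            self._next += 1
        return ready  # deliver these downstream, already in order

seq = Resequencer()
print(seq.insert(1, "batch-1"))  # [] -- batch 0 has not arrived yet
print(seq.insert(0, "batch-0"))  # ['batch-0', 'batch-1']
```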

> This also affects the asof-join node. Since the node relies on ordered data
> and the dataset asserts no implicit ordering, it causes obscure errors in
> threaded execution [3].

Yes, the asof-join node was designed without using the above ordering
concept (I believe it predates the actual implementation of ordering
somewhat, or was at least concurrent with it).  It was initially designed to
only be run in serial execution (serial compute; I/O is always parallel,
IIRC).  I'm not aware of anyone having changed this, but it has been over a
year since I've been actively working on Acero, so I'm not sure I would have
noticed.

> I fixed this node by inserting a SerialSequencingQueue in the asof-join
> node [5] and adding implicit ordering in the dataset. In general I think
> asof-join should require implicit ordering (or any kind of ordering) on
> all inputs.

Yes, something like this is where I hoped the node would eventually end up.

> Could you please review my code? I would appreciate any feedback to help
> improve it. Thanks in advance for your feedback.

I will attempt to look at it over the weekend though maybe someone else
has/will get to it first.  Conceptually, your approach is where I was
hoping things would go.  The scanner, when tagged for ordered output,
should be able to assign an implicit ordering.  The asof join node should
be able to use a sequencer instead of requiring serial execution.

My main concern would be that the asof join node's sidecar thread model
(having a dedicated processing thread) is rather unique to that node and
there were various deadlocks and race conditions encountered getting
everything just right, even for so-called serial execution.  However, as
long as the node's InputReceived is called in-order and one-at-a-time,
which I think is what the SerialSequencingQueue gives you, then perhaps
things are ok.

On Thu, Oct 3, 2024 at 5:27 AM Kamil Tokarek
 wrote:

> Hello,
> I would like to raise the subject of ordering. Currently, it is not
> possible to assign the implicit ordering in the scan node. Such an option has
> been added in other nodes [0]. This problem is mentioned here [1]. I
> have started to work on it [2] but I am unsure how to move forward
> because I did not find any clear roadmap about ordering in general.
>
> This also affects the asof-join node. Since the node relies on ordered data
> and the dataset asserts no implicit ordering, it causes obscure errors in
> threaded execution [3]. The asof-join node also does not sequence the input.
> I fixed this node by inserting a SerialSequencingQueue in the asof-join
> node [5] and adding implicit ordering in the dataset. In general I think
> asof-join should require implicit ordering (or any kind of ordering) on
> all inputs. I created a pull request with the following changes [7]:
>
> 1. Assert implicit ordering with the
> ScanNodeOptions.require_sequenced_output option enabled. [4]
>
> 2. Add a SerialSequencingQueue on the asof-join node inputs and require
> implicit ordering on the input [5]
>
> 3. Modify asof-join node tests to test threaded operation [6]
>
>
> Could you please review my code? I would appreciate any feedback to help
> improve it. Thanks in advance for your feedback.
>
> [0] https://github.com/apache/arrow/pull/34137/commits/bcc1692dbeb5693508ea89e961b4eaf91170d71d
> [1] https://github.com/apache/arrow/issues/34698
> [2] https://github.com/mroz45/arrow/commits/Ordering/
> [3] https://github.com/apache/arrow/issues/41706
>
> [4] https://github.com/mroz45/arrow/commit/7a14586b83641d1bfa1b037f3f2377eb6c911f55
>
> [5] https://github.com/apache/arrow/pull/44083/commits/c8047bb8e3d8c83f12070507f3cdc43cb6ee6152
>
> [6] https://github.com/apache/arrow/pull/44083/commits/59da79331aefda3fc434e74eb1458cd0e195c879
>
> [7] https

Re: [ANNOUNCE] New Arrow committer: Will Ayd

2024-10-01 Thread Weston Pace
Congratulations Will

On Tue, Oct 1, 2024, 2:25 PM Bryce Mecum  wrote:

> Congrats Will!
>
> On Tue, Oct 1, 2024 at 9:55 AM Dewey Dunnington
>  wrote:
> >
> > On behalf of the Arrow PMC, I'm happy to announce that Will Ayd has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> > and thank you for your contributions!
> >
> > -dewey
>


Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Weston Pace
It also seems that two variations of the variant encoding are being
discussed.  The original spec, currently housed in Spark, creates a variant
array in row-major order, that is, each element in the array is stored
contiguously.  So, if you have objects like `{"a": 7, "b": 3}` then the
values for `a` and `b` will be co-located.

There is also a shredded variant which, as I understand it, is not yet
fully designed, where a single array is stored in multiple buffers, one per
field.  This provides better compression and faster field extraction
and is more favorable to the Parquet crowd.

I think I could potentially see an argument for both variants in Arrow (and
capabilities to switch between them).


On Thu, Aug 22, 2024 at 8:49 AM Antoine Pitrou  wrote:

>
> On 22/08/2024 at 17:08, Curt Hagenlocher wrote:
> >
> > (I also happen to want a canonical Arrow representation for variant data,
> > as this type occurs in many databases but doesn't have a great
> > representation today in ADBC results. That's why I filed [Format]
> Consider
> > adding an official variant type to Arrow · Issue #42069 · apache/arrow
> > (github.com) . Of course,
> > there's no specific reason why a canonical Arrow representation for
> > variants must align with Spark and/or Iceberg.)
>
> Well, one reason is interoperability, especially as converting between
> different semi-structured encodings like this would probably be expensive.
>
> Regards
>
> Antoine.
>


Re: Seattle Arrow Meetup: August 13th, 2024

2024-08-16 Thread Weston Pace
Notes from the meetup:
https://docs.google.com/document/d/1g0_oiEE0GPQP24LAP3Z8pGSoWnw6C3bsr5If8K7c_Xw/edit?usp=sharing

Thanks to Bryce for taking notes!

On Mon, Aug 12, 2024 at 6:15 AM Weston Pace  wrote:

> The exact location has been updated in the doc.  Looking forward to seeing
> some of you tomorrow.
>
> On Thu, Jul 18, 2024 at 11:41 AM Weston Pace 
> wrote:
>
>> I'd like to announce an Arrow Meetup on August 13th, 2024 from 5:30PM to
>> 7:30PM.  Details can be found at [1].  All are welcome.
>>
>> We will be discussing what’s going on in the Arrow community and what
>> community members have planned or would like to see in the coming years.
>>
>> If you think you can make it then please RSVP by sending an email to
>> weston.p...@gmail.com (just looking for an estimated headcount for
>> planning purposes).
>>
>> If you'd like to present something that you believe fits the agenda then
>> please contact weston.p...@gmail.com.  I think it would be great to have
>> several 10-15 minute presentations if people are able to provide them.
>>
>> [1]
>> https://docs.google.com/document/d/1OL4PJ6sJeiDOEXWEFE1pgyCt84DFQtD61F4uaILQDsY/edit?usp=sharing
>>
>


Re: Seattle Arrow Meetup: August 13th, 2024

2024-08-12 Thread Weston Pace
The exact location has been updated in the doc.  Looking forward to seeing
some of you tomorrow.

On Thu, Jul 18, 2024 at 11:41 AM Weston Pace  wrote:

> I'd like to announce an Arrow Meetup on August 13th, 2024 from 5:30PM to
> 7:30PM.  Details can be found at [1].  All are welcome.
>
> We will be discussing what’s going on in the Arrow community and what
> community members have planned or would like to see in the coming years.
>
> If you think you can make it then please RSVP by sending an email to
> weston.p...@gmail.com (just looking for an estimated headcount for
> planning purposes).
>
> If you'd like to present something that you believe fits the agenda then
> please contact weston.p...@gmail.com.  I think it would be great to have
> several 10-15 minute presentations if people are able to provide them.
>
> [1]
> https://docs.google.com/document/d/1OL4PJ6sJeiDOEXWEFE1pgyCt84DFQtD61F4uaILQDsY/edit?usp=sharing
>


Re: [Discuss] Async interface for C Data Stream interface

2024-08-10 Thread Weston Pace
+1 for getting some kind of async implementation into the spec.  I have
proposed a few alternative approaches in the PR.

On Fri, Aug 9, 2024 at 1:18 PM Matt Topol  wrote:

> Hello All, I'd like to discuss the potential addition of an
> asynchronous-oriented version of the C Data Stream interface.
>
> This idea was originally brought up working with ADBC [1] as the current
> ADBC interface is inherently synchronous and requires any attempts to be
> asynchronous be created/managed/implemented by consumers rather than
> providing an asynchronous API itself. As any such API would require needing
> to handle interactions with the Arrow C Data Interface, it is better to
> upstream such concepts to the Arrow format than create something specific
> to ADBC.
>
> As such, I've created a PR [2] based on the discussions in the ADBC issue,
> as a jumping off point for more discussion. I didn't go as far as creating
> tests, examples, etc. for it yet as the core design of it might change
> based on feedback in this thread and I figured I would wait until I get
> more consensus from the community before I invest more time into it.
>
> Anyways, I'm looking for any and all feedback on the design, structure,
> concept, etc. in the hopes of getting this officially added to the Arrow
> spec.
>
> Thanks everyone!
>
> --Matt
>
> [1]: https://github.com/apache/arrow-adbc/issues/811
> [2]: https://github.com/apache/arrow/pull/43632
>


Re: [VOTE][Format] Bool8 Canonical Extension Type

2024-08-05 Thread Weston Pace
+1 (binding)

Looked through the spec & C++/python PRs.

On Mon, Aug 5, 2024 at 7:41 AM Ian Cook  wrote:

> +1 (non-binding)
>
> I reviewed the spec addition.
>
> On Mon, Aug 5, 2024 at 3:37 PM Antoine Pitrou  wrote:
>
> >
> > Binding +1 (but posted one minor comment on the format PR).
> >
> > Thank you Joel!
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 05/08/2024 at 14:59, Joel Lubinitsky wrote:
> > > Hello Devs,
> > >
> > > I would like to propose a new canonical extension type: Bool8
> > >
> > > The prior mailing list discussion thread can be found at [1].
> > > The format documentation change can be found at [2]. A copy of the text
> > is
> > > included in this email.
> > > A Go implementation can be found at [3].
> > > A C++/Python implementation can be found at [4].
> > >
> > > Thank you for your time and attention in reviewing this proposal.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Accept this proposal
> > > [ ] +0
> > > [ ] -1 Do not accept this proposal because...
> > >
> > > [1]: https://lists.apache.org/thread/nz44qllq53h6kjl3rhy0531n2n2tpfr0
> > > [2]: https://github.com/apache/arrow/pull/43234
> > > [3]: https://github.com/apache/arrow/pull/43323
> > > [4]: https://github.com/apache/arrow/pull/43488
> > >
> > > ---
> > >
> > > 8-bit Boolean
> > > =
> > >
> > > Bool8 represents a boolean value using 1 byte (8 bits) to store each
> > value
> > > instead of only 1 bit as in the original Arrow Boolean type. Although
> > less
> > > compact than the original representation, Bool8 may have better
> zero-copy
> > > compatibility with various systems that also store booleans using 1
> byte.
> > >
> > > * Extension name: ``arrow.bool8``.
> > >
> > > * The storage type of this extension is ``Int8`` where:
> > >
> > >* **false** is denoted by the value ``0``.
> > >* **true** can be specified using any non-zero value. Preferably
> > ``1``.
> > >
> > > * Extension type parameters:
> > >
> > >This type does not have any parameters.
> > >
> > > * Description of the serialization:
> > >
> > >No metadata is required to interpret the type. Any metadata present
> > > should be ignored.
> > >
> >
>
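
For illustration only (not the implementation from the PRs above): a minimal
pyarrow sketch of an extension type with Int8 storage matching the
description. The extension name here is deliberately non-canonical, since
released pyarrow versions may already register `arrow.bool8` themselves.

```python
import pyarrow as pa

class Bool8Sketch(pa.ExtensionType):
    """Boolean stored as one byte: 0 is false, any non-zero value is true."""

    def __init__(self):
        # Int8 storage, as in the proposal; the name is illustrative only.
        super().__init__(pa.int8(), "example.bool8")

    def __arrow_ext_serialize__(self):
        return b""  # the type has no parameters

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()  # any metadata present should be ignored

pa.register_extension_type(Bool8Sketch())
storage = pa.array([1, 0, 3], type=pa.int8())
arr = pa.ExtensionArray.from_storage(Bool8Sketch(), storage)
print(arr.type)  # an extension type wrapping int8 storage
```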


Re: [DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-05 Thread Weston Pace
+1 as well.  32 bit keys were chosen because the expectation was that
hashtable spilling would come along soon.  Since it didn't, I think it's a
good idea to use 64-bit keys until spilling is added.

On Mon, Aug 5, 2024 at 6:05 AM Antoine Pitrou  wrote:

>
> I don't have any concrete data to test this against, but using 64-bit
> offsets sounds like an obvious improvement to me.
>
> Regards
>
> Antoine.
>
>
> > On 01/08/2024 at 13:05, Ruoxi Sun wrote:
> > Hello everyone,
> >
> > We've identified an issue with Acero's hash join/aggregation, which is
> > currently limited to processing only up to 4GB data due to the use of
> > `uint32_t` for row offsets. This limitation not only impacts our ability
> to
> > handle large datasets but also makes typical solutions like splitting the
> > data into smaller batches ineffective.
> >
> > * Proposed solution
> > We are considering upgrading the row offsets from 32-bit to 64-bit. This
> > change would allow us to process larger datasets and expand Arrow's
> > application possibilities.
> >
> > * Trade-offs to consider
> > ** Pros: Allows handling of larger datasets, breaking the current 4GB
> limit.
> > ** Cons: Each row would consume an additional 4 bytes of memory, and
> there
> > might be slightly more CPU instructions involved in processing.
> >
> > Preliminary benchmarks indicate that the impact on CPU performance is
> > minimal, so the main consideration is the increased memory consumption.
> >
> > * We need your feedback
> > ** How would this change affect your current usage of Arrow, especially
> in
> > terms of memory consumption?
> > ** Do you have any concerns or thoughts about this proposal?
> >
> > Please review the detailed information in [1] and [2] and share your
> > feedback. Your input is crucial as we gather community insights to decide
> > whether or not to proceed with this change.
> >
> > Looking forward to your feedback and working together to enhance Arrow.
> > Thank you!
> >
> > *Regards,*
> > *Rossi SUN*
> >
>


Re: [VOTE][Format] Opaque canonical extension type

2024-07-24 Thread Weston Pace
+1 (binding)

On Wed, Jul 24, 2024 at 8:01 AM Dane Pitkin  wrote:

> +1 (non-binding)
>
> I reviewed the spec and Java implementation.
>
> On Wed, Jul 24, 2024 at 10:37 AM Ian Cook  wrote:
>
> > +1 (non-binding)
> >
> > I reviewed the spec additions.
> >
> > Ian
> >
> > On Wed, Jul 24, 2024 at 10:27 AM Jacob Wujciak-Jens 
> > wrote:
> >
> > > +1 (non binding)
> > >
> > > wish maple  wrote on Wed., 24 Jul 2024,
> > 15:12:
> > >
> > > > +1 (non-binding)
> > > >
> > > > Checked spec change and C++ impl.
> > > >
> > > > Best,
> > > > Xuwei Fu
> > > >
> > > > > Gang Wu  wrote on Wed, 24 Jul 2024, 20:51:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Checked spec change and C++ impl.
> > > > >
> > > > > On Wed, Jul 24, 2024 at 6:52 PM Joel Lubinitsky <
> joell...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Go implementation LGTM
> > > > > >
> > > > > > On Wed, Jul 24, 2024 at 5:12 AM Raúl Cumplido  >
> > > > wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Format change looks good to me. I haven't reviewed the
> individual
> > > > > > > implementations.
> > > > > > >
> > > > > > > Thanks David for leading this.
> > > > > > >
> > > > > > > On Wed, 24 Jul 2024 at 10:51, Joris Van den Bossche
> > > > > > > () wrote:
> > > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > On Wed, 24 Jul 2024 at 07:34, David Li 
> > > > wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I'd like to propose the 'Opaque' canonical extension type.
> > > Prior
> > > > > > > discussion can be found at [1] and the proposal and
> > implementations
> > > > for
> > > > > > > C++, Go, Java, and Python can be found at [2]. The proposal is
> > > > > > additionally
> > > > > > > reproduced below.
> > > > > > > > >
> > > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > > >
> > > > > > > > > [ ] +1 Accept this proposal
> > > > > > > > > [ ] +0
> > > > > > > > > [ ] -1 Do not accept this proposal because...
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > https://lists.apache.org/thread/8d5ldl5cb7mms21rd15lhpfrv4j9no4n
> > > > > > > > > [2]: https://github.com/apache/arrow/pull/41823
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > Opaque represents a type that an Arrow-based system
> received
> > > from
> > > > > an
> > > > > > > external
> > > > > > > > > (often non-Arrow) system, but that it cannot interpret.  In
> > > this
> > > > > > case,
> > > > > > > it can
> > > > > > > > > pass on Opaque to its clients to at least show that a field
> > > > exists
> > > > > > and
> > > > > > > > > preserve metadata about the type from the other system.
> > > > > > > > >
> > > > > > > > > Extension parameters:
> > > > > > > > >
> > > > > > > > > * Extension name: ``arrow.opaque``.
> > > > > > > > >
> > > > > > > > > * The storage type of this extension is any type.  If there
> > is
> > > no
> > > > > > > underlying
> > > > > > > > >   data, the storage type should be Null.
> > > > > > > > >
> > > > > > > > > * Extension type parameters:
> > > > > > > > >
> > > > > > > > >   * **type_name** = the name of the unknown type in the
> > > external
> > > > > > > system.
> > > > > > > > >   * **vendor_name** = the name of the external system.
> > > > > > > > >
> > > > > > > > > * Description of the serialization:
> > > > > > > > >
> > > > > > > > >   A valid JSON object containing the parameters as fields.
> > In
> > > > the
> > > > > > > future,
> > > > > > > > >   additional fields may be added, but all fields current
> and
> > > > future
> > > > > > > are never
> > > > > > > > >   required to interpret the array.
> > > > > > > > >
> > > > > > > > >   Developers **should not** attempt to enable public
> semantic
> > > > > > > interoperability
> > > > > > > > >   of Opaque by canonicalizing specific values of these
> > > > parameters.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Seattle Arrow Meetup: August 13th, 2024

2024-07-18 Thread Weston Pace
I'd like to announce an Arrow Meetup on August 13th, 2024 from 5:30PM to
7:30PM.  Details can be found at [1].  All are welcome.

We will be discussing what’s going on in the Arrow community and what
community members have planned or would like to see in the coming years.

If you think you can make it then please RSVP by sending an email to
weston.p...@gmail.com (just looking for an estimated headcount for planning
purposes).

If you'd like to present something that you believe fits the agenda then
please contact weston.p...@gmail.com.  I think it would be great to have
several 10-15 minute presentations if people are able to provide them.

[1]
https://docs.google.com/document/d/1OL4PJ6sJeiDOEXWEFE1pgyCt84DFQtD61F4uaILQDsY/edit?usp=sharing


Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Weston Pace
> I think my question is still relevant: no matter what semantics
> `S3FileSystem` is trying to provide, I'm still not sure how the placeholder
> object helps. I assume it's for listing objects, but what else?

If I have a local filesystem and I delete a file /foo/bar then I still
expect the directory /foo to exist.


```
mkdir /foo
touch /foo/bar
rm /foo/bar
ls /  # should show /foo
```


In an object store there is no `mkdir` and, even if I remove /foo/bar, there
is no guarantee that /foo will exist.
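
For illustration only: the same point expressed with pyarrow's filesystem
layer, assuming a writable /tmp. The S3 half is omitted since it needs a live
bucket.

```python
from pyarrow import fs

local = fs.LocalFileSystem()
local.create_dir("/tmp/foo")
with local.open_output_stream("/tmp/foo/bar") as f:
    f.write(b"x")
local.delete_file("/tmp/foo/bar")

# On a real filesystem the parent directory survives the delete...
print(local.get_file_info("/tmp/foo").type)  # FileType.Directory

# ...which is the guarantee S3FileSystem emulates with empty-directory
# markers, since a bare object store has no directories to preserve.
```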

On Fri, Jul 12, 2024, 2:50 PM Aldrin  wrote:

> But I think the issue being addressed [1] is essentially, "`delete_file`
> shouldn't create additional files/directories in S3."
>
> I think discussion about the semantics at large is interesting but may be
> a digression? Also, I think there are varying degrees of "filesystem
> semantics" that are even being discussed (the naming system and
> hierarchical inode structure vs atomicity of read/write operations).
>
> I think my question is still relevant: no matter what semantics
> `S3FileSystem` is trying to provide, I'm still not sure how the placeholder
> object helps. I assume it's for listing objects, but what else?
>
>
> [1]: https://github.com/apache/arrow/issues/36275
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> On Friday, July 12th, 2024 at 14:26, Raphael Taylor-Davies
>  wrote:
>
> > > Many people
> > > are familiar with object stores these days. You could create a new
> > > abstraction `ObjectStore` which is very similar to `FileSystem` except
> the
> > > semantics are object store semantics and not filesystem semantics.
> >
>
> > FWIW in the Arrow Rust ecosystem we only provide an object store
> > abstraction, and this has served us very well. My 2 cents is that object
> > store semantics are sufficient, if not superior [1], than filesystem
> > based interfaces for the vast majority of use cases, with the few
> > workloads that aren't sufficiently served requiring such close
> > integration with often OS-specific filesystem APIs and behaviours as to
> > make building a coherent abstraction extremely difficult.
> >
>
> > Iceberg also took a similar approach with its File IO abstraction [2].
> >
>
> > [1]:
> >
> https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
> > [2]: https://tabular.io/blog/iceberg-fileio-cloud-native-tables/
> >
>
> > On 12/07/2024 22:05, Weston Pace wrote:
> >
>
> > > > The markers are necessary to offer file system semantics on top of
> object
> > > > stores. You will get a ton of subtle bugs otherwise.
> > > > Yes, object stores and filesystems are different. If you expect your
> > > > filesystem to act like a filesystem then these things need to be
> done in
> > > > order to avoid these bugs.
> > >
>
> > > If an option modifies a filesystem to behave more like an object store
> then
> > > I don't think it's necessarily a bad thing as long as it isn't the
> > > default. By turning on the option the user is intentionally altering
> the
> > > behavior and should not be making the same expectations.
> > >
>
> > > On the other hand, there is another approach you could take. Many
> people
> > > are familiar with object stores these days. You could create a new
> > > abstraction `ObjectStore` which is very similar to `FileSystem` except
> the
> > > semantics are object store semantics and not filesystem semantics. I
> > > believe most of our filesystem classes could implement both
> `ObjectStore`
> > > and `FileSystem` abstractions without significant code duplication.
> > >
>
> > > This way, if a user wants filesystem semantics, they use a
> `FileSystem` and
> > > they pay the abstraction cost. If a user is comfortable with
> `ObjectStore`
> > > semantics they use `ObjectStore` and they don't have to pay the costs.
> > >
>
> > > This would be more work than just allowing options to violate
> FileSystem
> > > guarantees but it would provide a more clear distinction between the
> two.
> > >
>
> > > On Fri, Jul 12, 2024 at 9:25 AM Aldrin octalene@pm.me.invalid
> wrote:
> > >
>
> > > > Hello!
> > > >
>
> > > > This may be naive, but why does the empty directory marker need to
> exist
> > > > on the S

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Weston Pace
> The markers are necessary to offer file system semantics on top of object
> stores. You will get a ton of subtle bugs otherwise.

Yes, object stores and filesystems are different.  If you expect your
filesystem to act like a filesystem then these things need to be done in
order to avoid these bugs.

If an option modifies a filesystem to behave more like an object store then
I don't think it's necessarily a bad thing as long as it isn't the
default.  By turning on the option the user is intentionally altering the
behavior and should not have the same expectations.

On the other hand, there is another approach you could take.  Many people
are familiar with object stores these days.  You could create a new
abstraction `ObjectStore` which is very similar to `FileSystem` except the
semantics are object store semantics and not filesystem semantics.  I
believe most of our filesystem classes could implement both `ObjectStore`
and `FileSystem` abstractions without significant code duplication.

This way, if a user wants filesystem semantics, they use a `FileSystem` and
they pay the abstraction cost.  If a user is comfortable with `ObjectStore`
semantics they use `ObjectStore` and they don't have to pay the costs.

This would be more work than just allowing options to violate FileSystem
guarantees but it would provide a more clear distinction between the two.
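
For illustration only (not an existing Arrow API): a rough sketch of the
proposed split, where the two interfaces differ mainly in whether directories
are first-class.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Object-store semantics: a flat namespace of keys, no directories,
    so deleting the last object under a prefix needs no marker object."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list(self, prefix: str = "") -> list: ...

class FileSystem(ABC):
    """Filesystem semantics: directories exist independently of their
    contents, so delete_file must leave the parent directory in place."""

    @abstractmethod
    def create_dir(self, path: str) -> None: ...

    @abstractmethod
    def delete_file(self, path: str) -> None: ...

# An S3-backed class could implement both: ObjectStore cheaply and
# directly, FileSystem by paying for markers and extra listings.
```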


On Fri, Jul 12, 2024 at 9:25 AM Aldrin  wrote:

> Hello!
>
> This may be naive, but why does the empty directory marker need to exist
> on the S3 side at all? If a local directory is created (because filesystem
> semantics), then I am not sure why a fake object needs to exist on the
> object-store side.
>
>
>
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > Hi,
> >
>
> > The markers are necessary to offer file system semantics on top of object
> > stores. You will get a ton of subtle bugs otherwise.
> >
>
> > If instead of arrow::FileSystem, Arrow offered an arrow::ObjectStore
> > interface that wraps local filesystems and object stores with
> object-store
> > semantics (i.e. no concept of empty directory or atomic directory
> > deletion), then application developers would have more control of the
> > actions performed on the object store they are using. Cons would be
> slower
> > operations when working with a local filesystem and no concept of
> directory.
> >
>
> > > 1. Add an Option: Introduce an option in S3Options to control
> >
>
> > whether empty directory markers are created, giving users the choice.
> >
>
> > Then it wouldn't be an honest implementation of arrow::FileSystem for the
> > reasons listed above.
> >
>
> > > Change Default Behavior: Modify the default behavior to avoid
> >
>
> > creating empty directory markers when a file is deleted.
> >
>
> > That would bring in the bugs because an arrow::FileSystem instance would
> > behave differently depending on what is backing it.
> >
>
> > > 3. Smarter Directory Creation: Improve the implementation to check
> >
>
> > for other objects in the same path before creating an empty directory
> > marker.
> >
>
> > This might be a problem when more than one client or thread is mutating
> the
> > object store through the arrow::FileSystem. You can check now and once
> > you're done deleting all the other files you thought existed are deleted
> as
> > well. Very likely if clients decide to implement parallel deletion.
> >
>
> > The existing solution of always creating a marker when done is not
> perfect
> > either, but less likely to break.
> >
>
> > ## Suggested Workaround
> >
>
> > Avoiding file by file operations so that internal functions can batch as
> > much as possible.
> >
>
> > --
> > Felipe
> >
>
> >
>
> > On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo hsseo0...@gmail.com wrote:
> >
>
> > > Hello. community!
> > >
>
> > > I am currently working on addressing the issue described in "[C++]
> > > Add option to not create parent directory with S3 delete_file". In this
> > > process, I have
> > > found it necessary to gather feedback on how to best resolve this
> issue.
> > > Below is a summary and some questions I have for the community.
> > >
>
> > > ### Background
> > > Currently, the S3FileSystem generates an empty directory marker (by
> > > calling the EnsureParentExists function) when a file is deleted and the
> > > directory becomes empty. This behavior maintains the appearance of the
> > > directory structure. However, there have been issues raised by users
> > > > regarding this behavior in issue [1].
> > >
>
> > > ### Why Maintain Empty Directory Markers?
> > > From what I understand, object stores like S3 do not have a concept of
> > > directories. The motivation behind maintaining these markers could be
> to
> > > manage the object store as if it were a traditional file system. If
> anyone
> > > kno

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Weston Pace
dict.items():
> > pa_type = _convert_to_arrow_type(field, typ)
> > columns.append(pa.field(field, pa_type))
> > schema = pa.schema(columns)
> > return schema
> >
> > -Original Message-
> > From: Lee, David (PAG) 
> > Sent: Monday, July 8, 2024 11:58 AM
> > To: dev@arrow.apache.org
> > Subject: RE: [DISCUSS] Approach to generic schema representation
> >
> >
> > I came up with my own json representation that I could put into json /
> > yaml config files with some python code to convert this into a pyarrow
> > schema object..
> >
> > - yaml flat example-
> > fields:
> >   cusip: string
> >   start_date: date32
> >   end_date: date32
> >   purpose: string
> >   source: string
> >   flow: float32
> >   flow_usd: float32
> >   currency: string
> >
> > -yaml nested example-
> > fields:
> >   cusip: string
> >   start_date: date32
> >   regions:
> >     [string]            << list of strings
> >   primary_benchmark:    << struct
> >     id: string
> >     name: string
> >   all_benchmarks:       << list of structs
> >     -
> >       id: string
> >       name: string
> >
> > Code:
> >
> > def _convert_to_arrow_type(field, obj):
> >     """
> >     :param field:
> >     :param obj:
> >     :returns: pyarrow datatype
> >
> >     """
> >     if isinstance(obj, list):
> >         for child_obj in obj:
> >             pa_type = _convert_to_arrow_type(field, child_obj)
> >             return pa.list_(pa_type)
> >     elif isinstance(obj, dict):
> >         items = []
> >         for k, child_obj in obj.items():
> >             pa_type = _convert_to_arrow_type(k, child_obj)
> >             items.append((k, pa_type))
> >         return pa.struct(items)
> >     else:
> >         if isinstance(obj, str):
> >             obj = pa.type_for_alias(obj)
> >         return obj
> >
> >
> > def _convert_to_arrow_schema(fields_dict):
> >     """
> >
> >     :param fields_dict:
> >     :returns: pyarrow schema
> >
> >     """
> >     columns = []
> >     for field, typ in fields_dict.items():
> >         if typ == "timestamp":
> >             # default timestamp to microsecond precision
> >             typ = "timestamp[us]"
> >         elif typ == "date":
> >             # default date to date32 which is an alias for date32[day]
> >             typ = "date32"
> >         elif typ == "int":
> >             # default int to int32
> >             typ = "int32"
> >         pa_type = _convert_to_arrow_type(field, typ)
> >         columns.append(pa.field(field, pa_type))
> >     schema = pa.schema(columns)
> >     return schema
> >
> > -Original Message-
> > From: Weston Pace 
> > Sent: Monday, July 8, 2024 9:43 AM
> > To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Approach to generic schema representation
> >
> > External Email: Use caution with links and attachments
> >
> >
> > +1 for empty stream/file as schema serialization.  I have used this
> > approach myself on more than one occasion and it works well.  It can even
> > be useful for transmitting schemas between different arrow-native
> libraries
> > in the same language (e.g. rust->rust) since it allows the different
> > libraries to use different arrow versions.
> >
> > There is one other approach if you only need intra-process serialization
> > (e.g. between threads / libraries in the same process).  You can use the
> C
> > data interface (https://arrow.apache.org/docs/format/CDataInterface.html).
> > It is maybe a slightly more complex API (because of the release callback)
> > and I think it is unlikely to be significantly faster (unless you have an
> > abnormally large schema).  However, it has the same advantages and might
> be
> > useful if you are already using the C data interface elsewhere.
> >
> >
> > On Mon, Jul 8, 2024 at 8:27 AM Matt Topol 
> wrote:
> >
> > > Hey Jeremy,
> > >
> > > Currently the first mess

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Weston Pace
+1 for empty stream/file as schema serialization.  I have used this
approach myself on more than one occasion and it works well.  It can even
be useful for transmitting schemas between different arrow-native libraries
in the same language (e.g. rust->rust) since it allows the different
libraries to use different arrow versions.

There is one other approach if you only need intra-process serialization
(e.g. between threads / libraries in the same process).  You can use the C
data interface (https://arrow.apache.org/docs/format/CDataInterface.html).
It is maybe a slightly more complex API (because of the release callback)
and I think it is unlikely to be significantly faster (unless you have an
abnormally large schema).  However, it has the same advantages and might be
useful if you are already using the C data interface elsewhere.
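
For illustration only: a minimal pyarrow sketch of the empty-stream approach —
the schema is written as an IPC stream with zero batches and read back from
the bytes. Field names are placeholders.

```python
import pyarrow as pa

schema = pa.schema([("cusip", pa.string()), ("flow_usd", pa.float32())])

# Write an IPC stream containing the schema and zero record batches.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema):
    pass  # no batches written
serialized = sink.getvalue()  # bytes that can live in a spec repo or config

# Any Arrow library (any language, any version) can recover the schema.
recovered = pa.ipc.open_stream(serialized).schema
assert recovered.equals(schema)
```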


On Mon, Jul 8, 2024 at 8:27 AM Matt Topol  wrote:

> Hey Jeremy,
>
> Currently the first message of an IPC stream is a Schema message which
> consists solely of a flatbuffer message and defined in the Schema.fbs file
> of the Arrow repo. All of the libraries that can read Arrow IPC should be
> able to also handle converting a single IPC schema message back into an
> Arrow schema without issue. Would that be sufficient for you?
>
> On Mon, Jul 8, 2024 at 11:12 AM Jeremy Leibs  wrote:
>
> > I'm looking for any advice folks may have on a generic way to document
> and
> > represent expected arrow schemas as part of an interface definition.
> >
> > For context, our library provides a cross-language (python, c++, rust)
> SDK
> > for logging semantic multi-modal data (point clouds, images, geometric
> > transforms, bounding boxes, etc.). Each of these primitive types has an
> > associated arrow schema, but to date we have largely abstracted that from
> > our users through language-native object types, and a bunch of generated
> > code to "serialize" stuff into the arrow buffers before transmitting via
> > our IPC.
> >
> > We're trying to take steps in the direction of making it easier for
> > advanced users to write and read data from the store directly using
> arrow,
> > without needing to go in-and-out of an intermediate object-oriented
> > representation. However, doing this means documenting to users, for
> > example: "This is the arrow schema to use when sending a point cloud
> with a
> > color channel".
> >
> > I would love it if, eventually, the arrow project had a way of defining a
> > spec file similar to a .proto or a .fbs, with all libraries supporting
> > loading of a schema object by directly parsing the spec. Has anyone taken
> > steps in this direction?
> >
> > The best alternative I have at the moment is to redundantly define the
> > schema for each of the 3 languages implicitly by directly providing the
> > code to construct a datatype instance with the correct schema. But this
> > feels unfortunately messy and hard to maintain.
> >
> > Thanks,
> > Jeremy
> >
>


Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread Weston Pace
pyarrow uses C++ code internally.  With the large files I would guess that
less than 0.1% of your pyarrow benchmark is spent in the Python interpreter.

Given this fact, my main advice is to not worry too much about the
difference between pyarrow and C++ Arrow.  A lot of work goes into pyarrow to
make sure it not only uses the C++ library efficiently but also picks the best
C++ APIs and default configurations.

I suspect the reason for the difference is that pyarrow uses the datasets
API internally (I'm pretty sure) even for single-file reads now (this
allows us to have consistent behavior).  This is also an asynchronous path
internally.  Even with OMP_NUM_THREADS=1 there still might be some parallel
I/O going on (depending on how many row groups are in your file, etc.).

When the file is small enough then the interpreter overhead is probably
large enough to outweigh any benefits gained from the better configuration
that pyarrow is doing.

On Wed, Jun 12, 2024 at 10:32 PM J N  wrote:

> Hello,
> We all know that there is inherent overhead in Python, and we wanted to
> compare the performance of reading data using C++ Arrow against PyArrow for
> high throughput systems. Since I couldn't find any benchmarks online for
> this comparison, I decided to create my own. These programs read a Parquet
> file into arrow::Table in both C++ and Python, and are single threaded.
>
> Carrow benchmark -
> https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> Pyarrow benchmark -
> https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
>
> Ps: I am new to arrow so some things might be inefficient in both
>
> They read a zstd compressed parquet file of around 300MB.
> The results were very different than what we expected.
> *Pyarrow*
> Total time: 5.347517251968384 seconds
>
> *C++ Arrow*
> Total time: 5.86806 seconds
>
> For smaller files however (0.5MB), c++ arrow was better
>
> *Pyarrow*
> gzip
> Total time: 0.013672113418579102 seconds
>
> *C++ Arrow*
> Total time: 0.00501744 seconds
> (carrow 10x better)
>
> So I have a question for the Arrow experts: is this expected in the Arrow
> world, or is there some error in my benchmark?
>
> Thank you!
>
>
> --
> Warm Regards,
>
> Jay Narale
>


Seattle Arrow meetup (adjacent to post::conf)

2024-05-29 Thread Weston Pace
I've noticed that a number of Arrow people will be in Seattle in August.  I
know there are a number of Arrow contributors that live in the Seattle area
as well.  I'd like to organize a face-to-face meetup for the Arrow
community and have created an issue for discussion[1].  I welcome any
input, feedback, or interest!

Note: as mentioned on the issue, no decisions will be made at the meetup,
this is for community building and general discussion only and I will do my
best to make everything publicly available afterwards.

[1] https://github.com/apache/arrow/issues/41881


Re: [DISCUSS] Drop Java 8 support

2024-05-24 Thread Weston Pace
No vote is required from an ASF perspective (this is not a release).
No vote is required by Arrow conventions (this is not a spec change and
does not impact more than one implementation).

I will send a message to the parquet ML to solicit feedback.

On Fri, May 24, 2024 at 8:22 AM Laurent Goujon 
wrote:

> I would say so because it is akin to removing a large feature but maybe
> some PMC can chime in?
>
> Laurent
>
> On Tue, May 21, 2024 at 12:16 PM Dane Pitkin  wrote:
>
> > I haven't been active in Apache Parquet, but I did not see any prior
> > discussions on this topic in their Jira or dev mailing list.
> >
> > Do we think a vote is needed before officially moving forward with Java 8
> > deprecation?
> >
> > On Mon, May 20, 2024 at 12:50 PM Laurent Goujon
>  > >
> > wrote:
> >
> > > I also mentioned Apache Parquet and haven't seen someone mentioned
> > if/when
> > > Apache Parquet would transition.
> > >
> > >
> > >
> > > On Fri, May 17, 2024 at 9:07 AM Dane Pitkin 
> wrote:
> > >
> > > > Fokko, thank you for these datapoints! It's great to see how other
> low
> > > > level Java OSS projects are approaching this.
> > > >
> > > > JB, I believe yes we have formal consensus to drop Java 8 in Arrow.
> > There
> > > > was no contention in current discussions across [GitHub issues |
> Arrow
> > > > Mailing List | Community Syncs].
> > > >
> > > > We can save Java 11 deprecation for a future discussion. For users on
> > > Java
> > > > 11, I do anticipate this discussion to come shortly after Java 8
> > > > deprecation is released.
> > > >
> > > > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong 
> > > > wrote:
> > > >
> > > > > I was traveling the last few weeks, so just a follow-up from my
> end.
> > > > >
> > > > > Fokko, can you elaborate on the discussions held in other OSS
> > projects
> > > to
> > > > >> drop Java <17? How did they weigh the benefits/drawbacks for
> > dropping
> > > > both
> > > > >> Java 8 and 11 LTS versions? I'd also be curious if other projects
> > plan
> > > > to
> > > > >> support older branches with security patches.
> > > > >
> > > > >
> > > > > So, the ones that I'm involved with (including a TLDR):
> > > > >
> > > > >- Avro:
> > > > >   - (April 2024: Consensus on moving to 11+, +1 for moving to
> > 17+)
> > > > >
> > https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > > > >   - (Jan 2024: Consensus on dropping 8)
> > > > >
> > https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > > > >   - Iceberg:
> > > > >   - (Jan 2023: Concerns about Hive):
> > > > >
> > https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > > > >   - (Feb 2024: Concensus to drop Hadoop 2.x, and move to
> JDK11+,
> > > > >   also +1's for moving to 17+):
> > > > >
> > https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > > > >
> > > > > I think the most noteworthy (slow-moving in general):
> > > > >
> > > > >- Spark 4 supports JDK 17+
> > > > >- Hive 4 is still on Java 8
> > > > >
> > > > >
> > > > > It looks like most of the projects are looking at each other. Keep
> in
> > > > > mind, that projects that still support older versions of Java, can
> > > still
> > > > > use older versions of Arrow.
> > > > >
> > > > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > > > (in case the image doesn't come through, that's Spiderman pointing
> at
> > > > > Spiderman)
> > > > >
> > > > > Concerning the Java 11 support, some data:
> > > > >
> > > > >- Oracle 11: support until January 2032 (extended fee has been
> > > waived)
> > > > >- Cornetto 11: September 2027
> > > > >- Adoptium 11: At least Oct 2027
> > > > >- Zulu 11: Jan 2032
> > > > >- OpenJDK11: October 2024
> > > > >
> > > > > I think it is fair to support 11 for the time being, but at some
> > point,
> > > > we
> > > > > also have to move on and start exploiting the new features and make
> > > sure
> > > > > that we keep up to date. For example, Java 8 also has extended
> > support
> > > > > until 2030. Dependabot on the Iceberg project
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3Adependencies
> > > > >
> > > > > nicely shows which projects are already at JDK11+ :)
> > > > >
> > > > > Thanks Dane for driving this!
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, 17 May 2024 at 07:44, Jean-Baptiste Onofré
> > > > >  wrote:
> > > > >
> > > > >> Hi Dane
> > > > >>
> > > > >> Do we have a formal consensus about Java version in regards of
> arrow
> > > > >> version ?
> > > > >> I agree with the plan but just wondering if it’s ok from everyone
> > with
> > > > the
> > > > >> community.
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> On Thu, 16 May 2024 at 18:05, Dane Pitkin  wrote:
> > > > >>
> > > > >> > To wra

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Weston Pace
> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

This has been something that has always been desired for the Arrow IPC
format too.

My preference would be (apologies if this has been mentioned before):

- Agree on how statistics should be encoded into an array (this is not
hard, we just have to agree on the field order and the data type for
null_count)
- If you need statistics in the schema then simply encode the 1-row batch
into an IPC buffer (using the streaming format), or maybe just an IPC
RecordBatch message since the schema is fixed, and store those bytes in the
schema (a sketch follows below)
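
For illustration only: a pyarrow sketch of that second bullet. The statistics
field layout here is a placeholder, not an agreed spec.

```python
import pyarrow as pa

# Hypothetical statistics layout; the real field order and types are
# exactly what still needs to be agreed on.
stats_batch = pa.record_batch(
    [pa.array(["x"]), pa.array([0]), pa.array([41]), pa.array([0])],
    names=["column_name", "min", "max", "null_count"],
)

# Encode the 1-row batch with the IPC streaming format.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, stats_batch.schema) as writer:
    writer.write_batch(stats_batch)
stats_bytes = sink.getvalue().to_pybytes()

# Store those bytes in the data schema's metadata.
data_schema = pa.schema([("x", pa.int64())]).with_metadata(
    {b"statistics": stats_bytes}
)

# A consumer recovers the statistics without any new C interface.
recovered = pa.ipc.open_stream(data_schema.metadata[b"statistics"]).read_all()
```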



On Fri, May 24, 2024 at 1:20 AM Sutou Kouhei  wrote:

> Hi,
>
> Could you explain more about your idea? Does it propose that
> we add more callbacks to ArrowArrayStream such as
> ArrowArrayStream::get_statistics()? Or Does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May
> 2024 06:55:40 -0700,
>   Curt Hagenlocher  wrote:
>
> >>  would it be easier to request statistics at a higher level of
> > abstraction?
> >
> > What if there were a "single table provider" level of abstraction between
> > ADBC and ArrowArrayStream as a C API; something that can report
> statistics
> > and apply simple predicates?
> >
> > On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> >  wrote:
> >
> >> Thank you for the background! I understand that these statistics are
> >> important for query planning; however, I am not sure that I follow why
> >> we are constrained to the ArrowSchema to represent them. The examples
> >> given seem to going through Python...would it be easier to request
> >> statistics at a higher level of abstraction? There would already need
> >> to be a separate mechanism to request an ArrowArrayStream with
> >> statistics (unless the PyCapsule `requested_schema` argument would
> >> suffice).
> >>
> >> > ADBC may be a bit larger to use only for transmitting
> >> > statistics. ADBC has statistics related APIs but it has more
> >> > other APIs.
> >>
> >> Some examples of producers given in the linked threads (Delta Lake,
> >> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> >> can implement an ADBC driver without defining all the methods (where
> >> the producer could call AdbcConnectionGetStatistics(), although
> >> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >> exist). One example listed (using an Arrow Table as a source) seems a
> >> bit light to wrap in an ADBC driver; however, it would not take much
> >> code to do so and the overhead of getting the reader via ADBC it is
> >> something like 100 microseconds (tested via the ADBC R package's
> >> "monkey driver" which wraps an existing stream as a statement). In any
> >> case, the bulk of the code is building the statistics array.
> >>
> >> > How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >>
> >> Whatever format for statistics is decided on, I imagine it should be
> >> exactly the same as the ADBC standard? (Perhaps pushing changes
> >> upstream if needed?).
> >>
> >> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice
> >> >
> >> > It seems that we should use the approach because all
> >> > feedback said so. How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >> >
> >> > | Field Name   | Field Type| Comments |
> >> > |--|---|  |
> >> > | column_name  | utf8  | (1)  |
> >> > | statistic_key| utf8 not null | (2)  |
> >> > | statistic_value  | VALUE_SCHEMA not null |  |
> >> > | statistic_is_approximate | bool not null | (3)  |
> >> >
> >> > 1. If null, then the statistic applies to the entire table.
> >> >It's for "row_count".
> >> > 2. We'll provide pre-defined keys such as "max", "min",
> >> >"byte_width" and "distinct_count" but users can also use
> >> >application specific keys.
> >> > 3. If true, then the value is approximate or best-effort.
> >> >
> >> > VALUE_SCHEMA is a dense union with members:
> >> >
> >> > | Field Name | Field Type |
> >> > |||
> >> > | int64  | int64  |
> >> > | uint64 | uint64 |
> >> > | float64| float64|
> >> > | binary | binary |
> >> >
> >> > If a column is an int32 column, it uses int64 for
> >> > "max"/"min". We don't provide all types here. Users should
> >> > use a compatible type (int64 for a int32 column) in

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-20 Thread Weston Pace
+1 (binding)

I also tested on Ubuntu 22.04 with USE_CONDA=1
dev/release/verify-release-candidate.sh 12 4

On Mon, May 20, 2024 at 5:20 AM David Li  wrote:

> My vote: +1 (binding)
>
> Are any other PMC members able to take a look?
>
> On Fri, May 17, 2024, at 23:36, Dewey Dunnington wrote:
> > +1 (binding)
> >
> > Tested with MacOS M1 using TEST_YUM=0 TEST_APT=0 USE_CONDA=1
> > ./verify-release-candidate.sh 12 4
> >
> > On Fri, May 17, 2024 at 9:46 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> +1 (non binding)
> >>
> >> Testing on MacOS M2.
> >>
> >> Regards
> >> JB
> >>
> >> On Wed, May 15, 2024 at 7:00 AM David Li  wrote:
> >> >
> >> > Hello,
> >> >
> >> > I would like to propose the following release candidate (RC4) of
> Apache Arrow ADBC version 12. This is a release consisting of 56 resolved
> GitHub issues [1].
> >> >
> >> > Please note that the versioning scheme has changed.  This is the 12th
> release of ADBC, and so is called version "12".  The subcomponents,
> however, are versioned independently:
> >> >
> >> > - C/C++/GLib/Go/Python/Ruby: 1.0.0
> >> > - C#: 0.12.0
> >> > - Java: 0.12.0
> >> > - R: 0.12.0
> >> > - Rust: 0.12.0
> >> >
> >> > These are the versions you will see in the source and in actual
> packages.  The next release will be "13", and the subcomponents will
> increment their versions independently (to either 1.1.0, 0.13.0, or
> 1.0.0).  At this point, there is no plan to release subcomponents
> independently from the project as a whole.
> >> >
> >> > Please note that there is a known issue when using the Flight SQL and
> Snowflake drivers at the same time on x86_64 macOS [12].
> >> >
> >> > This release candidate is based on commit:
> 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
> >> >
> >> > The source release rc4 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8].
> >> > The changelog is located at [9].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [10] for how to validate a release candidate.
> >> >
> >> > See also a verification result on GitHub Actions [11].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow ADBC 12
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow ADBC 12 because...
> >> >
> >> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> TEST_APT=0 TEST_YUM=0`.)
> >> >
> >> > [1]:
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+12%22+is%3Aclosed
> >> > [2]:
> https://github.com/apache/arrow-adbc/commit/50cb9de621c4d72f4aefd18237cb4b73b82f4a0e
> >> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-12-rc4/
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [7]:
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> > [8]:
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-12-rc4
> >> > [9]:
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-12-rc4/CHANGELOG.md
> >> > [10]:
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> > [11]: https://github.com/apache/arrow-adbc/actions/runs/9089931356
> >> > [12]: https://github.com/apache/arrow-adbc/issues/1841
>


Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread Weston Pace
Congrats Dane!

On Tue, May 7, 2024, 7:30 AM Nic Crane  wrote:

> Congrats Dane, well deserved!
>
> On Tue, 7 May 2024 at 15:16, Gang Wu  wrote:
> >
> > Congratulations Dane!
> >
> > Best,
> > Gang
> >
> > On Tue, May 7, 2024 at 10:12 PM Ian Cook  wrote:
> >
> > > Congratulations Dane!
> > >
> > > On Tue, May 7, 2024 at 10:10 AM Alenka Frim  > > .invalid>
> > > wrote:
> > >
> > > > Yay, congratulations Dane!!
> > > >
> > > > On Tue, May 7, 2024 at 4:00 PM Rok Mihevc 
> wrote:
> > > >
> > > > > Congrats Dane!
> > > > >
> > > > > Rok
> > > > >
> > > > > On Tue, May 7, 2024 at 3:57 PM wish maple 
> > > > wrote:
> > > > >
> > > > > > Congrats!
> > > > > >
> > > > > > Best,
> > > > > > Xuwei Fu
> > > > > >
> > > > > > Joris Van den Bossche  wrote on Tue, May 7, 2024 at 21:53:
> > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Dane
> Pitkin
> > > > has
> > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > Welcome,
> > > > > > > and thank you for your contributions!
> > > > > > >
> > > > > > > Joris
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Weston Pace
I think "inheritance" and "composition" are more concerns for
implementations than they are for spec (I could be wrong here).

So it seems that it would be sufficient to write the HLLSKETCH's canonical
definition as "this is an extension of the JSON logical type and supports
all the same storage types" and then allow implementations to use whatever
inheritance / composition scheme they want to behind the scenes.
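
To make that concrete, here is a minimal pyarrow sketch of the composition
approach; the `example.hllsketch` name and the `HllSketchType` class are
hypothetical, not part of any spec — the point is only that the extension
declares the same utf8 storage that `arrow.json` allows:

```
import pyarrow as pa

# Hypothetical HLLSKETCH extension type whose storage is utf8, i.e. one of
# the storage types the arrow.json canonical extension type permits.
class HllSketchType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.string(), "example.hllsketch")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(HllSketchType())

storage = pa.array(['{"version": 1, "sketch": []}'], pa.string())
sketches = pa.ExtensionArray.from_storage(HllSketchType(), storage)
```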

On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:

> I think the biggest blocker to doing this is the way that we pass extension
> types through IPC. Extension types are sent as their underlying storage
> type with metadata key-value pairs of specific keys "ARROW:extension:name"
> and "ARROW:extension:metadata". Since you can't have multiple values for
> the same key in the metadata, this would prevent the ability to define an
> extension type in terms of another extension type as you wouldn't be able
> to include the metadata for the second-level extension part.
>
> i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> storage type. So the storage type needs to be a valid core Arrow data type
> for this reason.
>
> On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
>
> > The vote on adding a JSON canonical extension type [1] got me wondering:
> Is
> > it possible to define an extension type that is based on a canonical
> > extension type? If so, how?
> >
> > For example, say I wanted to define a (non-canonical) HLLSKETCH extension
> > type that corresponds to the type that Redshift uses for HyperLogLog
> > sketches and is represented as JSON [2]. Is there a way to do this by
> > building on the JSON canonical extension type?
> >
> > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> >
> > Ian
> >
>


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)
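
For anyone skimming the thread, a minimal pyarrow sketch of what the proposed
storage looks like — this only builds the FixedSizeBinary(16) storage directly
and does not register an extension type:

```
import uuid

import pyarrow as pa

# UUIDs stored as 16 raw bytes (uuid.UUID.bytes is already big-endian).
values = [uuid.uuid4().bytes for _ in range(3)]
storage = pa.array(values, type=pa.binary(16))
print(storage.type)  # fixed_size_binary[16]
```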

On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc  wrote:

> Thanks for all the reviews and comments! I've included the big-endian
> requirement so the proposed language is now as below.
> I'll leave the vote open until after the May holiday.
>
> Rok
>
> UUID
> 
>
> * Extension name: `arrow.uuid`.
>
> * The storage type of the extension is ``FixedSizeBinary`` with a length of
> 16 bytes.
>
> .. note::
>A specific UUID version is not required or guaranteed. This extension
> represents
>UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not
> interpret the bytes in any way.
>


Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)

I agree we should be explicit about RFC-8259
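
As a rough illustration of how this travels over IPC (a sketch, not normative
text — the column name `payload` is arbitrary), the extension is just a utf8
column whose field carries the two well-known metadata keys:

```
import pyarrow as pa

field = pa.field(
    "payload",
    pa.string(),
    metadata={
        "ARROW:extension:name": "arrow.json",
        "ARROW:extension:metadata": "",
    },
)
schema = pa.schema([field])
data = pa.array(['{"a": 1}', "[1, 2, 3]"], pa.string())
batch = pa.RecordBatch.from_arrays([data], schema=schema)
```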

On Mon, Apr 29, 2024 at 4:46 PM David Li  wrote:

> +1 (binding)
>
> assuming we explicitly state RFC-8259
>
> On Tue, Apr 30, 2024, at 08:02, Matt Topol wrote:
> > +1 (binding)
> >
> > On Mon, Apr 29, 2024 at 5:36 PM Ian Cook  wrote:
> >
> >> +1 (non-binding)
> >>
> >> I added a comment in the PR suggesting that we explicitly refer to
> RFC-8259
> >> in CanonicalExtensions.rst.
> >>
> >> On Mon, Apr 29, 2024 at 1:21 PM Micah Kornfield 
> >> wrote:
> >>
> >> > +1, I added a comment to the PR because I think we should recommend
> >> > implementations specifically reject parsing Binary arrays with the
> >> > annotation in-case we want to support non-UTF8 encodings in the future
> >> > (even thought IIRC these aren't really JSON spec compliant).
> >> >
> >> > On Fri, Apr 19, 2024 at 1:24 PM Rok Mihevc 
> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > Following discussions [1][2] and preliminary implementation work (by
> >> > > Pradeep Gollakota) [3] I would like to propose a vote to add
> language
> >> for
> >> > > JSON canonical extension type to CanonicalExtensions.rst as in PR
> [4]
> >> and
> >> > > written below.
> >> > > A draft C++ implementation PR can be seen here [3].
> >> > >
> >> > > [1]
> https://lists.apache.org/thread/p3353oz6lk846pnoq6vk638tjqz2hm1j
> >> > > [2]
> https://lists.apache.org/thread/7xph3476g9rhl9mtqvn804fqf5z8yoo1
> >> > > [3] https://github.com/apache/arrow/pull/13901
> >> > > [4] https://github.com/apache/arrow/pull/41257 <- proposed change
> >> > >
> >> > >
> >> > > The vote will be open for at least 72 hours.
> >> > >
> >> > > [ ] +1 Accept this proposal
> >> > > [ ] +0
> >> > > [ ] -1 Do not accept this proposal because...
> >> > >
> >> > >
> >> > > JSON
> >> > > 
> >> > >
> >> > > * Extension name: `arrow.json`.
> >> > >
> >> > > * The storage type of this extension is ``StringArray`` or
> >> > >   or ``LargeStringArray`` or ``StringViewArray``.
> >> > >   Only UTF-8 encoded JSON is supported.
> >> > >
> >> > > * Extension type parameters:
> >> > >
> >> > >   This type does not have any parameters.
> >> > >
> >> > > * Description of the serialization:
> >> > >
> >> > >   Metadata is either an empty string or a JSON string with an empty
> >> > object.
> >> > >   In the future, additional fields may be added, but they are not
> >> > required
> >> > >   to interpret the array.
> >> > >
> >> > >
> >> > >
> >> > > Rok
> >> > >
> >> >
> >>
>


Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Weston Pace
df situation, it came up with what happens if you pass a
> non-struct
> > > column to the from_arrow_device method which returns a cudf::table?
> > Should
> > > it error, or should it create a table with a single column?
> >
> > Presumably it should just error? I can see this being ambiguous if there
> > were an API that dynamically returned either a table or a column based on
> > the input shape (where before it would be less ambiguous since you'd
> > explicitly pass pa.RecordBatch or pa.Array, and now it would be ambiguous
> > since you only pass ArrowDeviceArray). But it doesn't sound like that's
> the
> > case?
> >
> > On Tue, Apr 23, 2024, at 11:15, Weston Pace wrote:
> > > I tend to agree with Dewey.  Using run-end-encoding to represent a
> scalar
> > > is clever and would keep the c data interface more compact.  Also, a
> > struct
> > > array is a superset of a record batch (assuming the metadata is kept in
> > the
> > > schema).  Consumers should always be able to deserialize into a struct
> > > array and then downcast to a record batch if that is what they want to
> do
> > > (raising an error if there happen to be nulls).
> > >
> > >> Depending on the function in question, it could be valid to pass a
> > struct
> > >> column vs a record batch with different results.
> > >
> > > Are there any concrete examples where this is the case?  The closest
> > > example I can think of is something like the `drop_nulls` function,
> > which,
> > > given a record batch, would choose to drop rows where any column is
> null
> > > and, given an array, only drops rows where the top-level struct is
> null.
> > > However, it might be clearer to just give the two functions different
> > names
> > > anyways.
> > >
> > > On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
> > >  wrote:
> > >
> > >> Thank you for the background!
> > >>
> > >> I still wonder if these distinctions are the responsibility of the
> > >> ArrowSchema to communicate (although perhaps links to the specific
> > >> discussions would help highlight use-cases that I am not envisioning).
> > >> I think these distinctions are definitely important in the contexts
> > >> you mentioned; however, I am not sure that the FFI layer is going to
> > >> be helpful.
> > >>
> > >> > In the libcudf situation, it came up with what happens if you pass a
> > >> non-struct
> > >> > column to the from_arrow_device method which returns a cudf::table?
> > >> Should
> > >> > it error, or should it create a table with a single column?
> > >>
> > >> I suppose that I would have expected two functions (one to create a
> > >> table and one to create a column). As a consumer I can't envision a
> > >> situation where I would want to import an ArrowDeviceArray but where I
> > >> would want some piece of run-time information to decide what the
> > >> return type of the function would be? (With apologies if I am missing
> > >> a piece of the discussion).
> > >>
> > >> > If A and B have different lengths, this is invalid
> > >>
> > >> I believe several array implementations (e.g., numpy, R) are able to
> > >> broadcast/recycle a length-1 array. Run-end-encoding is also an option
> > >> that would make that broadcast explicit without expanding the scalar.
> > >>
> > >> > Depending on the function in question, it could be valid to pass a
> > >> struct column vs a record batch with different results.
> > >>
> > >> If this is an important distinction for an FFI signature of a UDF,
> > >> there would probably be a struct definition for the UDF where there
> > >> would be an opportunity to make this distinction (and perhaps others
> > >> that are relevant) without loading this concept onto the existing
> > >> structs.
> > >>
> > >> > If no flags are set, then the behavior shouldn't change
> > >> > from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,
> then
> > it
> > >> > should error unless calling ImportRecordBatch.
> > >>
> > >> I am not sure I would have expected that (since a struct array has an
> > >> unambiguous interpretation as a record batch and as a user I've very
> > >> explicitly dec

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Weston Pace
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> official . They are advising not to use Parquet V2 for writing (though
code
> is available ) .*

This would be news to me.  Parquet releases are listed (by the parquet
community) at [1]

The vote to release parquet 2.10 is here: [2]

Neither of these links mention anything about this being an experimental,
unofficial, or non-finalized release.

I understand your concern.  I believe your quotes are coming from your
discussion on the parquet mailing list here [3].  This communication is
unfortunate and confusing to me as well.

[1] https://parquet.apache.org/blog/
[2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
[3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
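
For anyone who wants to stay on an older format version for compatibility, it
can be pinned explicitly when writing (a minimal sketch; the table contents
and file name are placeholders):

```
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})

# The default is currently format version "2.6"; pass "1.0" (or "2.4") if a
# downstream reader cannot handle the newer logical types yet.
pq.write_table(table, "example.parquet", version="1.0")
```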


On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo  wrote:

> Hello Jacob,
> Thanks for the information, and my apologies for the weird format of my
> email.
>
> This is the email from the Parquet community. May I know why pyarrow is
> using Parquet V2 which is not official yet ?
>
> My question is from Parquet community V2 is not final yet so it is not
> official yet.
> "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> as a standard isn't finalized just yet. Meaning there is no formal,
> *finalized* "contract" that specifies what it means to write data in the V2
> version. The discussions/conversations about what the final V2 standard may
> be are still in progress and are evolving.
>
> That being said, because V2 code does exist (though unfinalized), there are
> clients / tools that are writing data in the un-finalized V2 format, as
> seems to be the case with Dremio.
>
> Now, as that comment you quoted said, you can have Spark write V2 files,
> but it's worth being mindful about the fact that V2 is a moving target and
> can (and likely will) change. You can overwrite parquet.writer.version to
> specify your desired version, but it can be dangerous to produce data in a
> moving-target format. For example, let's say you write a bunch of data in
> Parquet V2, and then the community decides to make a breaking change (which
> is completely fine / allowed since V2 isn't finalized). You are now left
> having to deal with a potentially large and complicated file format update.
> That's why it's not recommended to write files in parquet v2 just yet."
>
>
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> official . They are advising not to use Parquet V2 for writing (though code
> is available ) .*
>
>
> *As per above Spark hasn't started using Parquet V2 for writing *.
>
> May I know how an unstable /unofficial  version is being used in pyarrow ?
>
>
> On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak 
> wrote:
>
> > Hello,
> >
> > First off, please try to clean up formating of emails to be legible when
> > forwarding/quoting previous messages multiple times, especially when most
> > of the quotes do not contain any useful information. It makes it much
> > easier to parse the message and thus quicker to answer.
> >
> > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > the default to enable the usage of features these versions provide. As
> you
> > have correctly quoted from the docs you can still write 1.0 if you want
> to
> > ensure compatibility with systems that can not process the 'newer'
> versions
> > yet (2.6 was released in 2018!).
> >
> > You can find the long form discussions about these changes here:
> > https://issues.apache.org/jira/browse/ARROW-12203
> > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> >
> > Best
> > Jacob
> >
> > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > Hello Team,
> > > Could you please share your thoughts about below questions?
> > > Sent from my iPhone
> > >
> > > Begin forwarded message:
> > >
> > > > From: Prem Sahoo 
> > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > To: dev-ow...@arrow.apache.org
> > > > Subject: Re: PyArrow Using Parquet V2
> > > >
> > > > dev@arrow.apache.org
> > > > Sent from my iPhone
> > > >
> > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo 
> > wrote:
> > > >>>
> > > >> Hello Team,
> > > >> Could anyone please help me on below query?
> > > >> Sent from my iPhone
> > > >>
> > >  On Apr 22, 2024, at 10:01 PM, Prem Sahoo 
> > wrote:
> > > > On Apr 22, 2024, at 9:51 PM, Prem Sahoo 
> > wrote:
> > > > Hello Team,
> > > > I have a question regarding Parquet V2 writing thro pyarrow .
> > > > As per below Pyarrow started writing Parquet in V2 encoding.
> > > >
> >
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > >
> > > > version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > Determine which Parquet logical types are available for use,
> > whether the reduced set fro

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-22 Thread Weston Pace
I tend to agree with Dewey.  Using run-end-encoding to represent a scalar
is clever and would keep the c data interface more compact.  Also, a struct
array is a superset of a record batch (assuming the metadata is kept in the
schema).  Consumers should always be able to deserialize into a struct
array and then downcast to a record batch if that is what they want to do
(raising an error if there happen to be nulls).

> Depending on the function in question, it could be valid to pass a struct
> column vs a record batch with different results.

Are there any concrete examples where this is the case?  The closest
example I can think of is something like the `drop_nulls` function, which,
given a record batch, would choose to drop rows where any column is null
and, given an array, only drops rows where the top-level struct is null.
However, it might be clearer to just give the two functions different names
anyways.
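
Going back to the run-end-encoding point above, a minimal pyarrow sketch of a
"broadcast scalar" — one value expanded to an arbitrary length without
materializing it (the length and value here are arbitrary):

```
import pyarrow as pa

# A length-1000 "scalar" 42: one run end and one value instead of 1000 elements.
run_ends = pa.array([1000], pa.int32())
values = pa.array([42], pa.int64())
broadcast = pa.RunEndEncodedArray.from_arrays(run_ends, values)
print(len(broadcast))  # 1000
```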

On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
 wrote:

> Thank you for the background!
>
> I still wonder if these distinctions are the responsibility of the
> ArrowSchema to communicate (although perhaps links to the specific
> discussions would help highlight use-cases that I am not envisioning).
> I think these distinctions are definitely important in the contexts
> you mentioned; however, I am not sure that the FFI layer is going to
> be helpful.
>
> > In the libcudf situation, it came up with what happens if you pass a
> non-struct
> > column to the from_arrow_device method which returns a cudf::table?
> Should
> > it error, or should it create a table with a single column?
>
> I suppose that I would have expected two functions (one to create a
> table and one to create a column). As a consumer I can't envision a
> situation where I would want to import an ArrowDeviceArray but where I
> would want some piece of run-time information to decide what the
> return type of the function would be? (With apologies if I am missing
> a piece of the discussion).
>
> > If A and B have different lengths, this is invalid
>
> I believe several array implementations (e.g., numpy, R) are able to
> broadcast/recycle a length-1 array. Run-end-encoding is also an option
> that would make that broadcast explicit without expanding the scalar.
>
> > Depending on the function in question, it could be valid to pass a
> struct column vs a record batch with different results.
>
> If this is an important distinction for an FFI signature of a UDF,
> there would probably be a struct definition for the UDF where there
> would be an opportunity to make this distinction (and perhaps others
> that are relevant) without loading this concept onto the existing
> structs.
>
> > If no flags are set, then the behavior shouldn't change
> > from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> > should error unless calling ImportRecordBatch.
>
> I am not sure I would have expected that (since a struct array has an
> unambiguous interpretation as a record batch and as a user I've very
> explicitly decided that I want one, since I'm using that function).
>
> In the other direction, I am not sure a producer would be able to set
> these flags without breaking backwards compatibility with earlier
> producers that did not set them (since earlier threads have suggested
> that it is good practice to error when an unsupported flag is
> encountered).
>
> On Sun, Apr 21, 2024 at 6:16 PM Matt Topol  wrote:
> >
> > First, I forgot a flag in my examples. There should also be an
> > ARROW_FLAG_SCALAR too!
> >
> > The motivation for this distinction came up from discussions during
> adding
> > support for ArrowDeviceArray to libcudf in order to better indicate the
> > difference between a cudf::table and a cudf::column which are handled
> quite
> > differently. This also relates to the fact that we currently need
> external
> > context like the explicit ImportArray() and ImportRecordBatch() functions
> > since we can't determine which a given ArrowArray is on its own. In the
> > libcudf situation, it came up with what happens if you pass a non-struct
> > column to the from_arrow_device method which returns a cudf::table?
> Should
> > it error, or should it create a table with a single column?
> >
> > The other motivation for this distinction is with UDFs in an engine that
> > uses the C data interface. When dealing with queries and engines, it
> > becomes important to be able to distinguish between a record batch, a
> > column and a scalar. For example, take the expression A + B:
> >
> > If A and B have different lengths, this is invalid. unless one of
> them
> > is a Scalar. This is because Scalars are broadcastable, columns are not.
> >
> > Depending on the function in question, it could be valid to pass a struct
> > column vs a record batch with different results. It also resolves some
> > ambiguity for UDFs and processing. For instance, given a single
> ArrowArray
> > of length 1, which is a struct: Is that a Struct Colum

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> people generally find use in Arrow schemas independently of concrete data.

This makes sense.  I think we do want to encourage use of Arrow as a "type
system" even if there is no data involved.  And, given that we cannot
easily change a field's data type property to "optional" it makes sense to
use a dedicated type and I so I would be in favor of such a proposal (we
may eventually add an "unknown type" concept in Substrait as well, it's
come up several times, and so we could use this in that context).

I think that I would still prefer a canonical extension type (with storage
type null) over a new dedicated type.

On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou  wrote:

>
> Ah! Well, I think this could be an interesting proposal, but someone
> should put a more formal proposal, perhaps as a draft PR.
>
> Regards
>
> Antoine.
>
>
> > On 17/04/2024 at 11:57, David Li wrote:
> > For an unsupported/other extension type.
> >
> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> >> What is "this proposal"?
> >>
> >>
> >> On 17/04/2024 at 10:38, David Li wrote:
> >>> Should I take it that this proposal is dead in the water? While we
> could define our own Unknown/Other type for say the ADBC PostgreSQL driver
> it might be useful to have a singular type for consumers to latch on to.
> >>>
> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>  I think an "Other" extension type is slightly different than an
>  arbitrary extension type, though: the latter may be understood
>  downstream but the former represents a point at which a component
>  explicitly declares it does not know how to handle a field. In this
>  example, the PostgreSQL ADBC driver might be able to provide a
>  representation regardless, but a different driver (or say, the JDBC
>  adapter, which cannot necessarily get a bytestring for an arbitrary
>  JDBC type) may want an Other type to signal that it would fail if
> asked
>  to provide particular columns.
> 
>  On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
> > Depending where your Arrow-encoded data is used, either extension
> > types or generic field metadata are options. We have this problem in
> > the ADBC Postgres driver, where we can convert *most* Postgres types
> > to an Arrow type but there are some others where we can't or don't
> > know or don't implement a conversion. Currently for these we return
> > opaque binary (the Postgres COPY representation of the value) but put
> > field metadata so that a consumer can implement a workaround for an
> > unsupported type. It would be arguably better to have implemented
> this
> > as an extension type; however, field metadata felt like less of a
> > commitment when I first worked on this.
> >
> > Cheers,
> >
> > -dewey
> >
> > On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
> >  wrote:
> >>
> >> I was using UUID as an example. It looks like extension types
> covers my original request.
> >> 
> >> From: Felipe Oliveira Carvalho 
> >> Sent: Thursday, April 11, 2024 7:15 AM
> >> To: dev@arrow.apache.org 
> >> Subject: Re: Unsupported/Other Type
> >>
> >> The OP used UUID as an example. Would that be enough or the request
> is for
> >> a flexible mechanism that allows the creation of one-off nominal
> types for
> >> very specific use-cases?
> >>
> >> —
> >> Felipe
> >>
> >> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Yes, JSON and UUID are obvious candidates for new canonical
> extension
> >>> types. XML also comes to mind, but I'm not sure there's much of a
> use
> >>> case for it.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>  In the past we have discussed adding a canonical type for UUID
> and JSON.
> >>> I
>  still think this is a good idea and could improve ergonomics in
> >>> downstream
>  language bindings (e.g. by exposing JSON querying function or
> >>> automatically
>  boxing UUIDs in built-in UUID types, like the Python uuid
> library). Has
>  anyone done any work on this to anyone's knowledge?
> 
>  On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <
> emkornfi...@gmail.com>
>  wrote:
> 
> > Hi Norman,
> > Arrow has a concept of extension types [1] along with the
> possibility of
> > proposing new canonical extension types [2].  This seems to
> cover the
> > use-cases you mention but I might be misunderstanding?
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> >>>
> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> > [2]
> https://arrow.apache.org/docs/format/CanonicalExten

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> may want an Other type to signal that it would fail if asked to provide
particular columns.

I interpret "would fail" to mean we are still speaking in some kind of
"planning stage" and not yet actually creating arrays.  So I don't know
that this needs to be a data type.  In other words, I see this as
`std::optional` and not a unique instance of `DataType`.

However, if you did need to actually create an array, and you wanted some
way of saying "there is no data here because I failed to interpret the
type" then maybe you could create an extension type based on the null type?

On Wed, Apr 17, 2024 at 2:57 AM David Li  wrote:

> For an unsupported/other extension type.
>
> On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> > What is "this proposal"?
> >
> >
> > On 17/04/2024 at 10:38, David Li wrote:
> >> Should I take it that this proposal is dead in the water? While we
> could define our own Unknown/Other type for say the ADBC PostgreSQL driver
> it might be useful to have a singular type for consumers to latch on to.
> >>
> >> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
> >>> I think an "Other" extension type is slightly different than an
> >>> arbitrary extension type, though: the latter may be understood
> >>> downstream but the former represents a point at which a component
> >>> explicitly declares it does not know how to handle a field. In this
> >>> example, the PostgreSQL ADBC driver might be able to provide a
> >>> representation regardless, but a different driver (or say, the JDBC
> >>> adapter, which cannot necessarily get a bytestring for an arbitrary
> >>> JDBC type) may want an Other type to signal that it would fail if asked
> >>> to provide particular columns.
> >>>
> >>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>  Depending where your Arrow-encoded data is used, either extension
>  types or generic field metadata are options. We have this problem in
>  the ADBC Postgres driver, where we can convert *most* Postgres types
>  to an Arrow type but there are some others where we can't or don't
>  know or don't implement a conversion. Currently for these we return
>  opaque binary (the Postgres COPY representation of the value) but put
>  field metadata so that a consumer can implement a workaround for an
>  unsupported type. It would be arguably better to have implemented this
>  as an extension type; however, field metadata felt like less of a
>  commitment when I first worked on this.
> 
>  Cheers,
> 
>  -dewey
> 
>  On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>   wrote:
> >
> > I was using UUID as an example. It looks like extension types covers
> my original request.
> > 
> > From: Felipe Oliveira Carvalho 
> > Sent: Thursday, April 11, 2024 7:15 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: Unsupported/Other Type
> >
> > The OP used UUID as an example. Would that be enough or the request
> is for
> > a flexible mechanism that allows the creation of one-off nominal
> types for
> > very specific use-cases?
> >
> > —
> > Felipe
> >
> > On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 
> wrote:
> >
> >>
> >> Yes, JSON and UUID are obvious candidates for new canonical
> extension
> >> types. XML also comes to mind, but I'm not sure there's much of a
> use
> >> case for it.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>> On 10/04/2024 at 22:55, Wes McKinney wrote:
> >>> In the past we have discussed adding a canonical type for UUID and
> JSON.
> >> I
> >>> still think this is a good idea and could improve ergonomics in
> >> downstream
> >>> language bindings (e.g. by exposing JSON querying function or
> >> automatically
> >>> boxing UUIDs in built-in UUID types, like the Python uuid
> library). Has
> >>> anyone done any work on this to anyone's knowledge?
> >>>
> >>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <
> emkornfi...@gmail.com>
> >>> wrote:
> >>>
>  Hi Norman,
>  Arrow has a concept of extension types [1] along with the
> possibility of
>  proposing new canonical extension types [2].  This seems to cover
> the
>  use-cases you mention but I might be misunderstanding?
> 
>  Thanks,
>  Micah
> 
>  [1]
> 
> 
> >>
> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>  [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> 
>  On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>   wrote:
> 
> > Problem Description
> >
> > Currently Arrow schemas can only contain columns of types
> supported by
> > Arrow. In some cases an Arrow schema maps to an external schema.
> This
> >> can
> > resu

Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread Weston Pace
Congratulations!

On Thu, Apr 11, 2024 at 9:12 AM wish maple  wrote:

> Congrats!
>
> Best,
> Xuwei Fu
>
> > Kevin Gurney  wrote on Thu, Apr 11, 2024 at 23:22:
>
> > Congratulations, Sarah!! Well deserved!
> > 
> > From: Jacob Wujciak 
> > Sent: Thursday, April 11, 2024 11:14 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore
> >
> > Congratulations and welcome!
> >
> > On Thu, Apr 11, 2024 at 17:11, Raúl Cumplido  wrote:
> >
> > > Congratulations Sarah!
> > >
> > > On Thu, Apr 11, 2024 at 13:13, Sutou Kouhei () wrote:
> > > >
> > > > Hi,
> > > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Sarah
> > > > Gilmore has accepted an invitation to become a committer on
> > > > Apache Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > >
> >
>


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-08 Thread Weston Pace
> Probably major versions should match between C++ and PyArrow, but I guess
> we could have diverging minor and patch versions. Or at least patch
> versions given that
> a new minor version is usually cut for bug fixes too.

I believe even this would be difficult.  Stable ABIs are very finicky in
C++.  If the public API surface changes in any way then it can lead to
subtle bugs if pyarrow were to link against an older version.  I also am
not sure there is much advantage in trying to separate pyarrow from
arrow-cpp since they are almost always changing in lockstep (e.g. any
change to arrow-cpp enables functionality in pyarrow).

I think we should maybe focus on a few more obvious cases.

I think C#, JS, Java, and Go are the most obvious candidates to decouple.
Even then, we should probably only separate these candidates if they have
willing release managers.

C/GLib, python, and ruby are all tightly coupled to C++ at the moment and
should not be a first priority.  I would have guessed that R is also in
this list but Jacob reported in the original email that they are already
somewhat decoupled?

I don't know anything about swift or matlab.

On Mon, Apr 8, 2024 at 6:23 AM Alessandro Molina
 wrote:

> On Sun, Apr 7, 2024 at 3:06 PM Andrew Lamb  wrote:
>
> >
> > We have had separate releases / votes for Arrow Rust (and Arrow
> DataFusion)
> > and it has served us quite well. The version schemes have diverged
> > substantially from the monorepo (we are on version 51.0.0 in arrow-rs,
> for
> > example) and it doesn't seem to have caused any large confusion with
> users
> >
> >
> I think that versioning will require additional thinking for libraries like
> PyArrow, Java etc...
> For rust this is a non problem because there is no link to the C++ library,
>
> PyArrow instead is based on what the C++ library provides,
> so there is a direct link between the features provided by C++ in a
> specific version
> and the features provided in PyArrow at a specific version.
>
> More or less PyArrow 20 should have the same bug fixes that C++ 20 has,
> and diverging the two versions would lead to confusion easily.
> Probably major versions should match between C++ and PyArrow, but I guess
> we could have diverging minor and patch versions. Or at least patch
> versions given that
> a new minor version is usually cut for bug fixes too.
>


Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
Forgot link:

[1]
https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory

On Tue, Apr 2, 2024 at 11:38 AM Weston Pace  wrote:

> Thanks for taking the time to address my concerns.
>
> > I've split the S3/HTTP URI flight pieces out into a separate document and
> > separate thing to vote on at the request of several people who wanted to
> > view these as two separate proposals to vote on. So this vote *only*
> covers
> > adopting the protocol spec as an "Experimental Protocol" so we can start
> > seeing real world usage to help refine and improve it. That said, I
> believe
> > all clients currently would reject any non-grpc URI.
>
> Ah, I was confused and my comments were mostly about the s3/http proposal.
>
> Regarding the proposal at hand, I went through it in more detail.  I don't
> know much about ucx so I considered two different use cases:
>
>  * The previously mentioned shared memory approach.  I think this is
> compelling as people have asked about shared memory communication from time
> to time and I've always suggested flight over unix sockets though that
> forces a copy.
>  * I think this could also form the basis for large transfers of arrow
> data over a wasm boundary.  Wasm has a concept of shared memory objects[1]
> and a wasm data library could use this to stream data into javascript
> without a copy.
>
> I've added a few more questions to the doc.  Either way, if we're only
> talking about an experimental protocol / suggested recommendation then I'm
> fine voting +1 on this (I'm not sure a formal vote is even needed).  I
> would want to see at least 2 implementations if we wanted to remove the
> experimental label.
>
> On Sun, Mar 31, 2024 at 2:43 PM Joel Lubinitsky 
> wrote:
>
>> +1 to the dissociated transports proposal
>>
>> On Sun, Mar 31, 2024 at 11:14 AM David Li  wrote:
>>
>> > +1 from me as before
>> >
>> > On Thu, Mar 28, 2024, at 18:06, Matt Topol wrote:
>> > >>  There is a word doc with no implementation or PR.  I think there
>> could
>> > > be an implementation / PR.
>> > >
>> > > In the word doc there is a link to a POC implementation[1] showing
>> this
>> > > protocol working with a flight service, ucx and libcudf. The key piece
>> > here
>> > > is that we're voting on adopting this protocol spec (i.e. I'll add it
>> to
>> > > the documentation website) rather than us explicitly providing full
>> > > implementations or abstractions around it. We can provide reference
>> > > implementations like the POC, but I don't think they should be in the
>> > Arrow
>> > > monorepo or else we run the risk of a lot of the same issues that
>> Flight
>> > > has: i.e. Adding anything to Flight in C++ requires fully wrapping the
>> > > grpc/flight primitives with Arrow equivalents to export which
>> increases
>> > the
>> > > maintenance burden on us and makes it more difficult for users to
>> > leverage
>> > > the underlying knobs and dials.
>> > >
>> > >> For example, does any ADBC client respect this protocol today?  If a
>> > > flight server responds with an S3/HTTP URI will the ADBC client
>> download
>> > > the files from the correct place?  Will it at least notice that the
>> URI
>> > is
>> > > not a GRPC URI and give a "I don't have a connector for downloading
>> from
>> > > HTTP/S3" error?
>> > >
>> > > I've split the S3/HTTP URI flight pieces out into a separate document
>> and
>> > > separate thing to vote on at the request of several people who wanted
>> to
>> > > view these as two separate proposals to vote on. So this vote *only*
>> > covers
>> > > adopting the protocol spec as an "Experimental Protocol" so we can
>> start
>> > > seeing real world usage to help refine and improve it. That said, I
>> > believe
>> > > all clients currently would reject any non-grpc URI.
>> > >
>> > >>   I was speaking with someone yesterday and they explained that
>> > > they ended up not choosing Flight for an internal project because
>> Flight
>> > > didn't support something called "cloud fetch" which I have now
>> learned is
>> > >
>> > > I was reading through that link, and it seems like it's pretty much
>> > > *identical* to Flight as it current

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
pposed if a vote
> > helps)
> > >
> > > Mostly I found that the google doc was easier for iterating on the
> > protocol
> > > specification than a markdown PR for the Arrow documentation as I could
> > > more visually express things without a preview of the rendered
> markdown.
> > If
> > > it would get people to be more likely to vote on this, I can write up
> the
> > > documentation markdown now and create a PR rather than waiting until we
> > > decide we're even going to adopt this protocol as an "official" arrow
> > > protocol.
> > >
> > > Lemme know if there's any other unanswered questions!
> > >
> > > --Matt
> > >
> > > [1]: https://github.com/zeroshade/cudf-flight-ucx
> > > [2]:
> > >
> >
> https://docs.google.com/document/d/1-x7tHWDzpbgmsjtTUnVXeEO4b7vMWDHTu-lzxlK9_hE/edit#heading=h.ub6lgn7s75tq
> > >
> > > On Thu, Mar 28, 2024 at 4:53 PM Weston Pace 
> > wrote:
> > >
> > >> I'm sorry for the very late reply.  Until yesterday I had no real
> > concept
> > >> of what this was talking about and so I had stayed out.
> > >>
> > >> I'm +0 only because it isn't clear what we are voting on.  There is a
> > word
> > >> doc with no implementation or PR.  I think there could be an
> > implementation
> > >> / PR.  For example, does any ADBC client respect this protocol today?
> > If a
> > >> flight server responds with an S3/HTTP URI will the ADBC client
> download
> > >> the files from the correct place?  Will it at least notice that the
> URI
> > is
> > >> not a GRPC URI and give a "I don't have a connector for downloading
> from
> > >> HTTP/S3" error?  In general, I think we do want this in Flight (see
> > >> comments below) and I am very supportive of the idea.  However, if
> > adopting
> > >> this as an experimental proposal helps move this forward then I think
> > >> that's fine.
> > >>
> > >> That being said, I do want to express support for the proposal as a
> > >> concept, at least the "disassociated transports" portion (I can't
> speak
> > to
> > >> UCX/etc.).  I was speaking with someone yesterday and they explained
> > that
> > >> they ended up not choosing Flight for an internal project because
> Flight
> > >> didn't support something called "cloud fetch" which I have now learned
> > is
> > >> [1].  I had recalled looking at this proposal before and this person
> > seemed
> > >> interested and optimistic to know this was being considered for
> Flight.
> > >> This proposal, as I understand it, should make it possible for cloud
> > >> servers to support a cloud fetch style API.  From the discussion I got
> > the
> > >> impression that this cloud fetch approach is useful and generally
> > >> applicable.
> > >>
> > >> So a big +1 for the idea of disassociated transports but I'm not sure
> > why
> > >> we need a vote to start working on it (but I'm not opposed if a vote
> > helps)
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://www.databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html
> > >>
> > >> On Thu, Mar 28, 2024 at 1:04 PM Matt Topol 
> > wrote:
> > >>
> > >> > I'll keep this new vote open for at least the next 72 hours. As
> before
> > >> > please reply with:
> > >> >
> > >> > [ ] +1 Accept this Proposal
> > >> > [ ] +0
> > >> > [ ] -1 Do not accept this proposal because...
> > >> >
> > >> > Thanks everyone!
> > >> >
> > >> > On Wed, Mar 27, 2024 at 7:51 PM Benjamin Kietzman <
> > bengil...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > +1
> > >> > >
> > >> > > On Tue, Mar 26, 2024, 18:36 Matt Topol 
> > wrote:
> > >> > >
> > >> > > > Should I start a new thread for a new vote? Or repeat the
> original
> > >> vote
> > >> > > > email here?
> > >> > > >
> > >> > > > Just asking since there hasn't been any responses so far.
> > >> &

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Weston Pace
Wouldn't support for ADT require expressing more than 1 type id per
record?  In other words, if `put` has type id 1, `delete` has type id 2,
and `erase` has type id 3 then there is no way to express something is (for
example) both type id 1 and type id 3 because you can only have one type id
per record.

If that understanding is correct then it seems you can always encode world
2 into world 1 by exhaustively listing out the combinations.  In other
words, `put` is the LSB, `delete` is bit 2, and `erase` is bit 3 and you
have:

7 - put/delete/erase
6 - delete/erase
5 - erase/put
4 - erase
3 - put/delete
2 - delete
1 - put
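
To make the "world 2" shape concrete before any such re-encoding, here is a
minimal pyarrow sketch reusing Finn's field names — whether a given
implementation accepts two children with the same storage type is exactly the
question in this thread:

```
import pyarrow as pa

put = pa.array(["a"], pa.string())
delete = pa.array([10], pa.int64())
erase = pa.array([20], pa.int64())  # same storage type as delete, distinct type id

type_ids = pa.array([0, 1, 2], pa.int8())  # one type id per record
offsets = pa.array([0, 0, 0], pa.int32())  # index into the selected child

ops = pa.UnionArray.from_dense(
    type_ids, offsets, [put, delete, erase],
    field_names=["put", "delete", "erase"],
)
```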

On Tue, Apr 2, 2024 at 4:36 AM Finn Völkel  wrote:

> I also meant Algebraic Data Type not Abstract Data Type (too many
> acronyms).
>
> On Tue, 2 Apr 2024 at 13:28, Antoine Pitrou  wrote:
>
> >
> > Thanks. The Arrow spec does support multiple union members with the same
> > type, but not all implementations do. The C++ implementation should
> > support it, though to my surprise we do not seem to have any tests for
> it.
> >
> > If the Java implementation doesn't, then you can probably open an issue
> > for it (and even submit a PR if you would like to tackle it).
> >
> > I've also opened https://github.com/apache/arrow/issues/40947 to create
> > integration tests for this.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 02/04/2024 at 13:19, Finn Völkel wrote:
> > >> Can you explain what ADT means ?
> > >
> > > Sorry about that. ADT stands for Abstract Data Type. What do I mean by
> an
> > > ADT style vector?
> > >
> > > Let's take an example from the project I am on. We have an `op` union
> > > vector with three child vectors `put`, `delete`, `erase`. `delete` and
> > > `erase` have the same type but represent different things.
> > >
> > > On Tue, 2 Apr 2024 at 13:16, Steve Kim  wrote:
> > >
> > >> Thank you for asking this question. I have the same question.
> > >>
> > >> I noted a similar problem in the c++/python implementation:
> > >> https://github.com/apache/arrow/issues/19157#issuecomment-1528037394
> > >>
> > >> On Tue, Apr 2, 2024, 04:30 Finn Völkel  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> my question primarily concerns the union layout described at
> > >>> https://arrow.apache.org/docs/format/Columnar.html#union-layout
> > >>>
> > >>> There are two ways to use unions:
> > >>>
> > >>> - polymorphic vectors (world 1)
> > >>> - ADT style vectors (world 2)
> > >>>
> > >>> In world 1 you have a vector that stores different types. In the ADT
> > >> world
> > >>> you could have multiple child vectors with the same type but
> different
> > >> type
> > >>> ids in the union type vector. The difference is apparent if you want
> to
> > >> use
> > >>> two BigIntVectors as children which doesn't exist in world 1. World 1
> > is
> > >> a
> > >>> subset of world 2.
> > >>>
> > >>> The spec (to my understanding) doesn’t explicitly forbid world 2, but
> > the
> > >>> implementation we have been using (Java) has been making the
> assumption
> > >> of
> > >>> being in world 1 (a union only having ONE child of each type). We
> > >> sometimes
> > >>> use union in the ADT style which has led to problems down the road.
> > >>>
> > >>> Could someone clarify what the specification allows and what it
> doesn’t
> > >>> allow? Could we tighten the specification after that clarification?
> > >>>
> > >>> Best, Finn
> > >>>
> > >>
> > >
> >
>


Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread Weston Pace
Congratulations Joel!

On Mon, Apr 1, 2024 at 1:16 PM Bryce Mecum  wrote:

> Congrats, Joel!
>
> On Mon, Apr 1, 2024 at 6:59 AM Matt Topol  wrote:
> >
> > On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky
> has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> and
> > thank you for your contributions!
> >
> > --Matt
>


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-03-29 Thread Weston Pace
Thank you for bringing this up.  I'm in favor of this.  I think there are
several motivations but the main ones are:

 1. Decoupling the versions will allow components to have no release, or
only a minor release, when there are no breaking changes
 2. We do have some vote fatigue I think and we don't want to make that
more difficult.
 3. Anything we can do to ease the burden of release managers is good

If I understand what you are describing then I think it satisfies points 1
& 2.  I am not familiar enough with the release management process to speak
to #3.

> Voting in one thread on
> all components/a subset of components per voter and the surrounding
> technicalities is something I would like to hear some opinions on.

I am in favor of decoupling the version numbers.  I do think batched
quarterly releases are still a good thing to avoid vote fatigue.  Perhaps
we can have a single vote on a batch of version numbers (e.g. please vote
on the batched release containing CPP version X, Go version Y, JS version
Z).

> A more meta question is about the messaging that different versioning
> schemes carry, as it might no longer be obvious on first glance which
> versions are compatible or have the newest features.

I am not concerned about this.  One of the advantages of Arrow is that we
have a stable C ABI (C Data Interface) and a stable IPC mechanism (IPC
serialization) and this means that version compatibility is rarely a
difficulty or major concern.  Plus, regarding individual features, our
solution already requires a compatibility table (
https://arrow.apache.org/docs/status.html).  Changing the versioning
strategy will not make this any worse.

On Thu, Mar 28, 2024 at 1:42 PM Jacob Wujciak  wrote:

> Hello Everyone!
>
> I would like to resurface the discussion of separate
> versioning/releases/voting for monorepo components. We have previously
> touched on this topic mostly in the community meetings and spread across
> multiple, only tangentially related threads. I think a focused discussion can
> be a bit more results oriented, especially now that we almost regularly
> deviate from the quarterly release cadence with minor releases. My hope is
> that discussing this and adapting our process can lower the amount of work
> required and ease the pressure on our release managers (Thank you Raúl and
> Kou!).
>
> I think the base of the topic is the separate versioning for components as
> otherwise separate releases only have limited value. From a technical
> perspective standalone implementations like Go or JS are the easiest to
> handle in that regard, they can just follow their ecosystem standards,
> which has been requested by users already (major releases in Go require
> manual editing across a code base as dependencies are usually pinned to a
> major version).
>
> For Arrow C++ bindings like Arrow R and PyArrow having distinct versions
> would require additional work to both enable the use of different versions
> and ensure version compatibility is monitored and potentially updated if
> needed.
>
> For Arrow R we have already implemented these changes for different reasons
> and have backwards compatibility with  libarrow >= 13.0.0. From a user
> standpoint of PyArrow this is likely irrelevant as most users get binary
> wheels from pypi, if a user regularly builds PyArrow from source they are
> also capable of managing potentially different libarrow version
> requirements as this is already necessary to build the package just with an
> exact version match.
>
> A more meta question is about the messaging that different versioning
> schemes carry, as it might no longer be obvious on first glance which
> versions are compatible or have the newest features. Though I would argue
> that this is a marginal concern at best as there is no guarantee of feature
> parity between different components with the same version. Breaking that
> implicit expectation with separate versions could be seen as clearer. If a
> component only receives dependency bumps or minor bug fixes, releasing this
> component with a patch version aligns much better with expectations than a
> major version bump. In addition there are already several differently
> versioned libraries in the apache/arrow-* ecosystem that are released
> outside of the monorepo release process.  A proper support policy for each
> component would also be required but could just default to 'current major
> release' as it is now.
>
> From an ASF perspective there is no requirement to release the entire
> repository at once as the actual release artifact is the source tarball. As
> long as that is verified and voted on by the PMC it is an official release.
>
> This brings me to the release process and voting. I think it is pretty
> clear that completely decoupling all components and their release processes
> isn't feasible at the moment, mainly from a technical perspective
> (crossbow) and would likely also lead to vote fatigue. We have made efforts
> to ease the verificatio

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-03-28 Thread Weston Pace
I'm sorry for the very late reply.  Until yesterday I had no real concept
of what this was talking about and so I had stayed out.

I'm +0 only because it isn't clear what we are voting on.  There is a word
doc with no implementation or PR.  I think there could be an implementation
/ PR.  For example, does any ADBC client respect this protocol today?  If a
flight server responds with an S3/HTTP URI will the ADBC client download
the files from the correct place?  Will it at least notice that the URI is
not a GRPC URI and give a "I don't have a connector for downloading from
HTTP/S3" error?  In general, I think we do want this in Flight (see
comments below) and I am very supportive of the idea.  However, if adopting
this as an experimental proposal helps move this forward then I think
that's fine.

That being said, I do want to express support for the proposal as a
concept, at least the "disassociated transports" portion (I can't speak to
UCX/etc.).  I was speaking with someone yesterday and they explained that
they ended up not choosing Flight for an internal project because Flight
didn't support something called "cloud fetch" which I have now learned is
[1].  I had recalled looking at this proposal before and this person seemed
interested and optimistic to know this was being considered for Flight.
This proposal, as I understand it, should make it possible for cloud
servers to support a cloud fetch style API.  From the discussion I got the
impression that this cloud fetch approach is useful and generally
applicable.

So a big +1 for the idea of disassociated transports but I'm not sure why
we need a vote to start working on it (but I'm not opposed if a vote helps)

[1]
https://www.databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html

On Thu, Mar 28, 2024 at 1:04 PM Matt Topol  wrote:

> I'll keep this new vote open for at least the next 72 hours. As before
> please reply with:
>
> [ ] +1 Accept this Proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
> Thanks everyone!
>
> On Wed, Mar 27, 2024 at 7:51 PM Benjamin Kietzman 
> wrote:
>
> > +1
> >
> > On Tue, Mar 26, 2024, 18:36 Matt Topol  wrote:
> >
> > > Should I start a new thread for a new vote? Or repeat the original vote
> > > email here?
> > >
> > > Just asking since there hasn't been any responses so far.
> > >
> > > --Matt
> > >
> > > On Thu, Mar 21, 2024 at 11:46 AM Matt Topol 
> > > wrote:
> > >
> > > > Absolutely, it will be marked experimental until we see some people
> > using
> > > > it and can get more real-world feedback.
> > > >
> > > > There's also already a couple things that will be followed-up on
> after
> > > the
> > > > initial adoption for expansion which were discussed in the comments.
> > > >
> > > > On Thu, Mar 21, 2024, 11:42 AM David Li  wrote:
> > > >
> > > >> I think let's try again. Would it be reasonable to declare this
> > > >> 'experimental' for the time being, just as we did with Flight/Flight
> > > >> SQL/etc?
> > > >>
> > > >> On Tue, Mar 19, 2024, at 15:24, Matt Topol wrote:
> > > >> > Hey All, It's been another month and we've gotten a whole bunch of
> > > >> feedback
> > > >> > and engagement on the document from a variety of individuals.
> Myself
> > > >> and a
> > > >> > few others have proactively attempted to reach out to as many
> third
> > > >> parties
> > > >> > as we could, hoping to pull more engagement also. While it would
> be
> > > >> great
> > > >> > to get even more feedback, the comments have slowed down and we
> > > haven't
> > > >> > gotten anything in a few days at this point.
> > > >> >
> > > >> > If there's no objections, I'd like to try to open up for voting
> > again
> > > to
> > > >> > officially adopt this as a protocol to add to our docs.
> > > >> >
> > > >> > Thanks all!
> > > >> >
> > > >> > --Matt
> > > >> >
> > > >> > On Sat, Mar 2, 2024 at 6:43 PM Paul Whalen 
> > > wrote:
> > > >> >
> > > >> >> Agreed that it makes sense not to focus on in-place updating for
> > this
> > > >> >> proposal.  I’m not even sure it’s a great fit as a “general
> > purpose”
> > > >> Arrow
> > > >> >> protocol, because of all the assumptions and restrictions
> required
> > as
> > > >> you
> > > >> >> noted.
> > > >> >>
> > > >> >> I took another look at the proposal and don’t think there’s
> > anything
> > > >> >> preventing in-place updating in the future - ultimately the data
> > body
> > > >> could
> > > >> >> just be in the same location for subsequent messages.
> > > >> >>
> > > >> >> Thanks!
> > > >> >> Paul
> > > >> >>
> > > >> >> On Fri, Mar 1, 2024 at 5:28 PM Matt Topol <
> zotthewiz...@gmail.com>
> > > >> wrote:
> > > >> >>
> > > >> >> > > @pgwhalen: As a potential "end user developer," (and aspiring
> > > >> >> > contributor) this
> > > >> >> > immediately excited me when I first saw it.
> > > >> >> >
> > > >> >> > Yay! Good to hear that!
> > > >> >> >
> > > >> >> > > @pgwhalen: And it wasn't clear to me whether updating batches
> > in
> > > >> 

Re: Apache Arrow Flight - From Rust to Javascript (FlightData)

2024-03-21 Thread Weston Pace
> I don't think there is currently a direct equivalent to
> `FlightRecordBatchStream` in the arrow javascript library, but you should
> be able to combine the data header + body and then read it using the
> `fromIPC` functions since it's just the Arrow IPC format

The RecordBatchReader[1] _should_ support streaming.  That being said, I
haven't personally used it in that way.  I've been doing some JS/Rust
lately but utilizing FFI and not flight.  We convert each batch into a
complete IPC file (in an in-memory buffer) and then just use
`tableFromIPC`.  We do this here[2] (in case its at all useful).

[1]
https://arrow.apache.org/docs/js/classes/Arrow_dom.RecordBatchReader.html
[2]
https://github.com/lancedb/lancedb/blob/v0.4.13/nodejs/lancedb/query.ts#L22

On Wed, Mar 20, 2024 at 9:18 AM Matt Topol 
wrote:

> I don't think there is currently a direct equivalent to
> `FlightRecordBatchStream` in the arrow javascript library, but you should
> be able to combine the data header + body and then read it using the
> `fromIPC` functions since it's just the Arrow IPC format
>
> On Fri, Mar 15, 2024 at 5:39 AM Alexander Lerche Falk 
> wrote:
>
> > Hey Arrow Dev,
> >
> > First of all: thanks for your great work.
> >
> > Is there a way to go from the FlightData structure in Javascript back to
> > Record Batches? I have a small Rust application, implementing the
> > FlightService, and streaming the data back through the do_get method.
> >
> > Maybe something equivalent to the FlightRecordBatchStream (method in
> Rust)
> > in the arrow javascript library?
> >
> > Thanks in advance and have a nice day.
> >
> >
> > Alexander Falk
> >
> > Senior Software Architect
> >
> > +45 22 95 08 64
> >
> >
> > Backstage CPH
> >
>


Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread Weston Pace
Congratulations!

On Sun, Mar 17, 2024, 8:01 PM Jacob Wujciak  wrote:

> Congrats, well deserved!
>
> Nic Crane  schrieb am Mo., 18. März 2024, 03:24:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> and
> > thank you for your contributions!
> >
> > Nic
> >
>


Re: [DISCUSS] Looking for feedback on my Rust library

2024-03-14 Thread Weston Pace
Felipe's points are good.

I don't know that you need to adapt the entire ADBC, it sort of depends
what you're after.  I see what you've got right now as more of an SQL
abstraction layer.  For example, similar to things like [1][2][3] (though 3
is more of an ORM).  If you like the SQL interface that you've come up with
then you could add, in addition to your postgres / sqlite / etc. bindings,
an ADBC implementation.  This would adapt anything that implements ADBC to
your interface.  This way you could get, in theory, free support for
backends like flight sql or snowflake, and you could replace your duckdb /
postgres backends if you wanted.

I will pass on some feedback I received recently.  If your audience is
"regular developers" (e.g. not data engineers, people building webapps, ML
apps, etc.) then they often do not know or want to speak Arrow.  They see
it as an essential component, but one that is sort of a "database internal"
or a "data engineering thing".  For example, in the python / pyarrow world
people are happy to know that arrow data is traversing their network but,
when they want to actually work with it (e.g. display results to users),
they convert it to python lists or pandas (fortunately arrow makes this
easy).
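
A quick sketch of what that conversion looks like in pyarrow (the table
here is just a stand-in, and `to_pandas` assumes pandas is installed):

```
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

rows = table.to_pylist()  # list of plain Python dicts
df = table.to_pandas()    # pandas DataFrame
```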

For example, if you look at postgres' rust bindings you will see that
people process results like this:

```

for row in client.query("SELECT id, name, data FROM person", &[])? {
    // `get` is generic over the `FromSql` trait, so each column comes back
    // as a familiar Rust type rather than a postgres-specific value.
    let id: i32 = row.get(0);
    let name: &str = row.get(1);
    let data: Option<&[u8]> = row.get(2);

    println!("found person: {} {} {:?}", id, name, data);
}

```

The `get` method can be templated to anything implementing the `FromSql`
trait.  This lets rust devs use types they are familiar with (e.g. `&str`,
`i32`, `&[u8]`) instead of having to learn a new technology (whatever
postgres is using internally)

On the other hand, if your audience is, in fact, data engineers, then that
sort of native row-based interface is going to be too inefficient.  So there
are definitely uses for both.

[1] https://sequelize.org/v3/
[2] https://docs.rs/quaint/latest/quaint/
[3] https://www.sqlalchemy.org/

On Thu, Mar 14, 2024 at 4:19 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Two comments:
>
> ——
>
> Since this library is analogous to things like ADBC, ODBC, and JDBC, it’s
> more of a “driver” than a “connector”. This might make your life easier
> when explaining what it does.
>
> It’s not a black and white thing, but “connector” might imply networking to
> some people.
>
> I believe you delegate the networking bits of interacting with PostgreSQL
> to a Rust connector.
>
> ——
>
> This library would be more interesting if it could be a wrapper of
> language-agnostic database standards like ADBC and ODBC. The Rust compiler
> can call and expose functions that follow the C ABI — the only true code
> interface standard available on every OS/Architecture pair.
>
> This would mean that any database that exposes ADBC/ODBC can be used from
> your driver. You would still offer a rich Rust interface, but everything
> would translate to well-defined operations that vendors implement. This
> also reduces the chances of you providing things that are heavily biased
> towards the way the databases you supported initially work.
>
> —
> Felipe
>
>
>
> On Tue, 12 Mar 2024 at 09:28 Aljaž Eržen  wrote:
>
> > Hello good folks of Apache Arrow! I come looking for feedback on my
> > Rust crate connector_arrow [1], which is an Arrow database client that
> > is able to connect to multiple databases over their native protocols.
> >
> > It is very similar to ADBC, but better adapted for the Rust ecosystem,
> > as it can be compiled with plain cargo and uses established crates for
> > connecting to the databases.
> >
> > The main feedback I need is the API exposed by the library [2]. I've
> > tried to keep it minimal and it turned out much more concise than the
> > api exposed by ADBC. Have I missed important features?
> >
> > Aljaž Mur Eržen
> >
> > [1]: https://docs.rs/connector_arrow/latest/connector_arrow/
> > [2]:
> >
> https://github.com/aljazerzen/connector_arrow/blob/main/connector_arrow/src/api.rs
> >
>


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Weston Pace
+1 (binding)

On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb  wrote:

> Hello,
>
> As we have discussed[1][2] I would like to vote on the proposal to
> create a new Apache Top Level Project for DataFusion. The text of the
> proposed resolution and background document is copy/pasted below
>
> If the community is in favor of this, we plan to submit the resolution
> to the ASF board for approval with the next Arrow report (for the
> April 2024 board meeting).
>
> The vote will be open for at least 7 days.
>
> [ ] +1 Accept this Proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
> Andrew
>
> [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> [2] https://github.com/apache/arrow-datafusion/discussions/6475
>
> -- Proposed Resolution -
>
> Resolution to Create the Apache DataFusion Project from the Apache
> Arrow DataFusion Sub Project
>
> =
>
> X. Establish the Apache DataFusion Project
>
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software related to an extensible query engine
> for distribution at no charge to the public.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache DataFusion Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
>
> RESOLVED, that the Apache DataFusion Project be and hereby is
> responsible for the creation and maintenance of software
> related to an extensible query engine; and be it further
>
> RESOLVED, that the office of "Vice President, Apache DataFusion" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache DataFusion Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache DataFusion Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache DataFusion Project:
>
> * Andy Grove (agr...@apache.org)
> * Andrew Lamb (al...@apache.org)
> * Daniël Heres (dhe...@apache.org)
> * Jie Wen (jake...@apache.org)
> * Kun Liu (liu...@apache.org)
> * Liang-Chi Hsieh (vii...@apache.org)
> * Qingping Hou: (ho...@apache.org)
> * Wes McKinney(w...@apache.org)
> * Will Jones (wjones...@apache.org)
>
> RESOLVED, that the Apache DataFusion Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Arrow DataFusion sub-project; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache
> Arrow DataFusion sub-project encumbered upon the
> Apache Arrow Project are hereafter discharged.
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
> be appointed to the office of Vice President, Apache DataFusion, to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed.
> =
>
>
> ---
>
>
> Summary:
>
> We propose creating a new top level project, Apache DataFusion, from
> an existing sub project of Apache Arrow to facilitate additional
> community and project growth.
>
> Abstract
>
> Apache Arrow DataFusion[1]  is a very fast, extensible query engine
> for building high-quality data-centric systems in Rust, using the
> Apache Arrow in-memory format. DataFusion offers SQL and Dataframe
> APIs, excellent performance, built-in support for CSV, Parquet, JSON,
> and Avro, extensive customization, and a great community.
>
> [1] https://arrow.apache.org/datafusion/
>
>
> Proposal
>
> We propose creating a new top level ASF project, Apache DataFusion,
> governed initially by a subset of the Apache Arrow project’s PMC and
> committers. The project’s code is in five existing git repositories,
> currently governed by Apache Arrow which would transfer to the new top
> level project.
>
> Background
>
> When DataFusion was initially donated to the Arrow project, it did not
> have a strong enough community to stand on its own. It has since grown
> significantly, and benefited immensely from being part of Arrow and
> nurturing of the Apache Way, and now has a community strong enough to
> stand on its own and that would benefit from focused governance
> attention.
>
> The community has discussed this idea publicly for more than 6 months
> https://github.com/apache/arrow-datafusion/discussions/6475  and
> briefly on the Arrow PMC mailing list
> https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As
> of the time of this writing both had exclusively positive reactions.
>
>

Re: [ANNOUNCE] New Arrow committer: Jay Zhan

2024-02-16 Thread Weston Pace
Congrats!

On Fri, Feb 16, 2024 at 3:07 AM Raúl Cumplido  wrote:

> Congratulations!!
>
> El vie, 16 feb 2024 a las 12:02, Daniël Heres
> () escribió:
> >
> > Congratulations!
> >
> > On Fri, Feb 16, 2024, 11:33 Metehan Yıldırım <
> metehan.yildi...@synnada.ai>
> > wrote:
> >
> > > Congrats!
> > >
> > > On Fri 16. Feb 2024 at 13:26, Andrew Lamb 
> wrote:
> > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Jay Zhan
> > > > has accepted an invitation to become a committer on Apache
> > > > Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > Andrew
> > > >
> > >
>


Re: [DISC] Improve Arrow Release verification process

2024-01-21 Thread Weston Pace
+1.  There have been a few times I've attempted to run the verification
scripts.  They have failed, but I was pretty confident it was a problem
with my environment mixing with the verification script and not a problem
in the software itself, so I didn't take the time to debug the verification
script issues.  So even if there were a true issue I doubt the manual
verification process would help me catch it.

Also, most devs seem to be on fairly consistent development environments
(Ubuntu or Macbook).  So rather than spend time allowing many people to
verify Ubuntu works we could probably spend that time building extra CI
environments that  provide more coverage.

On Fri, Jan 19, 2024 at 1:49 PM Jacob Wujciak-Jens
 wrote:

> I concur, a minimally scoped verification script for the actual voting
> process without any binary verification etc. should be created. The ease in
> verifying a release will lower the burden to participate in the vote which
> is good for the community and will even be necessary if we ever want to
> increase release cadence as previously discussed.
>
> In my opinion it will also mean that the binaries are no longer part of the
> release, which will help in situations similar to the release of Python
> 3.12 just after 14.0.0 was released and lots of users were running into
> issues because there were no 14.0.0 wheels for 3.12.
>
> While it would still be nice to potentially make reproduction of CI errors
> easier by having better methods to restart a failed script, this is of much
> lower importance than improving the release process.
>
> Jacob
>
> On Fri, Jan 19, 2024 at 7:38 PM Andrew Lamb  wrote:
>
> > I would second this notion that manually running tests that are already
> > covered as part of CI as part of the release process is of (very) limited
> > value.
> >
> > While we do the same thing (compile and run some tests) as part of the
> Rust
> > release this has never caught any serious defect I am aware of and we
> only
> > run a subset of tests (e.g. not tests for integration with other arrow
> > versions)
> >
> > Reducing the burden for releases I think would benefit everyone.
> >
> > Andrew
> >
> > On Fri, Jan 19, 2024 at 1:08 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Well, if the main objective is to just follow the ASF Release
> > > guidelines, then our verification process can be simplified
> drastically.
> > >
> > > The ASF indeed just requires:
> > > """
> > > Every ASF release MUST contain one or more source packages, which MUST
> > > be sufficient for a user to build and test the release provided they
> > > have access to the appropriate platform and tools. A source release
> > > SHOULD not contain compiled code.
> > > """
> > >
> > > So, basically, if the source tarball is enough to compile Arrow on a
> > > single platform with a single set of tools, then we're ok. :-)
> > >
> > > The rest is just an additional burden that we voluntarily inflict on
> > > ourselves. *Ideally*, our continuous integration should be able to
> > > detect any build-time or runtime issue on supported platforms. There
> > > have been rare cases where some issues could be detected at release
> time
> > > thanks to the release verification script, but these are a tiny portion
> > > of all issues routinely detected in the form of CI failures. So, there
> > > doesn't seem to be a reason to believe that manual release verification
> > > is bringing significant benefits compared to regular CI.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 19/01/2024 à 11:42, Raúl Cumplido a écrit :
> > > > Hi,
> > > >
> > > > One of the challenges we have when doing a release is verification
> and
> > > voting.
> > > >
> > > > Currently the Arrow verification process is quite long, tedious and
> > > error prone.
> > > >
> > > > I would like to use this email to get feedback and user requests in
> > > > order to improve the process.
> > > >
> > > > Several things already on my mind:
> > > >
> > > > One thing that is quite annoying is that any flaky failure makes us
> > > > restart the process and possibly requires downloading everything
> > > > again. It would be great to have some kind of retry mechanism that
> > > > allows us to keep going from where it failed and doesn't have to redo
> > > > the previous successful jobs.
> > > >
> > > > We do have a bunch of flags to do specific parts but that requires
> > > > knowledge and time to go over the different flags, etcetera so the UX
> > > > could be improved.
> > > >
> > > > Based on the ASF release policy [1] in order to cast a +1 vote we
> have
> > > > to validate the source code packages but it is not required to
> > > > validate binaries locally. Several binaries are currently tested
> using
> > > > docker images and they are already tested and validated on CI. Our
> > > > documentation for release verification points to perform binary
> > > > validation. I plan to update the documentation and move it to the
> > > > official docs instead of the wiki.

Re: [DISCUSS] Semantics of extension types

2023-12-14 Thread Weston Pace
I agree engines can use their own strategy.  Requiring explicit casts is
probably ok as long as it is well documented but I think I lean slightly
towards implicitly falling back to the storage type.  I do think
people still shy away from extension types.  Adding the extension type to
an implicit cast registry is another hurdle to their use, albeit a small
one.

Substrait has a similar consideration for extension types.  They can be
declared "inherits" (meaning the storage type can be used implicitly in
compute functions) or "separate" (meaning the storage type cannot be used
implicitly in compute functions).  This would map nicely to an Arrow
metadata field.

Unfortunately, I think the truth is more nuanced than a simple
separate/inherits flag.  Take UUID for example (everyone's favorite fixed
size binary extension type).  We would definitely want to implicitly reuse
the hash, equality, and sorting functions.

However, for other functions it gets trickier.  Imagine you have a
`replace_slice` function.  Should it return a new UUID (change some bytes
in a UUID and you have a new UUID) or not (once you start changing bytes in
a UUID you no longer have a UUID)?  Or what if there were a `slice`
function?  It should either be prohibited (you can't slice a UUID) or
should return a fixed size binary string (you can still slice it, but you
no longer have a UUID).

Given the complication I think users will always need to carefully consider
each use of an extension function no matter how smart a system is.  I'm not
convinced any metadata exists that could define the right approach in a
consistent number of cases.  This means our choice is whether we force
users to explicitly declare each such decision or we just trust that they
are doing the proper consideration when they design their plan.  I'm not
sure there is a right answer.  One can point to the vast diversity of ways
that programming languages have approached implicit vs explicit integer
casts.

My last concern is that we rely on compute functions in operators other
than project/filter.  For example, to use a column as a key for a hash-join
we need to be able to compute the hash value and calculate equality.  To
use a column as a key for sorting we need an ordering function.  These are
places where it might be unexpected for users to insert explicit casts.  An
engine would need to make sure the error message in these cases was very
clear.
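
To make the UUID example a bit more concrete, here is a minimal pyarrow
sketch (the "example.uuid" name is made up and this isn't tied to any
particular engine).  The `.storage` accessor at the end is what an
explicit fallback to the storage type looks like; an implicit cast
registry would essentially automate that step:

```
import uuid
import pyarrow as pa


class UuidType(pa.ExtensionType):
    """Toy UUID extension type backed by fixed_size_binary(16) storage."""

    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


storage = pa.array([uuid.uuid4().bytes for _ in range(3)], pa.binary(16))
uuids = pa.ExtensionArray.from_storage(UuidType(), storage)

# Compute kernels that know nothing about the extension type can still be
# run on the raw fixed-size-binary storage (hashing, equality, sorting).
raw = uuids.storage
```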

On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou  wrote:

>
> Hi,
>
> For now, I would suggest that each implementation decides on their own
> strategy, because we don't have a clear idea of which is better (and
> extension types are probably not getting a lot of use yet).
>
> Regards
>
> Antoine.
>
>
> Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > The main problem I see with adding properties to ExtensionType is I'm not
> > sure where that information would reside. Allowing type authors to
> declare
> > in which ways the type is equivalent (or not) to its storage is
> appealing,
> > but it seems to need an official extension field like
> > ARROW:extension:semantics. Otherwise I think each extension type's
> > semantics would need to be maintained within every implementation as well
> > as in a central reference (probably in Columnar.rst), which seems
> > unreasonable to expect of extension type authors. I'm also skeptical that
> > useful information could be packed into an ARROW:extension:semantics
> field;
> > even if the type can declare that ordering-as-with-storage is invalid I
> > don't think it'd be feasible to specify the correct ordering.
> >
> > If we cannot attach this information to extension types, the question
> > becomes which defaults are most reasonable for engines and how can the
> > engine most usefully be configured outside those defaults. My own
> > preference would be to refuse operations other than selection or
> > casting-to-storage, with a runtime extensible registry of allowed
> implicit
> > casts. This will allow users of the engine to configure their extension
> > types as they need, and the error message raised when an implicit
> > cast-to-storage is not allowed could include the suggestion to register
> the
> > implicit cast. For applications built against a specific engine, this
> > approach would allow recovering much of the advantage of attaching
> > properties to an ExtensionType by including registration of implicit
> casts
> > in the ExtensionType's initialization.
> >
> > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
> > wrote:
> >
> >> Hello all,
> >>
> >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting
> from
> >> any extension type to its storage type in acero. This raises questions
> >> about the validity of applying operations to an extension array's
> storage.
> >> For example, some extension type authors may intend different ordering
> for
> >> arrays of their new type than would be applied to the array's storage or
> >> may not inte

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Weston Pace
+1 (binding)

On Fri, Dec 8, 2023 at 1:43 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> On Fri, Dec 8, 2023 at 1:27 PM Antoine Pitrou  wrote:
> >
> > +1 (binding)
> >
> >
> > Le 08/12/2023 à 20:42, David Li a écrit :
> > > Let's start a formal vote just so we're on the same page now that
> we've discussed a few things.
> > >
> > > I would like to propose we remove 'experimental' from Flight SQL and
> make it stable:
> > >
> > > - Remove the 'experimental' option from the Protobuf definitions (but
> leave the option definition for future additions)
> > > - Update specifications/documentation/implementations to no longer
> refer to Flight SQL as experimental, and describe what stable means (no
> backwards-incompatible changes)
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1
> > > [ ] +0
> > > [ ] -1 Keep Flight SQL experimental because...
> > >
> > > On Fri, Dec 8, 2023, at 13:37, Weston Pace wrote:
> > >> +1
> > >>
> > >> On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb 
> wrote:
> > >>>
> > >>>> I agree it is time to "promote" ArrowFlightSQL to the same level as
> other
> > >>>> standards in Arrow
> > >>>>
> > >>>> Now that it is used widely (we use and count on it too at
> InfluxData) I
> > >>>> agree it makes sense to remove the experimental label from the
> overall
> > >>>> spec.
> > >>>>
> > >>>> It would make sense to leave experimental / caveats on any places
> (like
> > >>>> extension APIs) that are likely to change
> > >>>>
> > >>>> Andrew
> > >>>>
> > >>>> On Fri, Dec 8, 2023 at 11:39 AM David Li 
> wrote:
> > >>>>
> > >>>>> Yes, I think we can continue marking new features (like the bulk
> > >>>>> ingest/session proposals) as experimental but remove it from
> anything
> > >>>>> currently in the spec.
> > >>>>>
> > >>>>> On Fri, Dec 8, 2023, at 11:36, Laurent Goujon wrote:
> > >>>>>> I'm the author of the initial pull request which triggered the
> > >>>>> discussion.
> > >>>>>> I was focusing first on the comment in Maven pom.xml files which
> show
> > >>>> up
> > >>>>> in
> > >>>>>> Maven Central and other places, and which got some people confused
> > >>>> about
> > >>>>>> the state of the driver/code. IMHO this would apply to the current
> > >>>>>> Flight/Flight SQL protocol and code as it is today. Protocol
> > >>> extensions
> > >>>>>> should be still deemed experimental if still in their incubating
> > >>> phase?
> > >>>>>>
> > >>>>>> Laurent
> > >>>>>>
> > >>>>>> On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield <
> > >>> emkornfi...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> This applies to mostly existing APIs (e.g. recent additions are
> > >>> still
> > >>>>>>> experimental)? Or would it apply to everything going forward?
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Micah
> > >>>>>>>
> > >>>>>>> On Thu, Dec 7, 2023 at 2:25 PM David Li 
> > >>> wrote:
> > >>>>>>>
> > >>>>>>>> Yes, we'd update the docs, the Protobuf definitions, and
> anything
> > >>>> else
> > >>>>>>>> referring to Flight SQL as experimental.
> > >>>>>>>>
> > >>>>>>>> On Thu, Dec 7, 2023, at 17:14, Joel Lubinitsky wrote:
> > >>>>>>>>> The message types defined in FlightSql.proto are all marked
> > >>>>>>> experimental
> > >>>>>>>> as
> > >>>>>>>>> well. Would this include changes to any of those?
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Dec 7, 2023 at 16:43 Laurent Goujon
> > >>>>>  > >>>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> we have been using it with Dremio for a while now, and we
> > >>>> consider
> > >>>>> it
> > >>>>>>>>>> stable
> > >>>>>>>>>>
> > >>>>>>>>>> +1 (not binding)
> > >>>>>>>>>>
> > >>>>>>>>>> Laurent
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Dec 6, 2023 at 4:52 PM Matt Topol
> > >>>>>>>  > >>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> +1, I agree with everyone else
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Dec 6, 2023 at 7:49 PM James Duong
> > >>>>>>>>>>>  wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> +1 from me. It's used in a good number of databases now.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Get Outlook for Android<https://aka.ms/AAb9ysg>
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> From: David Li 
> > >>>>>>>>>>>> Sent: Wednesday, December 6, 2023 9:59:54 AM
> > >>>>>>>>>>>> To: dev@arrow.apache.org 
> > >>>>>>>>>>>> Subject: [DISCUSS] Flight SQL as experimental
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Flight SQL has been marked 'experimental' since the
> > >>>> beginning.
> > >>>>>>> Given
> > >>>>>>>>>> that
> > >>>>>>>>>>>> it's now used by a few systems for a few years now, should
> > >>> we
> > >>>>>>> remove
> > >>>>>>>>>> this
> > >>>>>>>>>>>> qualifier? I don't expect us to be making breaking changes
> > >>>>>>> anymore.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This came up in a GitHub PR:
> > >>>>>>>>>> https://github.com/apache/arrow/pull/39040
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -David
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
>


Re: [DISCUSS] Flight SQL as experimental

2023-12-08 Thread Weston Pace
+1

On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield 
wrote:

> +1
>
> On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb  wrote:
>
> > I agree it is time to "promote" ArrowFlightSQL to the same level as other
> > standards in Arrow
> >
> > Now that it is used widely (we use and count on it too at InfluxData) I
> > agree it makes sense to remove the experimental label from the overall
> > spec.
> >
> > It would make sense to leave experimental / caveats on any places (like
> > extension APIs) that are likely to change
> >
> > Andrew
> >
> > On Fri, Dec 8, 2023 at 11:39 AM David Li  wrote:
> >
> > > Yes, I think we can continue marking new features (like the bulk
> > > ingest/session proposals) as experimental but remove it from anything
> > > currently in the spec.
> > >
> > > On Fri, Dec 8, 2023, at 11:36, Laurent Goujon wrote:
> > > > I'm the author of the initial pull request which triggered the
> > > discussion.
> > > > I was focusing first on the comment in Maven pom.xml files which show
> > up
> > > in
> > > > Maven Central and other places, and which got some people confused
> > about
> > > > the state of the driver/code. IMHO this would apply to the current
> > > > Flight/Flight SQL protocol and code as it is today. Protocol
> extensions
> > > > should be still deemed experimental if still in their incubating
> phase?
> > > >
> > > > Laurent
> > > >
> > > > On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > >> This applies to mostly existing APIs (e.g. recent additions are
> still
> > > >> experimental)? Or would it apply to everything going forward?
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> On Thu, Dec 7, 2023 at 2:25 PM David Li 
> wrote:
> > > >>
> > > >> > Yes, we'd update the docs, the Protobuf definitions, and anything
> > else
> > > >> > referring to Flight SQL as experimental.
> > > >> >
> > > >> > On Thu, Dec 7, 2023, at 17:14, Joel Lubinitsky wrote:
> > > >> > > The message types defined in FlightSql.proto are all marked
> > > >> experimental
> > > >> > as
> > > >> > > well. Would this include changes to any of those?
> > > >> > >
> > > >> > > On Thu, Dec 7, 2023 at 16:43 Laurent Goujon
> > >  > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > >> we have been using it with Dremio for a while now, and we
> > consider
> > > it
> > > >> > >> stable
> > > >> > >>
> > > >> > >> +1 (not binding)
> > > >> > >>
> > > >> > >> Laurent
> > > >> > >>
> > > >> > >> On Wed, Dec 6, 2023 at 4:52 PM Matt Topol
> > > >>  > > >> > >
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > +1, I agree with everyone else
> > > >> > >> >
> > > >> > >> > On Wed, Dec 6, 2023 at 7:49 PM James Duong
> > > >> > >> >  wrote:
> > > >> > >> >
> > > >> > >> > > +1 from me. It's used in a good number of databases now.
> > > >> > >> > >
> > > >> > >> > > Get Outlook for Android
> > > >> > >> > > 
> > > >> > >> > > From: David Li 
> > > >> > >> > > Sent: Wednesday, December 6, 2023 9:59:54 AM
> > > >> > >> > > To: dev@arrow.apache.org 
> > > >> > >> > > Subject: [DISCUSS] Flight SQL as experimental
> > > >> > >> > >
> > > >> > >> > > Flight SQL has been marked 'experimental' since the
> > beginning.
> > > >> Given
> > > >> > >> that
> > > >> > >> > > it's now used by a few systems for a few years now, should
> we
> > > >> remove
> > > >> > >> this
> > > >> > >> > > qualifier? I don't expect us to be making breaking changes
> > > >> anymore.
> > > >> > >> > >
> > > >> > >> > > This came up in a GitHub PR:
> > > >> > >> https://github.com/apache/arrow/pull/39040
> > > >> > >> > >
> > > >> > >> > > -David
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread Weston Pace
Congratulations Felipe!

On Thu, Dec 7, 2023 at 8:38 AM wish maple  wrote:

> Congrats Felipe!!!
>
> Best,
> Xuwei Fu
>
> Benjamin Kietzman  于2023年12月7日周四 23:42写道:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira
> > Carvalho
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > Ben Kietzman
> >
>


Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread Weston Pace
Congrats Andy!

On Mon, Nov 27, 2023, 7:31 PM wish maple  wrote:

> Congrats Andy!
>
> Best,
> Xuwei Fu
>
> Andrew Lamb  于2023年11月27日周一 20:47写道:
>
> > I am pleased to announce that the Arrow Project has a new PMC chair and
> VP
> > as per our tradition of rotating the chair once a year. I have resigned
> and
> > Andy Grove was duly elected by the PMC and approved unanimously by the
> > board.
> >
> > Please join me in congratulating Andy Grove!
> >
> > Thanks,
> > Andrew
> >
>


Re: [ANNOUNCE] New Arrow committer: James Duong

2023-11-17 Thread Weston Pace
Congratulations James

On Fri, Nov 17, 2023 at 6:07 AM Metehan Yıldırım <
metehan.yildi...@synnada.ai> wrote:

> Congratulations!
>
> On Thu, Nov 16, 2023 at 10:45 AM Sutou Kouhei  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that James Duong
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > --
> > kou
> >
> >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Weston Pace
Congratulations Raúl!

On Mon, Nov 13, 2023 at 1:34 PM Ben Harkins 
wrote:

> Congrats, Raúl!!
>
> On Mon, Nov 13, 2023 at 4:30 PM Bryce Mecum  wrote:
>
> > Congrats, Raúl!
> >
> > On Mon, Nov 13, 2023 at 10:28 AM Andrew Lamb 
> > wrote:
> > >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Raúl Cumplido  to become a PMC member and we are pleased to announce
> > > that  Raúl Cumplido has accepted.
> > >
> > > Please join me in congratulating them.
> > >
> > > Andrew
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Weston Pace
+1 for the original proposal as well.

---

The (minor) problem I see with flags is that there isn't much point to this
feature if you are gating on a flag.  I'm assuming the goal is what Dewey
originally mentioned which is making buffer calculations easier.  However,
if you're gating the feature with a flag then you are either:

 * Rejecting input from producers that don't support this feature
(undesirable, better to align on one use model if we can)
 * Doing all the work anyways to handle producers that don't support the
feature

Maybe it makes sense for a long term migration (e.g. we all agree this is
something we want to move towards but we need to handle old producers in
the meantime) but we can always discuss that separately and I don't think
the benefit here is worth the confusion.

On Tue, Nov 7, 2023 at 7:46 AM Will Jones  wrote:

> I agree with the approach originally proposed by Ben. It seems like the
> most straightforward way to implement within the current protocol.
>
> On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
>  wrote:
>
> > In the absence of a general solution to the C data interface omitting
> > buffer sizes, I think the original proposal is the best way
> > forward...this is the first type to be added whose buffer sizes cannot
> > be calculated without looping over every element of the array; the
> > buffer sizes are needed to efficiently serialize the imported array to
> > IPC if imported by a consumer that cares about buffer sizes.
> >
> > Using a schema's flags to indicate something about a specific paired
> > array (particularly one that, if misinterpreted, would lead to a
> > crash) is a precedent that is probably not worth introducing for just
> > one type. Currently a schema is completely independent of any
> > particular ArrowArray, and I think that is a feature that is worth
> > preserving. My gripes about not having buffer sizes on the CPU to more
> > efficiently copy between devices is a concept almost certainly better
> > suited to the ArrowDeviceArray struct.
> >
> > On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman 
> > wrote:
> > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value.
> > >
> > > It seems to me that ignoring unknown flags is the primary case to
> > consider
> > > at
> > > this point, since consumers may ignore unknown flags. Since that is the
> > > case,
> > > it seems adding any flag which would break such a consumer would be
> > > tantamount to an ABI breakage. I don't think this can be averted unless
> > all
> > > consumers are required to error out on unknown flag values.
> > >
> > > In the specific case of Utf8View it seems certain that consumers would
> > add
> > > support for the buffer sizes flag simultaneously with adding support
> for
> > the
> > > new type (since Utf8View is difficult to import otherwise), so any
> > consumer
> > > which would error out on the new flag would already be erroring out on
> an
> > > unsupported data type.
> > >
> > > > I might be the only person who has implemented
> > > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > > schema's flag value
> > >
> > > I think passing a schema's flag value including unknown flags is an
> > error.
> > > The ABI defines moving structures but does not define deep copying. I
> > think
> > > in order to copy deeply in terms of operations which *are* specified:
> we
> > > import then export the schema. Since this includes an export step, it
> > > should not
> > > include flags which are not supported by the exporter.
> > >
> > > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > > >> Is this buffer lengths buffer only present if the array type is
> > > > Utf8View?
> > > > >
> > > > > IIUC, the proposal would add the buffer lengths buffer for all
> types
> > if
> > > > the
> > > > > schema's
> > > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> > avoid
> > > > > the special case and that `n_buffers` would continue to be
> consistent
> > > > with
> > > > > IPC.
> > > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value. We haven't specified that unknown flag values should be
> > > > ignored, so a consumer could judiciously choose to error out instead
> of
> > > > potentially misinterpreting the data.
> > > >
> > > > All in all, personally I'd rather we make a special case for Utf8View
> > > > instead of adding a flag that can lead to worse interoperability.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Weston Pace
Is this buffer lengths buffer only present if the array type is Utf8View?
Or are you suggesting that other types might want to adopt this as well?

On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington
 wrote:

> > I expect C code to not be much longer then this :-)
>
> nanoarrow's buffer-length-calculation and validation concepts are
> (perhaps inadvisably) intertwined...even with both it is not that much
> code (perhaps I was remembering how much time it took me to figure out
> which 35 lines to write :-))
>
> > That sounds a bit hackish to me.
>
> Including only *some* buffer sizes in array->buffers[array->n_buffers]
> special-cased for only two types (or altering the number of buffers
> required by the IPC format vs. the number of buffers required by the C
> Data interface) seem equally hackish to me (not that I'm opposed to
> either necessarily...the alternatives really are very bad).
>
> > How can you *not* care about buffer sizes, if you for example need to
> send the buffers over IPC?
>
> I think IPC is the *only* operation that requires that information?
> (Other than perhaps copying to another device?) I don't think there's
> any barrier to accessing the content of all the array elements but I
> could be mistaken.
>
> On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou  wrote:
> >
> >
> > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > > The lack of buffer sizes is something that has come up for me a few
> > > times working with nanoarrow (which dedicates a significant amount of
> > > code to calculating buffer sizes, which it uses to do validation and
> > > more efficient copying).
> >
> > By the way, this is a bit surprising since it's really 35 lines of code
> > in C++ currently:
> >
> >
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
> >
> > I expect C code to not be much longer then this :-)
> >
> > Regards
> >
> > Antoine.
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Weston Pace
Congratulations Xuwei!

On Mon, Oct 23, 2023 at 3:38 AM wish maple  wrote:

> Thanks kou and every nice person in arrow community!
>
> I've learned a lot during learning and contribution to arrow and
> parquet. Thanks for everyone's help.
> Hope we can bring more fancy features in the future!
>
> Best,
> Xuwei Fu
>
> Sutou Kouhei  于2023年10月23日周一 12:48写道:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > --
> > kou
> >
>


Re: Apache Arrow file format

2023-10-21 Thread Weston Pace
> Of course, what I'm really asking for is to see how Lance would compare
;-)

> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).

Disclaimer: I work for LanceDb and am in no way an unbiased party.
However, since you asked:

TL;DR: Lance performs 10-20x better than ORC or Parquet when retrieving a
small scattered selection of rows from a large dataset.

I went ahead and reproduced the experiment in the second paper using
Lance.  Specifically, a vector search for 10 elements against the first 100
million rows of the laion 5b dataset.  There were a few details missing in
the paper (specifically around what index they used) and I ended up
training a rather underwhelming index but the performance of the index is
unrelated to the file format and so irrelevant for this discussion anyways.

Vector searches perform a CPU intensive index search against a relatively
small index (in the paper this index was kept in memory and so I did the
same for my experiment).  This identifies the rows of interest.  We then
need to perform a take operation to select those rows from storage.  This
is the part where the file format matters. So all we are really measuring
here is how long it takes to select N rows at random from a dataset.  This
is one of the use cases Lance was designed for and so it is no surprise
that it performs better.
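
Purely as an illustration, the operation being timed here, a random-access
take by row index, can be sketched with pyarrow datasets.  The path and
row count are placeholders, this is not the actual harness, and the vector
index is ignored entirely:

```
import numpy as np
import pyarrow.dataset as ds

# Placeholder path to a local Parquet copy of the data
dataset = ds.dataset("laion_subset/", format="parquet")

# Pretend the index search returned these 10 row ids out of 100M rows
rng = np.random.default_rng(42)
indices = np.sort(rng.choice(100_000_000, size=10, replace=False))

# Random access like this is what dominates the measured query latency
rows = dataset.take(indices)
```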

Note that Lance stores data uncompressed.  However, it probably doesn't
matter in this case.  100 million rows of Laion 5B requires ~320GB.  Only
20GB of this is metadata.  The remaining 300GB is text & image embeddings.
These embeddings are, by design, not very compressible.  The entire lance
dataset required 330GB.

# Results:

The chart in the paper is quite small and uses a log scale.  I had to infer
the performance numbers for parquet & orc as best I could.  The numbers for
lance are accurate as that is what I measured.  These results are averaged
from 64 randomized queries (each iteration ran a single query to return 10
results) with the kernel's disk cache cleared (same as the paper I believe).

## S3:

Parquet: ~12,000ms
Orc: ~80,000ms
Lance: 1,696ms

## Local Storage (SSD):

Parquet: ~800ms
Orc: ~65ms (10ms spent searching index)
Lance: 61ms (59ms spent searching index)

At first glance it may seem like Lance performs about the same as Orc with
an SSD.  However, this is likely because my index was suboptimal (I did not
spend any real time tuning it since I could just look at the I/O times
directly).  The lance format spent only 2ms on I/O compared with ~55ms
spent on I/O by Orc.

# Boring Details:

Index: IVF/PQ with 1000 IVF partitions (probably should have been 10k
partitions but I'm not patient enough) and 96 PQ subvectors (1 byte per
subvector)
Hardware: Tests were performed on an r6id.2xlarge (using the attached NVME
for the SSD tests) in the same region as the S3 storage

Minor detail: The embeddings provided with the laion 5b dataset (clip
vit-l/14) were provided as float16.  Lance doesn't yet support float16 and
so I inflated these to float32 (that doubles the amount of data retrieved
so, if anything, it's just making things harder on lance)

On Thu, Oct 19, 2023 at 9:55 AM Aldrin  wrote:

> And the first paper's reference of arrow (in the references section) lists
> 2022 as the date of last access.
>
> Sent from Proton Mail  for iOS
>
>
> On Thu, Oct 19, 2023 at 18:51, Aldrin  > wrote:
>
> For context, that second referenced paper has Wes McKinney as a co-author,
> so they were much better positioned to say "the right things."
>
> Sent from Proton Mail  for iOS
>
>
> On Thu, Oct 19, 2023 at 18:38, Jin Shang  > wrote:
>
> Honestly I don't understand why this VLDB paper [1] chooses to include
> Feather in their evaluations. This paper studies OLAP DBMS file formats.
> Feather is clearly not optimized for the workload and performs badly in
> most of their benchmarks. This paper also has several inaccurate or
> outdated claims about Arrow, e.g. Arrow has no run length encoding, Arrow's
> dictionary encoding only supports string types (Table 3 and 5), Feather is
> Arrow plus dictionary encoding and compression (Section 3.2) etc. Moreover,
> the two optimizations it proposes for Arrow (in Section 8.1.1 and 8.1.3)
> are actually just two new APIs for working with Arrow data that require no
> change to the Arrow format itself. I fear that this paper may actually
> discourage DB people from using Arrow as their *in memory *format, even
> though it's the Arrow *file* format that performs badly for their workload.
>
> There is another paper "An Empirical Evaluation of Columnar Storage
> Formats" [2] covering essentially the same topic. It however chooses not to
> evaluate Arrow (in Section 2) because "Arrow is not meant for long-term
> disk storage", citing Wes McKinney's blog post [3] from six years

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-15 Thread Weston Pace
Congratulations Jon!

On Sun, Oct 15, 2023, 1:51 PM Neal Richardson 
wrote:

> Congratulations!
>
> On Sun, Oct 15, 2023 at 1:35 PM Bryce Mecum  wrote:
>
> > Congratulations, Jon!
> >
> > On Sat, Oct 14, 2023 at 9:24 AM Andrew Lamb 
> wrote:
> > >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Jonathan Keane to become a PMC member and we are pleased to announce
> > > that Jonathan Keane has accepted.
> > >
> > > Congratulations and welcome!
> > >
> > > Andrew
> >
>


Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread Weston Pace
Congratulations!

On Sun, Oct 15, 2023, 8:51 AM Gang Wu  wrote:

> Congrats!
>
> On Sun, Oct 15, 2023 at 10:49 PM David Li  wrote:
>
> > Congrats & welcome Curt!
> >
> > On Sun, Oct 15, 2023, at 09:03, wish maple wrote:
> > > Congratulations!
> > >
> > > Raúl Cumplido  于2023年10月15日周日 20:48写道:
> > >
> > >> Congratulations and welcome!
> > >>
> > >> El dom, 15 oct 2023, 13:57, Ian Cook  escribió:
> > >>
> > >> > Congratulations Curt!
> > >> >
> > >> > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb 
> > wrote:
> > >> >
> > >> > > On behalf of the Arrow PMC, I'm happy to announce that Curt
> > Hagenlocher
> > >> > > has accepted an invitation to become a committer on Apache
> > >> > > Arrow. Welcome, and thank you for your contributions!
> > >> > >
> > >> > > Andrew
> > >> > >
> > >> >
> > >>
> >
>


Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Weston Pace
> I feel the broader question here is what is Arrow's intended use case -
interchange or execution

The line between interchange and execution is not always clear.  For
example, I think we would like Arrow to be considered as a standard for UDF
libraries.

On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt  wrote:

> For the index vs pointer question - DuckDB went with pointers as they are
> more flexible, and DuckDB was designed to consume data (and strings) from a
> wide variety of formats in a wide variety of languages. Pointers allows us
> to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> etc. The flip side of pointers is that ownership has to be handled very
> carefully. Our vector format is an execution-only format, and never leaves
> the internals of the engine. This greatly simplifies ownership as we are in
> complete control of what happens inside the engine. For an interchange
> format that is intended for handing data between engines, I can see this
> being more complicated and having verification being more important.
>
> As for the actual change:
>
> From an interchange perspective from DuckDB's side - the proposed
> zero-copy integration would definitely speed up the conversion of DuckDB
> string vectors to Arrow string vectors. In a recent benchmark that we have
> performed we have found string conversion to Arrow vectors to be a
> bottleneck in certain workloads, although we have not sufficiently
> researched if this could be improved in other ways. It is possible this can
> be alleviated without requiring changes to Arrow.
>
> However - in general, a new string vector format is only useful if
> consumers also support the format. If the consumer immediately converts the
> strings back into the standard Arrow string representation then there is no
> benefit. The change will only move where the conversion happens (from
> inside DuckDB to inside the consumer). As such, this change is only useful
> if the broader Arrow ecosystem moves towards supporting the new string
> format.
>
> From an execution perspective from DuckDB's side - it is unlikely that we
> will switch to using Arrow as an internal format at this stage of the
> project. While this change increases Arrow's utility as an intermediate
> execution format, that is more relevant to projects that are currently
> using Arrow in this manner or are planning to use Arrow in this manner.
>
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution - as they are opposed in this discussion. This
> change improves Arrow's utility as an execution format at the expense of
> more stability in the interchange format. From my perspective Arrow is more
> useful as an interchange format. When different tools communicate with each
> other a standard is required. An execution format is generally not exposed
> outside of the internals of the execution engine. Engines can do whatever
> they want here - and a standard is perhaps not as useful.
>
> On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > I don't think "we have to adjust the Arrow format so that existing
> > > internal representations become Arrow-compliant without any
> > > (re-)implementation effort" is a reasonable design principle.
> >
> > I agree with this statement from Antoine -- given the Arrow community has
> > standardized an addition to the format with StringView, I think it would
> > help to get some input from those at DuckDB and Velox on their
> perspective
> >
> > Andrew
> >
> >
> >
> >
> > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Oh I'm with you on it being a precedent we want to be very careful
> about
> > > setting, but if there isn't a meaningful performance difference, we may
> > > be able to sidestep that discussion entirely.
> > >
> > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > >
> > > > Even if performance were significant better, I don't think it's a
> good
> > > > enough reason to add these representations to Arrow. By construction,
> > > > a standard cannot continuously chase the performance state of art, it
> > > > has to weigh the benefits of performance improvements against the
> > > > increased cost for the ecosystem (for example the cost of adapting to
> > > > frequent standard changes and a growing standard size).
> > > >
> > > > We have extension types which could reasonably be used for
> > > > non-standard data types, especially the kind that are motivated by
> > > > leading-edge performance research and innovation and come with
> unusual
> > > > constraints (such as requiring trusting and dereferencing raw
> pointers
> > > > embedded in data buffers). There could even be an argument for making
> > > > some of them canonical extension types if there's enough anteriority
> > > > in favor.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > > >> I think what would really help would be some

Re: [Discuss][C++] A framework for contextual/implicit/ambient vars

2023-08-24 Thread Weston Pace
In other languages I have seen this called "async local"[1][2][3].  I'm not
sure of any C++ implementations.  Folly's fibers claim to have fiber-local
variables[4] but I can't find the actual code to use them.  I can't seem to
find reference to the concept in boost's asio or cppcoro.

I've also seen libraries that had higher level contexts.  For example, when
working with HTTP servers there is often a pattern to have "request
context" (e.g. you'll often find things like the IP address of the
requestor, the identity of the calling user, etc.)  I believe these are
typically built on top of thread-local or async-local variables.

I think some kind of async-local concept is a good idea.  We already do
some of this context propagation for open telemetry if I recall correctly.

> - TaskHints (though that doesn't seem to be used currently?)

I'm not aware of any usage of this.  My understanding is that this was
originally intended to provide hints to some kind of scheduler on how to
prioritize a task.  I think this concept can probably be removed.

[1]
https://learn.microsoft.com/en-us/dotnet/api/system.threading.asynclocal-1?view=net-7.0
[2] https://docs.rs/async-local/latest/async_local/
[3] https://nodejs.org/api/async_context.html
[4] https://github.com/facebook/folly/blob/main/folly/fibers/README.md
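
For illustration, here is a tiny sketch of the behavior being described,
using Python's `contextvars` (the facility mentioned in the quoted message
below): a value set by the caller is visible inside a spawned task without
being passed explicitly through every call.

```
import asyncio
import contextvars

# An "async local": each task sees the value that was current when the
# task was created, with no explicit parameter threading.
request_id = contextvars.ContextVar("request_id", default="<unset>")


async def do_work():
    print("task sees request_id =", request_id.get())


async def handle_request(rid):
    request_id.set(rid)
    # asyncio.create_task copies the current context, so do_work sees rid.
    await asyncio.create_task(do_work())


asyncio.run(handle_request("abc-123"))
```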



On Thu, Aug 24, 2023 at 6:22 AM Antoine Pitrou  wrote:

>
> Hello,
>
> Arrow C++ comes with execution facilities (such as thread pools, async
> generators...) meant to unlock higher performance by hiding IO latencies
> and exploiting several CPU cores. These execution facilities also
> obscure the context in which a task is executed: you cannot simply use
> local, global or thread-local variables to store ancillary parameters.
>
> Over the years we have started adding optional metadata that can be
> associated with tasks:
>
> - StopToken
> - TaskHints (though that doesn't seem to be used currently?)
> - some people have started to ask about IO tags:
> https://github.com/apache/arrow/issues/37267
>
> However, any such additional metadata must currently be explicitly
> passed to all tasks that might make use of them.
>
> My questions are thus:
>
> - do we want to continue using the explicit passing style?
> - on the contrary, do we want to switch to a paradigm where those, once
> set, are propagated implicitly along the task dependency flow (e.g. from
> the caller of Executor::Submit to the task submitted)
> - are there useful or insightful precedents in the C++ ecosystem?
>
> (note: a similar facility in Python is brought by "context vars":
> https://docs.python.org/3/library/contextvars.html)
>
> Regards
>
> Antoine.
>


Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread Weston Pace
+1

Thanks to all for the discussion and thanks to Ben for all of the great
work.


On Mon, Aug 21, 2023 at 9:16 AM wish maple  wrote:

> +1 (non-binding)
>
> It would help a lot when processing UTF-8 related data!
>
> Xuwei
>
> Andrew Lamb  于2023年8月22日周二 00:11写道:
>
> > +1
> >
> > This is a great example of collaboration
> >
> > On Sat, Aug 19, 2023 at 4:10 PM Chao Sun  wrote:
> >
> > > +1 (non-binding)!
> > >
> > > On Fri, Aug 18, 2023 at 12:59 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com> wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Fri, 18 Aug 2023 at 18:48 Jacob Wujciak-Jens
> > > >  wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Fri, Aug 18, 2023 at 6:04 PM L. C. Hsieh 
> > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Fri, Aug 18, 2023 at 5:53 AM Neal Richardson
> > > > > >  wrote:
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Thanks all for the thoughtful discussions here.
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > > On Fri, Aug 18, 2023 at 4:14 AM Raphael Taylor-Davies
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Despite my earlier misgivings, I think this will be a
> valuable
> > > > > addition
> > > > > > > > to the specification.
> > > > > > > >
> > > > > > > > To clarify I've interpreted this as a vote on both Utf8View
> and
> > > > > > > > BinaryView as in the linked PR.
> > > > > > > >
> > > > > > > > On 28/06/2023 20:34, Benjamin Kietzman wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I'd like to propose adding Utf8View arrays to the arrow
> > format.
> > > > > > > > > Previous discussion in [1], columnar format description in
> > [2],
> > > > > > > > > flatbuffers changes in [3].
> > > > > > > > >
> > > > > > > > > There are implementations available in both C++[4] and
> Go[5]
> > > > which
> > > > > > > > > exercise the new type over IPC. Utf8View format
> > demonstrates[6]
> > > > > > > > > significant performance benefits over Utf8 in common tasks.
> > > > > > > > >
> > > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > > >
> > > > > > > > > [ ] +1 add the proposed Utf8View type to the Apache Arrow
> > > format
> > > > > > > > > [ ] -1 do not add the proposed Utf8View type to the Apache
> > > Arrow
> > > > > > format
> > > > > > > > > because...
> > > > > > > > >
> > > > > > > > > Sincerely,
> > > > > > > > > Ben Kietzman
> > > > > > > > >
> > > > > > > > > [1]
> > > > > https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
> > > > > > > > > [4] https://github.com/apache/arrow/pull/35628
> > > > > > > > > [5] https://github.com/apache/arrow/pull/35769
> > > > > > > > > [6]
> > > > > >
> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Acero and Substrait: How to select struct field from a struct column?

2023-08-07 Thread Weston Pace
> But I can't figure out how to express "select struct field 0 from field 2
> of the original table where field 2 is a struct column"
>
> Any idea how the substrait message should look like for the above?

I believe it would be:

```
"expression": {
  "selection": {
"direct_reference": {
  "struct_field" {
"field": 2,
"child" {
  "struct_field" {  "field": 0 }
}
  }
}
"root_reference": { }
  }
}
```

To get the above I used the following python (requires [1] which could use
a review and you need some way to convert the binary substrait to json, I
used a script I have lying around):

```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> schema = pa.schema([pa.field("points", pa.struct([pa.field("x", pa.float64()), pa.field("y", pa.float64())]))])
>>> expr = pc.field(("points", "x"))
>>> expr.to_substrait(schema)

```

[1] https://github.com/apache/arrow/pull/34834

On Tue, Aug 1, 2023 at 1:45 PM Li Jin  wrote:

> Hi,
>
> I am recently trying to do
> (1) assign a struct type column s
> (2) flatten the struct columns (by assign v1=s[v1], v2=s[v2] and drop the s
> column)
>
> via Substrait and Acero.
>
> However, I ran into the problem where I don't know the proper substrait
> message to encode this (for (2))
>
> Normally, if I select a column from the origin table, it would look like
> this (e.g, select column index 1 from the original table):
>
> selection {
>   direct_reference {
> struct_field {
> 1
> }
>   }
> }
>
> But I can't figure out how to express "select struct field 0 from field 2
> of the original table where field 2 is a struct column"
>
> Any idea how the substrait message should look like for the above?
>


Re: [DISCUSS] Canonical alternative layout proposal

2023-08-02 Thread Weston Pace
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).

I've proposed something at [1].

> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).

I think this is similar to the proposal with the exception that your
suggestion would require amending existing types that happen to be
alternatives to each other.  I'm not opposed to it but I think it's
compatible and we don't necessarily need all of the complexity just yet
(feel free to correct me if I'm wrong).  I don't think we need to introduce
the concept of "kind".  We already have a concept of "logical type" in the
spec.  I think what you are stating is that a single logical type may have
multiple physical layouts.  I agree.  E.g. variable size list<32>, variable
size list<64>, and REE are the physical layouts that, combined with the
logical type "string", give you "string", "large string", and "ree"

[1] https://github.com/apache/arrow/pull/37000
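
To make the "one logical type, several physical layouts" point concrete, here
is a small sketch in plain pyarrow (assuming a recent release with the
standard cast kernels): the same logical string values round-trip between the
32-bit-offset and 64-bit-offset layouts unchanged.

```
import pyarrow as pa

small = pa.array(["foo", "bar", None], type=pa.string())  # 32-bit offsets
large = small.cast(pa.large_string())                     # 64-bit offsets

# Logically identical values, physically different layouts.
assert large.cast(pa.string()).equals(small)
```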

On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho 
wrote:

> A major difficulty in making the Arrow array types open for extension [1]
> is that as soon as we define an (a) universal representation* or (b)
> abstract interface, we close the door for vectorization. (a) prevents
> having new vectorization friendly formats and (b) limits the implementation
> of new vectorized operations. This is an instance of the “expression
> problem” [2].
>
> The way Arrow currently “solves” the data abstraction problem is by having
> no data abstraction — every operation takes a type and should provide
> specializations for every type. Sometimes it’s possible to re-use the same
> kernel for different types, but the general approach is that we specialize
> (in the case of C++, we sometimes can specialize by just instantiating a
> template, but that’s still an specialization).
>
> Given these constraints, what could be done?
>
> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).
>
> Then when different implementations have to communicate or interoperate,
> they have to only be up to date on the list of Arrow Kinds and before data
> is moved a conversion step between types within the same kind is performed
> if required to make that communication possible.
>
> Example: a system that has a string_view Array and needs to send that array
> to a system that only understands large_string instances of the string kind
> MUST perform a conversion. This means that as long as all Arrow
> implementations understand one established type on each of the kinds, they
> can communicate.
>
> This imposes a reasonable requirement on new types: when introduced, they
> should come with conversions to the previously specified types on that
> kind.
>
> Any thoughts?
>
> —
> Felipe
> Voltron Data
>
>
> [1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
> [2] https://en.wikipedia.org/wiki/Expression_problem
>
> * “an array is a list of buffers and child arrays” doesn’t qualify as
> “universal representation” because it doesn’t make a commitment on what all
> the buffers and child arrays mean universally
>
> ** if kind is already taken to mean scalar/array, we can use the term
> “sort”
>
> On Mon, 31 Jul 2023 at 04:39 Gang Wu  wrote:
>
> > I am also in favor of the idea of an alternative layout. IIRC, a new
> > alternative
> > layout still goes into a process of standardization though it is the
> choice
> > of
> > each implementation to decide support now or later. I'd like to ask if we
> > can
> > provide the flexibility for implementations or downstream projects to
> > actually
> > implement a new alternative layout by means of a pluggable interface
> before
> > starting the standardization process. This is similar to promoting a
> > popular
> > extension type implemented by many users to a canonical extension type.
> > I know this is more complicated as extension type simply reuses existing
> > layout but alternative layout usually means a brand new one. For example,
> > if two projects speak Arrow and now they want to share a new layout, they
> > can simply implement a pluggable alternative layout before Arrow adopts
> it.
> > This can unblock projects to evolve and help Arrow not to be fragmented.
> >
> > Best,
> > Gang
> >
> > On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Hello,
> > >
> > > I'm trying to reason about the advantages and drawbacks of this
> > > proposal, but it seems to me that it lacks definition.
> > >
> > > I would welcome a draft PR showcasing the changes necessary in the IPC
> > > format definition, and in the C Data Interface specification (no need
> to

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-31 Thread Weston Pace
Thanks.  This is a very helpful reproduction.

I was able to reproduce and diagnose the problem.  There is a bug on our
end and I have filed [1] to address it.  There are a lot more details in
the ticket if you are interested.  In the meantime, the only workaround I
can think of is probably to slow down the data source enough that the queue
doesn't fill up.

[1] https://github.com/apache/arrow/issues/36951
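
For the reproduction below, "slow down the data source" can be as simple as
pausing between yielded batches. A crude, purely illustrative sketch (the
delay value is arbitrary and would need tuning):

```
import time

def throttled(batches, delay_s=0.01):
    # Wrap the original generator and pause between batches so the
    # dataset writer's queue has a chance to drain.
    for batch in batches:
        yield batch
        time.sleep(delay_s)
```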


On Sun, Jul 30, 2023 at 8:15 PM Wenbo Hu  wrote:

> Hi,
> The following code should reproduce the problem.
>
> ```
>
> import pyarrow as pa
> import pyarrow.fs, pyarrow.dataset
>
> schema = pa.schema([("id", pa.utf8()), ("bucket", pa.uint8())])
>
>
> def rb_generator(buckets, rows, batches):
>     batch = pa.record_batch(
>         [[f"id-{i}" for i in range(rows)], [i % buckets for i in range(rows)]],
>         schema=schema,
>     )
>
>     for i in range(batches):
>         yield batch
>         print(f"yielding {i}")
>
>
> if __name__ == "__main__":
>     pa.set_io_thread_count(1)
>     reader = pa.RecordBatchReader.from_batches(schema,
>                                                rb_generator(64, 32768, 100))
>     local_fs = pa.fs.LocalFileSystem()
>
>     pa.dataset.write_dataset(
>         reader,
>         "/tmp/data_f",
>         format="feather",
>         partitioning=["bucket"],
>         filesystem=local_fs,
>         existing_data_behavior="overwrite_or_ignore"
>     )
>
> ```
>
> On Sun, Jul 30, 2023 at 3:30 PM Wenbo Hu  wrote:
> >
> > Hi,
> >Then another question is that "why back pressure not working on the
> > input stream of write_dataset api?". If back pressure happens on the
> > end of the acero stream for some reason (on queue stage or write
> > stage), then the input stream should backpressure as well? It should
> > keep the memory to a stable level so that the input speed would match
> > the output speed.
> >
> > Then, I made some other experiments with various io_thread_count
> > values and bucket_size (partitions/opening files).
> >
> > 1. for bucket_size to 64 and io_thread_count/cpu_count to 1, the cpu
> > is up to 100% after transferring done, but there is a very interesting
> > output.
> > * flow transferring from client to server at the very first few
> > batches are slow, less than 0.01M rows/s, then it speeds up to over 4M
> > rows/s very quickly.
> > * I think at the very beginning stage, the backpressure works
> > fine, until sometime, like the previous experiments, the backpressure
> > makes the stream into a blackhole, then the io thread input stream
> > stuck at some slow speed. (It's still writing, but takes a lot of time
> > on waiting upstream CPU partitioning threads to push batches?)
> > * from iotop, the total disk write is dropping down very slowly
> > after transferring done. But it may change over different experiments
> > with the same configuration. I think the upstream backpressure is not
> > working as expected, which makes the downstream writing keep querying.
> > I think it may reveal something, maybe at some point, the slow writing
> > enlarge the backpressure on the whole process (the write speed is
> > dropping slowly), but the slow reason of writing is the upstream is
> > already slow down.
> >
> > 2. Then I set cpu_count to 64
> > 2.1 io_thread_count to 4.
> > 2.1.1 , for bucket_size to 2/4/6, The system works fine. CPU is less
> > than 100%. Backpressure works fine, memory will not accumulated much
> > before the flow speed becomes stable.
> > 2.1.2  when bucket_size to 8, the bug comes back. After transferring
> > done, CPU is about 350% (only io thread is running?) and write from
> > iotop is about 40M/s, then dropping down very slowly.
> >
> > 2.2. then I set both io_thread to 6,
> > 2.2.1 for bucket_size to 6/8/16, The system works fine. CPU is about
> > 100%. like 2.1.1
> > 2.2.2 for bucket_size to 32, the bug comes back. CPU halts at 550%.
> >
> > 2.3 io_thread_count to 8
> > 2.3.1 for bucket_size to 16, it fails somehow. After transferring
> > done, the memory accumulated over 3G, but write speed is about 60M/s,
> > which makes it possible to wait. CPU is about 600~700%. After the
> > accumulated memory writing, CPU becomes normal.
> > 2.3.2 for bucket_size to 32, it still fails. CPU halts at 800%.
> > transferring is very fast (over 14M rows/s). the backpressure is not
> > working at all.
> >
> >
> > On Sat, Jul 29, 2023 at 1:08 AM Weston Pace  wrote:
> > >
> > > > How many io threads are writi

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-28 Thread Weston Pace
nt to 1, io_thread_count to 128, CPU goes to 800% much
> slower than record2 (due to slower flow speed?), RES is growing to 30G
> to the end of transferring, average flow speed is 1.62M rows/s. Same
> happens as record 1 after transferring done.
>
> Then I'm trying to limit the flow speed before writing queue got full
> with custom flow control (sleep on reader iteration based on available
> memory) But the sleep time curve is not accurate, sometimes flow slows
> down, but the queue got full anyway.
> Then the interesting thing happens, before the queue is full (memory
> quickly grows up), the CPU is not fully used. When memory grows up
> quickly, CPU goes up as well, to 800%.
> 1. Sometimes, the writing queue can overcome, CPU will goes down after
> the memory accumulated. The writing speed recoved and memory back to
> normal.
> 2. Sometimes, it can't. IOBPS goes down sharply, and CPU never goes
> down after that.
>
> How many io threads are writing concurrently in a single write_dataset
> call? How do they schedule? The throttle code seems only one task got
> running?
> What else can I do to inspect the problem?
>
> On Fri, Jul 28, 2023 at 12:33 AM Weston Pace  wrote:
> >
> > You'll need to measure more but generally the bottleneck for writes is
> > usually going to be the disk itself.  Unfortunately, standard OS buffered
> > I/O has some pretty negative behaviors in this case.  First I'll describe
> > what I generally see happen (the last time I profiled this was a while
> back
> > but I don't think anything substantial has changed).
> >
> > * Initially, writes are very fast.  The OS `write` call is simply a
> memcpy
> > from user space into kernel space.  The actual flushing the data from
> > kernel space to disk happens asynchronously unless you are using direct
> I/O
> > (which is not currently supported).
> > * Over time, assuming the data arrival rate is faster than the data write
> > rate, the data will accumulate in kernel memory.  For example, if you
> > continuously run the Linux `free` program you will see the `free` column
> > decrease and the `buff/cache` column decreases.  The `available` column
> > should generally stay consistent (kernel memory that is in use but can
> > technically be flushed to disk if needed is still considered "available"
> > but not "free")
> > * Once the `free` column reaches 0 then a few things happen.  First, the
> > calls to `write` are no longer fast (the write cannot complete until some
> > existing data has been flushed to disk).  Second, other processes that
> > aren't in use might start swapping their data to disk (you will generally
> > see the entire system, if it is interactive, grind to a halt).  Third, if
> > you have an OOM-killer active, it may start to kill running applications.
> > It isn't supposed to do so but there are sometimes bugs[1].
> > * Assuming the OOM killer does not kill your application then, because
> the
> > `write` calls slow down, the number of rows in the dataset writer's queue
> > will start to fill up (this is captured by the variable
> > `rows_in_flight_throttle`.
> > * Once the rows_in_flight_throttle is full it will pause and the dataset
> > writer will return an unfinished future (asking the caller to back off).
> > * Once this happens the caller will apply backpressure (if being used in
> > Acero) which will pause the reader.  This backpressure is not instant and
> > generally each running CPU thread still delivers whatever batch it is
> > working on.  These batches essentially get added to an asynchronous
> > condition variable waiting on the dataset writer queue to free up.  This
> is
> > the spot where the ThrottledAsyncTaskScheduler is used.
> >
> > The stack dump that you reported is not exactly what I would have
> expected
> > but it might still match the above description.  At this point I am just
> > sort of guessing.  When the dataset writer frees up enough to receive
> > another batch it will do what is effectively a "notify all" and all of
> the
> > compute threads are waking up and trying to add their batch to the
> dataset
> > writer.  One of these gets through, gets added to the dataset writer, and
> > then backpressure is applied again and all the requests pile up once
> > again.  It's possible that a "resume sending" signal is sent and this
> might
> > actually lead to RAM filling up more.  We could probably mitigate this by
> > adding a low water mark to the dataset writer's backpressure throttle (so
> > it doesn't send the resume signal as s

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-27 Thread Weston Pace
You'll need to measure more but generally the bottleneck for writes is
usually going to be the disk itself.  Unfortunately, standard OS buffered
I/O has some pretty negative behaviors in this case.  First I'll describe
what I generally see happen (the last time I profiled this was a while back
but I don't think anything substantial has changed).

* Initially, writes are very fast.  The OS `write` call is simply a memcpy
from user space into kernel space.  The actual flushing the data from
kernel space to disk happens asynchronously unless you are using direct I/O
(which is not currently supported).
* Over time, assuming the data arrival rate is faster than the data write
rate, the data will accumulate in kernel memory.  For example, if you
continuously run the Linux `free` program you will see the `free` column
decrease and the `buff/cache` column decreases.  The `available` column
should generally stay consistent (kernel memory that is in use but can
technically be flushed to disk if needed is still considered "available"
but not "free")
* Once the `free` column reaches 0 then a few things happen.  First, the
calls to `write` are no longer fast (the write cannot complete until some
existing data has been flushed to disk).  Second, other processes that
aren't in use might start swapping their data to disk (you will generally
see the entire system, if it is interactive, grind to a halt).  Third, if
you have an OOM-killer active, it may start to kill running applications.
It isn't supposed to do so but there are sometimes bugs[1].
* Assuming the OOM killer does not kill your application then, because the
`write` calls slow down, the number of rows in the dataset writer's queue
will start to fill up (this is captured by the variable
`rows_in_flight_throttle`).
* Once the rows_in_flight_throttle is full it will pause and the dataset
writer will return an unfinished future (asking the caller to back off).
* Once this happens the caller will apply backpressure (if being used in
Acero) which will pause the reader.  This backpressure is not instant and
generally each running CPU thread still delivers whatever batch it is
working on.  These batches essentially get added to an asynchronous
condition variable waiting on the dataset writer queue to free up.  This is
the spot where the ThrottledAsyncTaskScheduler is used.

The stack dump that you reported is not exactly what I would have expected
but it might still match the above description.  At this point I am just
sort of guessing.  When the dataset writer frees up enough to receive
another batch it will do what is effectively a "notify all" and all of the
compute threads are waking up and trying to add their batch to the dataset
writer.  One of these gets through, gets added to the dataset writer, and
then backpressure is applied again and all the requests pile up once
again.  It's possible that a "resume sending" signal is sent and this might
actually lead to RAM filling up more.  We could probably mitigate this by
adding a low water mark to the dataset writer's backpressure throttle (so
it doesn't send the resume signal as soon as the queue has room but waits
until the queue is half full).

I'd recommend watching the output of `free` (or monitoring memory in some
other way) and verifying the above.  I'd also suggest lowering the number
of CPU threads and see how that affects performance.  If you lower the CPU
threads enough then you should eventually get it to a point where your
supply of data is slower then your writer and I wouldn't expect memory to
accumulate.  These things are solutions but might give us more clues into
what is happening.

[1]
https://unix.stackexchange.com/questions/300106/why-is-the-oom-killer-killing-processes-when-swap-is-hardly-used
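
For the "lower the number of CPU threads" experiment, the relevant knobs are
exposed in pyarrow (a sketch only; the particular values are arbitrary
starting points, not recommendations):

```
import pyarrow as pa

pa.set_cpu_count(2)        # fewer compute threads delivering batches
pa.set_io_thread_count(8)  # fewer threads issuing writes concurrently
```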

On Thu, Jul 27, 2023 at 4:53 AM Wenbo Hu  wrote:

> Hi,
> I'm using flight to receive streams from client and write to the
> storage with python `pa.dataset.write_dataset` API. The whole data is
> 1 Billion rows, over 40GB with one partition column ranges from 0~63.
> The container runs at 8-cores CPU and 4GB ram resources.
> It can be done about 160s (6M rows/s, each record batch is about
> 32K rows) for completing transferring and writing almost
> synchronously, after setting 128 for io_thread_count.
>  Then I'd like to find out the bottleneck of the system, CPU or
> RAM or storage?
> 1. I extend the ram into 32GB, then the server consumes more ram,
> the writing progress works at the beginning, then suddenly slow down
> and the data accumulated into ram until OOM.
> 2. Then I set the ram to 64GB, so that the server will not killed
> by OOM. Same happens, also, after all the data is transferred (in
> memory), the server consumes all CPU shares (800%), but still very
> slow on writing (not totally stopped, but about 100MB/minute).
> 2.1 I'm wondering if the io thread is stuck, or the computation
> task is stuck. After setting both io_thread_count and cpu_count to 32,
> I wrapped the 

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-26 Thread Weston Pace
Also, if you haven't seen it yet, the 13.0.0 release adds considerably more
documentation around Acero, including the scheduler:

https://arrow.apache.org/docs/dev/cpp/acero/developer_guide.html#scheduling-and-parallelism

On Wed, Jul 26, 2023 at 10:13 AM Li Jin  wrote:

> Thanks Weston! Very helpful explanation.
>
> On Tue, Jul 25, 2023 at 6:41 PM Weston Pace  wrote:
>
> > 1) As a rule of thumb I would probably prefer `async_scheduler`.  It's
> more
> > feature rich and simpler to use and is meant to handle "long running"
> tasks
> > (e.g. 10s-100s of ms or more).
> >
> > The scheduler is a bit more complex and is intended for very fine-grained
> > scheduling.  It's currently only used in a few nodes, I think the
> hash-join
> > and the hash-group-by for things like building the hash table (after the
> > build data has been accumulated).
> >
> > 2) Neither scheduler manages threads.  Both of them rely on the executor
> in
> > ExecContext::executor().  The scheduler takes a "schedule task callback"
> > which it expects to do the actual executor submission.  The async
> scheduler
> > uses futures and virtual classes.  A "task" is something that can be
> called
> > which returns a future that will be completed when the task is complete.
> > Most of the time this is done by submitting something to an executor (in
> > return for a future).  Sometimes this is done indirectly, for example, by
> > making an async I/O call (which under the hood is usually implemented by
> > submitting something to the I/O executor).
> >
> > On Tue, Jul 25, 2023 at 2:56 PM Li Jin  wrote:
> >
> > > Hi,
> > >
> > > I am reading Acero and got confused about the use of
> > > QueryContext::scheduler() and QueryContext::async_scheduler(). So I
> have
> > a
> > > couple of questions:
> > >
> > > (1) What are the different purposes of these two?
> > > (2) Does scheduler/aysnc_scheduler own any threads inside their
> > respective
> > > classes or do they use the thread pool from ExecContext::executor()?
> > >
> > > Thanks,
> > > Li
> > >
> >
>


Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
> Replacing ... with ... works as expected

This is, I think, because the RecordBatchSourceNode defaults to implicit
ordering (note the RecordBatchSourceNode is a SchemaSourceNode):

```
struct SchemaSourceNode : public SourceNode {
  SchemaSourceNode(ExecPlan* plan, std::shared_ptr<Schema> schema,
                   arrow::AsyncGenerator<std::optional<ExecBatch>> generator)
      : SourceNode(plan, schema, generator, Ordering::Implicit()) {}
```

It seems a little inconsistent that the RecordBatchSourceNode defaults to
implicit and the RecordBatchReader source does not.  It would be nice to
fix those (as I described above it is probably ok to assume an implicit
ordering in many cases).

On Wed, Jul 26, 2023 at 8:18 AM Weston Pace  wrote:

> > I think the key problem is that the input stream is unordered. The
> > input stream is a ArrowArrayStream imported from python side, and then
> > declared to a "record_batch_reader_source", which is a unordered
> > source node. So the behavior is expected.
> >   I think the RecordBatchReaderSourceOptions should add an ordering
> > parameter to indicate the input stream ordering. Otherwise, I need to
> > convert the record_batch_reader_source into a "record_batch_source"
> > with a record_batch generator.
>
> I agree.  Also, keep in mind that there is the "implicit order" which
> means "this source node has some kind of deterministic ordering even if it
> isn't reflected in any single column".  In other words, if you have a CSV
> file for example it will always be implicitly ordered (by line number) even
> if the line number isn't a column.  This should allow us to do things like
> "grab the first 10 rows" and get behavior the user expects even if their
> data isn't explicitly ordered.  In most cases we can assume that data has
> some kind of batch order.  The only time it does not is if the source
> itself is non-deterministic.  For example, maybe the source is some kind of
> unordered scan from an external SQL source.
>
> > Also, I'd like to have a discuss on dataset scanner, is it produce a
> > stable sequence of record batches (as an implicit ordering) when the
> > underlying storage is not changed?
>
> Yes, both the old and new scan node are capable of doing this.  The
> implicit order is given by the order of the fragments in the dataset (which
> we assume will always be consistent and, in the case of a
> FileSystemDataset, it is).  In the old scan node you need to set the
> property `require_sequenced_output` on the ScanNodeOptions to true (I
> believe the new scan node will always sequence output but this property may
> eventually exist there too).
>
> > For my situation, the downstream
> > executor may crush, then it would request to continue from a
> > intermediate state (with a restart offset). I'd like to make it into a
> > fetch node to skip heading rows, but it seems not an optimized way.
>
> Regrettably the old scan node does not have skip implemented.  It is a
> little tricky since we do not have a catalog and thus do not know how many
> rows every single file has.  So we have to calculate the skip at runtime.
> I am planning to support this in the new scan node.
>
> > Maybe I should inspect fragments in the dataset, to skip reading
> > unnecessary files, and build a FileSystemDataset on the fly?
>
> Yes, this should work today.
>
>
> On Tue, Jul 25, 2023 at 10:37 PM Wenbo Hu  wrote:
>
>> Replacing
>> ```
>> ac::Declaration source{"record_batch_reader_source",
>> ac::RecordBatchReaderSourceNodeOptions{std::move(input)}};
>> ```
>> with
>> ```
>> ac::RecordBatchSourceNodeOptions rb_source_options{
>> input->schema(), [input]() { return
>> arrow::MakeFunctionIterator([input] { return input->Next(); }); }};
>> ac::Declaration source{"record_batch_source",
>> std::move(rb_source_options)};
>> ```
>> Works as expected.
>>
>> On Wed, Jul 26, 2023 at 10:22 AM Wenbo Hu  wrote:
>> >
>> > Hi,
>> >   I'll open a issue on the DeclareToReader problem.
>> >   I think the key problem is that the input stream is unordered. The
>> > input stream is a ArrowArrayStream imported from python side, and then
>> > declared to a "record_batch_reader_source", which is a unordered
>> > source node. So the behavior is expected.
>> >   I think the RecordBatchReaderSourceOptions should add an ordering
>> > parameter to indicate the input stream ordering. Otherwise, I need to
>> > convert the record_batch_reader_source into a "record_batch_source"
>> > with a record_batch generator.
>> >  

Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
> I think the key problem is that the input stream is unordered. The
> input stream is a ArrowArrayStream imported from python side, and then
> declared to a "record_batch_reader_source", which is a unordered
> source node. So the behavior is expected.
>   I think the RecordBatchReaderSourceOptions should add an ordering
> parameter to indicate the input stream ordering. Otherwise, I need to
> convert the record_batch_reader_source into a "record_batch_source"
> with a record_batch generator.

I agree.  Also, keep in mind that there is the "implicit order" which means
"this source node has some kind of deterministic ordering even if it isn't
reflected in any single column".  In other words, if you have a CSV file
for example it will always be implicitly ordered (by line number) even if
the line number isn't a column.  This should allow us to do things like
"grab the first 10 rows" and get behavior the user expects even if their
data isn't explicitly ordered.  In most cases we can assume that data has
some kind of batch order.  The only time it does not is if the source
itself is non-deterministic.  For example, maybe the source is some kind of
unordered scan from an external SQL source.

> Also, I'd like to have a discuss on dataset scanner, is it produce a
> stable sequence of record batches (as an implicit ordering) when the
> underlying storage is not changed?

Yes, both the old and new scan node are capable of doing this.  The
implicit order is given by the order of the fragments in the dataset (which
we assume will always be consistent and, in the case of a
FileSystemDataset, it is).  In the old scan node you need to set the
property `require_sequenced_output` on the ScanNodeOptions to true (I
believe the new scan node will always sequence output but this property may
eventually exist there too).

> For my situation, the downstream
> executor may crush, then it would request to continue from a
> intermediate state (with a restart offset). I'd like to make it into a
> fetch node to skip heading rows, but it seems not an optimized way.

Regrettably the old scan node does not have skip implemented.  It is a
little tricky since we do not have a catalog and thus do not know how many
rows every single file has.  So we have to calculate the skip at runtime.
I am planning to support this in the new scan node.

> Maybe I should inspect fragments in the dataset, to skip reading
> unnecessary files, and build a FileSystemDataset on the fly?

Yes, this should work today.
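
A rough sketch of that fragment inspection (pyarrow datasets; the path and
partitioning here are made up for illustration):

```
import pyarrow.dataset as ds

dataset = ds.dataset("/tmp/data_f", format="feather", partitioning=["bucket"])

# Fragments are listed in a stable order, which is what gives a
# FileSystemDataset its implicit ordering; skip any already processed.
for fragment in dataset.get_fragments():
    print(fragment.path)
```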


On Tue, Jul 25, 2023 at 10:37 PM Wenbo Hu  wrote:

> Replacing
> ```
> ac::Declaration source{"record_batch_reader_source",
> ac::RecordBatchReaderSourceNodeOptions{std::move(input)}};
> ```
> with
> ```
> ac::RecordBatchSourceNodeOptions rb_source_options{
> input->schema(), [input]() { return
> arrow::MakeFunctionIterator([input] { return input->Next(); }); }};
> ac::Declaration source{"record_batch_source",
> std::move(rb_source_options)};
> ```
> Works as expected.
>
> On Wed, Jul 26, 2023 at 10:22 AM Wenbo Hu  wrote:
> >
> > Hi,
> >   I'll open a issue on the DeclareToReader problem.
> >   I think the key problem is that the input stream is unordered. The
> > input stream is a ArrowArrayStream imported from python side, and then
> > declared to a "record_batch_reader_source", which is a unordered
> > source node. So the behavior is expected.
> >   I think the RecordBatchReaderSourceOptions should add an ordering
> > parameter to indicate the input stream ordering. Otherwise, I need to
> > convert the record_batch_reader_source into a "record_batch_source"
> > with a record_batch generator.
> >   Also, I'd like to have a discuss on dataset scanner, is it produce a
> > stable sequence of record batches (as an implicit ordering) when the
> > underlying storage is not changed? For my situation, the downstream
> > executor may crush, then it would request to continue from a
> > intermediate state (with a restart offset). I'd like to make it into a
> > fetch node to skip heading rows, but it seems not an optimized way.
> > Maybe I should inspect fragments in the dataset, to skip reading
> > unnecessary files, and build a FileSystemDataset on the fly?
> >
> > On Tue, Jul 25, 2023 at 11:44 PM Weston Pace  wrote:
> > >
> > > > Reading the source code of exec_plan.cc, DeclarationToReader called
> > > > DeclarationToRecordBatchGenerator, which ignores the sequence_output
> > > > parameter in SinkNodeOptions, also, it calls validate which should
> > > > fail if the SinkNodeOptions honors the sequence_output. Then it seems
> > > > that DeclarationToReader cannot 

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-25 Thread Weston Pace
1) As a rule of thumb I would probably prefer `async_scheduler`.  It's more
feature rich and simpler to use and is meant to handle "long running" tasks
(e.g. 10s-100s of ms or more).

The scheduler is a bit more complex and is intended for very fine-grained
scheduling.  It's currently only used in a few nodes, I think the hash-join
and the hash-group-by for things like building the hash table (after the
build data has been accumulated).

2) Neither scheduler manages threads.  Both of them rely on the executor in
ExecContext::executor().  The scheduler takes a "schedule task callback"
which it expects to do the actual executor submission.  The async scheduler
uses futures and virtual classes.  A "task" is something that can be called
which returns a future that will be completed when the task is complete.
Most of the time this is done by submitting something to an executor (in
return for a future).  Sometimes this is done indirectly, for example, by
making an async I/O call (which under the hood is usually implemented by
submitting something to the I/O executor).
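
By way of analogy only (this is Python's concurrent.futures, not Acero's C++
scheduler), the shape of "a task is a callable that returns a future, and the
future comes from submitting work to an executor the scheduler does not own"
looks roughly like:

```
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def make_task(payload):
    # The "task" is just a callable returning a future; the future is
    # produced by submitting the real work to an executor.
    def task() -> Future:
        return executor.submit(lambda: payload * 2)
    return task

future = make_task(21)()
print(future.result())  # 42
```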

On Tue, Jul 25, 2023 at 2:56 PM Li Jin  wrote:

> Hi,
>
> I am reading Acero and got confused about the use of
> QueryContext::scheduler() and QueryContext::async_scheduler(). So I have a
> couple of questions:
>
> (1) What are the different purposes of these two?
> (2) Does scheduler/aysnc_scheduler own any threads inside their respective
> classes or do they use the thread pool from ExecContext::executor()?
>
> Thanks,
> Li
>


Re: how to make acero output order by batch index

2023-07-25 Thread Weston Pace
> Reading the source code of exec_plan.cc, DeclarationToReader called
> DeclarationToRecordBatchGenerator, which ignores the sequence_output
> parameter in SinkNodeOptions, also, it calls validate which should
> fail if the SinkNodeOptions honors the sequence_output. Then it seems
> that DeclarationToReader cannot follow the input batch order?

These methods should not be ignoring sequence_output.  Do you want to open
a bug?  This should be a straightforward one to fix.

> Then how the substrait works in this scenario? Does it output
> disorderly as well?

Probably.  Much of internal Substrait testing is probably using
DeclarationToTable or DeclarationToBatches.  The ordered execution hasn't
been adopted widely yet because the old scanner doesn't set the batch index
and the new scanner isn't ready yet.  This limits the usefulness to data
that is already in memory (the in-memory sources do set the batch index).

I think your understanding of the concept is correct however.  Can you
share a sample plan that is not working for you?  If you use
DeclarationToTable do you get consistently ordered results?

On Tue, Jul 25, 2023 at 7:06 AM Wenbo Hu  wrote:

> Reading the source code of exec_plan.cc, DeclarationToReader called
> DeclarationToRecordBatchGenerator, which ignores the sequence_output
> parameter in SinkNodeOptions, also, it calls validate which should
> fail if the SinkNodeOptions honors the sequence_output. Then it seems
> that DeclarationToReader cannot follow the input batch order?
> Then how the substrait works in this scenario? Does it output
> disorderly as well?
>
> On Tue, Jul 25, 2023 at 7:12 PM Wenbo Hu  wrote:
> >
> > Hi,
> > I'm trying to zip two streams with same order but different
> processes.
> > For example, the original stream comes with two column 'id' and
> > 'age', and splits into two stream processed distributedly using acero:
> > 1. hash the 'id' into a stream with single column 'bucket_id' and 2.
> > classify 'age' into ['child', 'teenage', 'adult',...]. And then zip
> > into a single stream.
> >
> >[  'id'  |  'age'  | many other columns]
> > |  ||
> >['bucket_id']   ['classify']|
> >  |  |   |
> >   [zipped_stream | many_other_columns]
> > I was expecting both bucket_id and classify can keep the same order as
> > the orginal stream before they are zipped.
> > According to document, "ordered execution" is using batch_index to
> > indicate the order of batches.
> > but acero::DeclarationToReader with a QueryOptions that sequce_output
> > is set to true does not mean that it keeps the order if the input
> > stream is not ordered. But it doesn't fail during the execution
> > (bucket_id and classify are not specify any ordering). Then How can I
> > make the acero produce a stream that keep the order as the original
> > input?
> > --
> > -
> > Best Regards,
> > Wenbo Hu,
>
>
>
> --
> -
> Best Regards,
> Wenbo Hu,
>


Re: hashing Arrow structures

2023-07-24 Thread Weston Pace
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?

It's not user-configurable.  The hash-join and hash-group-by always use the
32-bit variant.  The asof-join always uses the 64-bit variant.  I wouldn't
stress too much about the hash-join.  It is a very memory intensive
operation and my guess is that by the time you have enough keys to worry
about hash uniqueness you should probably be doing an out-of-core join
anyways.  The hash-join implementation is also fairly tolerant to duplicate
keys anyways.  I believe our hash-join performance is unlikely to be the
bottleneck in most cases.

It might make more sense to use the 64-bit variant for the group-by, as we
are normally only storing the hash-to-group-id table itself in those
cases.  Solid benchmarking would probably be needed regardless.

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou  wrote:

>
> Hi,
>
> Le 21/07/2023 à 15:58, Yaron Gvili a écrit :
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is by serializing the Arrow structures
> (possibly by streaming) and hashing using functions in `util/hashing.h`. I
> didn't yet look into what properties these hash functions have except for
> the documented high performance. In particular, I don't know whether they
> have unfortunate hash collisions and, more generally, what is the
> probability of hash collision. I also don't know whether they are designed
> for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash, but they were not
> exercised on real-world data at the time. I have no idea whether they
> are efficient in the context of joining, as they have been written much
> earlier than our joining implementation.
>
> Regards
>
> Antoine.
>


Re: hashing Arrow structures

2023-07-21 Thread Weston Pace
Yes, those are the two main approaches to hashing in the code base that I
am aware of as well.  I haven't seen any real concrete comparison and
benchmarks between the two.  If collisions between NA and 0 are a problem
it would probably be ok to tweak the hash value of NA to something unique.
I suspect these collisions aren't an inevitable fact of the design but more
just something that has not been considered yet.

There is a third way currently in use as well by
arrow::compute::GrouperImpl.  In this class the key values from each row
are converted into a single "row format" which is stored in a std::string.
A std::unordered_map is then used for the hashing.  The GrouperFastImpl
class was created to (presumably) be faster than this.  It uses the
Hashing32 routines and stores the groups in an arrow::compute::SwissTable.
However, I think there was some benchmarking done that showed the
GrouperFastImpl to be faster than the GrouperImpl.
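
For a rough sense of the collision concern raised below (a back of the
envelope birthday-bound estimate under a uniform-hash assumption, not a
statement about any particular hash function): with about a million distinct
keys you expect on the order of a hundred collisions at 32 bits and
essentially none at 64 bits.

```
import math

n = 1_000_000  # distinct keys
for bits in (32, 64):
    space = 2.0 ** bits
    # Expected number of colliding keys under a uniform-hash model.
    expected = n - space * (1.0 - math.exp(-n / space))
    print(f"{bits}-bit: ~{expected:.6f} expected collisions")
```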

On Fri, Jul 21, 2023 at 6:59 AM Yaron Gvili  wrote:

> Hi,
>
> What are the recommended ways to hash Arrow structures? What are the pros
> and cons of each approach?
>
> Looking a bit through the code, I've so far found two different hashing
> approaches, which I describe below. Are there any others?
>
> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.
>
>
> Cheers,
> Yaron.
>


Re: Need help on ArrayaSpan and writing C++ udf

2023-07-17 Thread Weston Pace
> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.

I agree with Jin.  You should probably be incrementing `out` by 32 each
time `VisitValue` is called.
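
As a quick sanity check on the layout point (fixed-size binary is a validity
bitmap plus one contiguous data buffer), a small pyarrow sketch with made-up
values:

```
import pyarrow as pa

arr = pa.array([b"\x00" * 32, None], type=pa.binary(32))

# Two buffers: [validity bitmap, data]; the data buffer holds 32 bytes per slot.
print(len(arr.buffers()), arr.type)
```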

On Mon, Jul 17, 2023 at 6:38 AM Aldrin  wrote:

> Oh wait, I see now that you're incrementing with a uint8_t*. That could be
> fine for your own use, but you might want to make sure it aligns with the
> type of your output (Int64Array vs Int32Array).
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 06:20, Aldrin  wrote:
>
> Hi Wenbo,
>
> An ArraySpan is like an ArrayData but does not own the data, so the
> ColumnarFormat doc that Jon shared is relevant for both.
>
> In the case of a binary format, the output ArraySpan must have at least 2
> buffers: the offsets and the contiguous binary data (values). If the output
> of your UDF is something like an Int32Array with no nulls, then I think
> you're writing output correctly.
>
> But, since your pointer is a uint8_t, I think Jin is right and `++` is
> going to move your pointer 1 byte instead of 32 bytes like you intend.
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 05:06, Wenbo Hu  wrote:
>
> Hi Jin,
>
> > but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards?
> I'm implementing the sha256 function as a scalar function, but it
> always inputs with an array, so on visitor pattern, I'll write a 32
> byte hash into the pointer and move to the next for next visit.
> Something like:
> ```
>
> struct BinarySha256Visitor {
> BinarySha256Visitor(uint8_t **out) {
> this->out = out;
> }
> arrow::Status VisitNull() {
> return arrow::Status::OK();
> }
>
> arrow::Status VisitValue(std::string_view v) {
>
> uint8_t hash[32];
> sha256(v, hash);
>
> memcpy(*out++, hash, 32);
>
> return arrow::Status::OK();
> }
>
> uint8_t ** out;
> };
>
> arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan
> &batch, cp::ExecResult *out) {
> arrow::ArraySpanVisitor visitor;
>
> auto *out_values = out->array_span_mutable()->GetValues<uint8_t *>(1);
> BinarySha256Visitor visit(out_values);
> ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));
>
> return arrow::Status::OK();
> }
> ```
> Is it as expected?
>
> On Mon, Jul 17, 2023 at 7:44 PM Jin Shang  wrote:
> >
> > Hi Wenbo,
> >
> > I'd like to known what's the *three* `buffers` are in ArraySpan. What are
> > > `1` means when `GetValues` called?
> >
> > The meaning of buffers in an ArraySpan depends on the layout of its data
> > type. FixedSizeBinary is a fixed-size primitive type, so it has two
> > buffers, one validity buffer and one data buffer. So GetValues(1) would
> > return a pointer to the data buffer.
> > Layouts of data types can be found here[1].
> >
> > what is the actual type should I get from `GetValues`?
> > >
> > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> > any type to suit your need. The template parameter for GetValue is simply
> > forwarded to reinterpret_cast. There are discussions[2] on the soundness
> of
> > using uint8_t to represent bytes but it is what we use now. Since you are
> > only doing a memcpy, uint8_t should be good.
> >
> > Maybe, `auto *out_values = out->array_span_mutable()->GetValues<uint8_t
> > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > I may be missing something, but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards? Otherwise I agree this
> is
> > the way to go.
> >
> > [1]
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> > [2] https://github.com/apache/arrow/issues/36123
> >
> >
> > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu  wrote:
> >
> > > Hi,
> > > I'm using Acero as the stream executor to run large scale data
> > > transformation. The core data used in UDF is `ArraySpan` in
> > > `ExecSpan`, but not much document on ArraySpan. I'd like to known
> > > what's the *three* `buffers` are in ArraySpan. What are `1` means when
> > > `GetValues` called?
> > > For input data, I can use a `ArraySpanVisitor` to iterator over
> > > different input types. But for output data, I don't know how to write
> > > to the`array_span_mutable()` if it is not a simple c_type.
> > > For example, I'm implementing a sha256 udf, which input is
> > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, then
> > > how can I directly write to the out buffers and what is the actual
> > > type should I get from `GetValues`?
> > > Maybe, `auto *out_values =
> > > out->array_span_mutable()->GetValues<uint8_t *>(1);` and
> > > `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > > --
> > > -
> > > Best Regards,
> > > Wenbo Hu,
> > >
>
>
>
> --
> -
> Best Regards,
> Wenbo Hu,
>
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-11 Thread Weston Pace
ke formats,
> these operations will likely not be as interoperable as those
> implemented using Arrow. To what extent an engine favours their own
> format(s) over Arrow will be an engineering trade-off they will have to
> make, but DataFusion has found exclusively using Arrow as the
> interchange format between operators to work well.
>
> > There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> I agree 100% that this sort of interoperability is what makes Arrow so
> compelling and something we should work very hard to preserve. This is
> the crux of my concern with standardising alternative layouts. I
> definitely hope that with time Arrow will penetrate deeper into these
> engines, perhaps in a similar manner to DataFusion, as opposed to
> primarily existing at the surface-level.
>
> [1]: https://github.com/apache/arrow-datafusion/pull/6800
>
> On 10/07/2023 11:38, Weston Pace wrote:
> >> The point I was trying to make, albeit very badly, was that these
> >> operations are typically implemented using some sort of row format [1]
> >> [2], and therefore their performance is not impacted by the array
> >> representations. I think it is both inevitable, and in fact something to
> >> be encouraged, that query engines will implement their own in-memory
> >> layouts and data structures outside of the arrow specification for
> >> specific operators, workloads, hardware, etc... This allows them to make
> >> trade-offs based on their specific application domain, whilst also
> >> ensuring that new ideas and approaches can continue to be incorporated
> >> and adopted in the broader ecosystem. However, to then seek to
> >> standardise these layouts seems to be both potentially unbounded scope
> >> creep, and also somewhat counter productive if the goal of
> >> standardisation is improved interoperability?
> > FWIW, I believe this formats are very friendly for row representation as
> > well, especially when stored as a payload (e.g. in a join).
> >
> > For your more general point though I will ask the same question I asked
> on
> > the ArrayView discussion:
> >
> > Is Arrow meant to only be used in between systems (in this case query
> > engines) or is it also meant to be used in between components of a query
> > engine?
> >
> > For example, if someone (datafusion, velox, etc.) were to come up with a
> > framework for UDFs then would batches be passed in and out of those UDFs
> in
> > the Arrow format?  If every engine has its own bespoke formats internally
> > then it seems we are placing a limit on how far things can be decomposed.
> >  From the C++ perspective, I would personally like to see Arrow be usable
> > within components.  There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> >
> > On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
> >  wrote:
> >
> >>> Thus the approach you
> >>> describe for validating an entire character buffer as UTF-8 then
> checking
> >>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> >> The difference here is that it is perhaps expected for Utf8View to have
> >> gaps in the underlying data that are not referenced as part of any
> >> value, as I had understood this to be one of its benefits over the
> >> current encoding. I think it would therefore be problematic to enforce
> >> these gaps be UTF-8.
> >>
> >>> Furthermore unlike an explicit
> >>> selection vector a kernel may decide to copy and densify dynamically if
> >> it
> >>> detects that output is getting sparse or fragmented
> >> I don't see why you couldn't do something similar to materialize a
> >> sparse selection vector, if anything being able to centralise this logic
> >> outside specific kernels would be advantageous.
> >>
> >>> Specifically sorting and equality comparison
> >>> benefit significantly from the prefix comparison fast path,
> >>> so I'd anticipate that multi column sorting

Re: Confusion on substrait AggregateRel::groupings and Arrow consumer

2023-07-10 Thread Weston Pace
Yes, that is correct.

What Substrait calls "groupings" is what is often referred to in SQL as
"grouping sets".  These allow you to compute the same aggregates but group
by different criteria.  Two very common ways of creating grouping sets are
"group by cube" and "group by rollup".  Snowflake's documentation for
rollup[1] describes the motivation quite well:

> You can think of rollup as generating multiple result sets, each
> of which (after the first) is the aggregate of the previous result
> set. So, for example, if you own a chain of retail stores, you
> might want to see the profit for:
>  * Each store.
>  * Each city (large cities might have multiple stores).
>  * Each state.
>  * Everything (all stores in all states).

Acero does not currently handle more than one grouping set.


[1] https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup
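
For the single-grouping case asked about below ("sum and mean of A grouped by
B, C, D"), the pyarrow-level equivalent is one grouping whose keys are the
three columns (a sketch with made-up data):

```
import pyarrow as pa

table = pa.table({
    "A": [1, 2, 3, 4],
    "B": ["x", "x", "y", "y"],
    "C": [10, 10, 20, 20],
    "D": [True, False, True, False],
})

# One grouping (grouping set) whose key expressions are B, C and D.
result = table.group_by(["B", "C", "D"]).aggregate([("A", "sum"), ("A", "mean")])
print(result)
```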

On Mon, Jul 10, 2023 at 2:22 PM Li Jin  wrote:

> Hi,
>
> I am looking at the substrait protobuf for AggregateRel as well the Acero
> substrait consumer code:
>
>
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc#L851
>
> https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L209
>
> Looks like in subtrait, AggregateRel can have multiple groupings and each
> grouping can have multiple expressions. Let's say now I want to "compute
> sum and mean on column A group by column B, C, D" (for Acero to execute).
> Is the right way to create one grouping with 3 expressions (direct
> reference) for "column B, C, D"?
>
> Thanks,
> Li
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Weston Pace
 the prefix comparison fast path.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
> >   wrote:
> >
> >>> I would be interested in hearing some input from the Rust community.
> >>   A couple of thoughts:
> >>
> >> The variable number of buffers would definitely pose some challenges for
> >> the Rust implementation, the closest thing we currently have is possibly
> >> UnionArray, but even then the number of buffers is still determined
> >> statically by the DataType. I therefore also wonder about the
> possibility
> >> of always having a single backing buffer that stores the character data,
> >> including potentially a copy of the prefix. This would also avoid
> forcing a
> >> branch on access, which I would have expected to hurt performance for
> some
> >> kernels quite significantly.
> >>
> >> Whilst not really a concern for Rust, which supports unsigned types, it
> >> does seem inconsistent to use unsigned types where the rest of the
> format
> >> encourages the use of signed offsets, etc...
> >>
> >> It isn't clearly specified whether a null should have a valid set of
> >> offsets, etc... I think it is an important property of the current array
> >> layouts that, with exception to dictionaries, the data in null slots is
> >> arbitrary, i.e. can take any value, but not undefined. This allows for
> >> separate handling of the null mask and values, which can be important
> for
> >> some kernels and APIs.
> >>
> >> More an observation than an issue, but UTF-8 validation for StringArray
> >> can be done very efficiently by first verifying the entire buffer, and
> then
> >> verifying the offsets correspond to the start of a UTF-8 codepoint. This
> >> same approach would not be possible for StringView, which would need to
> >> verify individual values and would therefore be significantly more
> >> expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
> >> validation is perhaps more important for Rust than for other languages.
> >>
> >> I presume that StringView will behave similarly to dictionaries in that
> >> the selection kernels will not recompute the underlying value buffers. I
> >> think this is fine, but it is perhaps worth noting this has caused
> >> confusion in the past, as people somewhat reasonably expect an array
> >> post-selection to have memory usage reflecting the smaller selection.
> This
> >> is then especially noticeable if the data is written out to IPC, and
> still
> >> contains data that was supposedly filtered out. My 2 cents is that
> explicit
> >> selection vectors are a less surprising way to defer selection than
> baking
> >> it into the array, but I also don't have any workloads where this is the
> >> major bottleneck so can't speak authoritatively here.
> >>
> >> Which leads on to my major concern with this proposal, that it adds
> >> complexity and cognitive load to the specification and implementations,
> >> whilst not meaningfully improving the performance of the operators that
> I
> >> commonly encounter as performance bottlenecks, which are multi-column
> sorts
> >> and aggregations, or the expensive string operations such as matching or
> >> parsing. If we didn't already have a string representation I would be
> more
> >> onboard, but as it stands I'm definitely on the fence, especially given
> >> selection performance can be improved in less intrusive ways using
> >> dictionaries or selection vectors.
> >>
> >> Kind Regards,
> >>
> >> Raphael Taylor-Davies
> >>
> >> On 02/07/2023 11:46, Andrew Lamb wrote:
> >>
> >>   * This is the first layout where the number of buffers depends on the
> >> data and not the schema. I think this is the most architecturally
> >> significant fact.
> >>
> >>   I have spent some time reading the initial proposal -- thank you for
> >> that. I now understand what Weston was saying about the "variable
> numbers
> >> of buffers". I wonder if you considered restricting such arrays to a
> single
> >> buffer (so as to make them more similar to other arrow array types that
> >> have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM Weston
> >> Pace  <mailto:weston.p...@gmail.com>  wrote:
> >>

Re: Do we need CODEOWNERS ?

2023-07-04 Thread Weston Pace
I agree the experiment isn't working very well.  I've been meaning to
change my listing from `compute` to `acero` for a while.  I'd be +1 for
just removing it though.

On Tue, Jul 4, 2023, 6:44 AM Dewey Dunnington 
wrote:

> Just a note that for me, the main problem is that I get automatic
> review requests for PRs that have nothing to do with R (I think this
> happens when a rebase occurs that contained an R commit). Because that
> happens a lot, it means I miss actual review requests and sometimes
> mentions because they blend in. I think CODEOWNERS results in me
> reviewing more PRs than if I had to set up some kind of custom
> notification filter but I agree that it's not perfect.
>
> Cheers,
>
> -dewey
>
> On Tue, Jul 4, 2023 at 10:04 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > Some time ago we added a `.github/CODEOWNERS` file in the main Arrow
> > repo. The idea is that, when specific files or directories are touched
> > by a PR, specific people are asked for review.
> >
> > Unfortunately, it seems that, most of the time, this produces the
> > following effects:
> >
> > 1) the people who are automatically queried for review don't show up
> > (perhaps they simply ignore those automatic notifications)
> > 2) when several people are assigned for review, each designated reviewer
> > seems to hope that the other ones will be doing the work, instead of
> > doing it themselves
> > 3) contributors expect those people to show up and are therefore
> > bewildered when nobody comes to review their PR
> >
> > Do we want to keep CODEOWNERS? If we still think it can be beneficial,
> > we should institute a policy where people who are listed in that file
> > promise to respond to review requests: 1) either by doing a review 2) or
> > by de-assigning themselves, and if possible pinging another core
> developer.
> >
> > What do you think?
> >
> > Regards
> >
> > Antoine.
>


Re: [ANNOUNCE] New Arrow committer: Kevin Gurney

2023-07-03 Thread Weston Pace
Congratulations Kevin!

On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Kevin Gurney
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> --
> kou
>


Re: Question about large exec batch in acero

2023-07-03 Thread Weston Pace
> is this overflow considered a bug? Or is large exec batch something that
should be avoided?

This is not a bug and it is something that should be avoided.

Some of the hash-join internals expect small batches.  I actually thought
the limit was 32Ki and not 64Ki because I think there may be some places we
are using int16_t as an index.  The reasoning is that the hash-join is
going to make multiple passes through the data (e.g. first to calculate the
hashes from the key columns and then again to encode the key columns,
etc.) and you're going to get better performance when your batches are
small enough that they fit into the CPU cache.  [1] is often given as a
reference for this idea.  Since this is the case there is not much need for
operating on larger batches.

> And does acero have any logic preventing that from happening

Yes, in the source node we take (potentially large) batches from the I/O
side of things and slice them into medium sized batches (about 1Mi rows) to
distribute across threads and then each thread iterates over that medium
sized batch in even smaller batches (32Ki rows) for actual processing.
This all happens here[2].

[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2]
https://github.com/apache/arrow/blob/6af660f48472b8b45a5e01b7136b9b040b185eb1/cpp/src/arrow/acero/source_node.cc#L120
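
As a rough standalone illustration of that slicing idea (not the actual
Acero code path; the batch and the 32Ki figure below are just for show),
`RecordBatch.slice` gives zero-copy chunks:

```
import pyarrow as pa

EXEC_BATCH_SIZE = 32 * 1024  # small enough to stay cache friendly

def iter_exec_batches(batch, size=EXEC_BATCH_SIZE):
    """Yield zero-copy slices of `batch`, each at most `size` rows long."""
    for start in range(0, batch.num_rows, size):
        yield batch.slice(start, size)

big = pa.record_batch([pa.array(range(200_000))], names=["key"])
print([b.num_rows for b in iter_exec_batches(big)])
# six full 32768-row slices plus a 3392-row remainder
```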

On Mon, Jul 3, 2023 at 6:50 AM Ruoxi Sun  wrote:

> Hi folks,
>
> I've encountered a bug when doing swiss join using a big exec batch, say,
> larger than 65535 rows, on the probe side. It turns out to be that in the
> algorithm, it is using `uint16_t` to represent the index within the probe
> exec batch (the materialize_batch_ids_buf
> <
> https://github.com/apache/arrow/blob/f951f0c42040ba6f584831621864f5c23e0f023e/cpp/src/arrow/acero/swiss_join.cc#L1897C8-L1897C33
> >),
> and row ids larger than 65535 will silently overflow and make the result
> nonsense.
>
> One thing to note is that I'm not exactly using acero "the acero way".
> Instead I carve out some pieces of code from acero and run them
> individually. So I'm just wondering that, is this overflow considered a
> bug? Or is large exec batch something that should be avoided? (And does
> acero have any logic preventing that from happening, e.g., some wild man
> like me just throws it an arbitrarily large exec batch?)
>
> Thanks.
>
> *Rossi*
>


Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-29 Thread Weston Pace
Is your use case to operate on a batch of graphs?  For example, do you have
hundreds or thousands of graphs that you need to run these algorithms on at
once?

Or is your use case to operate on a single large graph?  If it's the
single-graph case then how many nodes do you have?

If it's one graph and the graph itself is pretty small and fits into cache,
then I'm not sure the in-memory representation will matter much (though
maybe the search space is large enough to justify a different
representation)


On Thu, Jun 29, 2023 at 6:22 PM Bechir Ben Daadouch 
wrote:

> Dear Apache Arrow Dev Community,
>
> My name is Bechir, I am currently working on a project that involves
> implementing graph algorithms in Apache Arrow.
>
> The initial plan was to construct a node structure and a subsequent graph
> that would encompass all the nodes. However, I quickly realized that due to
> Apache Arrow's columnar format, this approach was not feasible.
>
> I tried a couple of things, including the implementation of the
> shortest-path algorithm. However, I rapidly discovered that manipulating
> arrow objects, particularly when applying graph algorithms, proved more
> complex than anticipated and it became very clear that I would need to
> resort to some data structures outside of what arrow offers (i.e.: Heapq
> wouldn't be possible using arrow).
>
> I also gave a shot at doing it similar to a certain SQL method (see:
> https://ibb.co/0rPGB42 ), but ran into some roadblocks there too and I
> ended up having to resort to using Pandas for some transformations.
>
> My next course of action is to experiment with compressed sparse rows,
> hoping to execute Matrix Multiplication using this method. But honestly,
> with what I know right now, I remain skeptical about the feasibility
> of it. However,
> before committing to this approach, I would greatly appreciate your opinion
> based on your experience with Apache Arrow.
>
> Thank you very much for your time.
>
> Looking forward to potentially discussing this further.
>
> Many thanks,
> Bechir
>


Re: [C++] Dealing with third party method that raises exception

2023-06-29 Thread Weston Pace
We do this quite a bit in the Arrow<->Parquet bridge, IIUC. There are
macros defined like this:

```
#define BEGIN_PARQUET_CATCH_EXCEPTIONS try {
#define END_PARQUET_CATCH_EXCEPTIONS                       \
  }                                                        \
  catch (const ::parquet::ParquetStatusException& e) {     \
    return e.status();                                     \
  }                                                        \
  catch (const ::parquet::ParquetException& e) {           \
    return ::arrow::Status::IOError(e.what());             \
  }
```

That being said, I'm not particularly fond of macros, but it works.

On Thu, Jun 29, 2023 at 8:09 AM Li Jin  wrote:

> Thanks Antoine - the examples are useful - I can use the same pattern for
> now. Thanks for the quick response!
>
> On Thu, Jun 29, 2023 at 10:47 AM Antoine Pitrou 
> wrote:
>
> >
> > Hi Li,
> >
> > There is not currently, but it would probably be a useful small utility.
> > If you look for `std::exception` in the codebase, you'll find that there
> > a couple of places where we turn it into a Status already.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 29/06/2023 à 16:20, Li Jin a écrit :
> > > Hi,
> > >
> > > IIUC, most of the Arrow C++ code doesn't use exceptions. My
> question
> > is
> > > are there some Arrow utility / macro that wrap the function/code that
> > might
> > > raise an exception and turn that into code that returns an arrow error
> > > Status?
> > >
> > > Thanks!
> > > Li
> > >
> >
>


Re: Question about nested columnar validity

2023-06-29 Thread Weston Pace
>> 2. For StringView and ArrayView, if the parent has `validity = false`.
>>  If they have `validity = true`, their offset might point to an
>>  invalid position

>I have no idea, but I hope not. Ben Kietzman might want to answer more
>precisely here.

I think, for view arrays, the offsets & lengths are indeed "unspecified".
Imagine we are creating a view array programmatically from two child arrays
of starts and lengths and one of those child arrays has a null in it so we
propagate that and create a null view.  We would set the validity bit to
0.  We could probably mandate the "length == 0" if needed but I'm not sure
there is anything that would make sense to require for the "offset". It
would seem the simplest thing to do is allow it to be unspecified.  Then
the entire operation can be:

 * Set validity to lengths.validity & offsets.validity
 * Set lengths to lengths.values
 * Set offsets to offsets.values
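
As a tiny sketch of that combination step in plain pyarrow (the `starts`
and `lengths` arrays are hypothetical stand-ins for the two child arrays):

```
import pyarrow as pa
import pyarrow.compute as pc

starts = pa.array([0, 4, None, 9])
lengths = pa.array([4, None, 3, 2])

# validity of the resulting views: both children must be non-null
validity = pc.and_(starts.is_valid(), lengths.is_valid())
print(validity)  # [true, false, false, true]
```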

On Thu, Jun 29, 2023 at 6:51 AM Antoine Pitrou  wrote:

>
> Le 29/06/2023 à 15:16, wish maple a écrit :
> > Sorry for being misleading. "valid" offset means that:
> > 1. For Binary Like [1] format, and List formats [2], even if the parent
> >  has `validity = false`. Their offset should be well-defined.
>
> Yes.
>
> > 2. For StringView and ArrayView, if the parent has `validity = false`.
> >  If they have `validity = true`, their offset might point to an
> >  invalid position
>
> I have no idea, but I hope not. Ben Kietzman might want to answer more
> precisely here.
>
> Regards
>
> Antoine.
>


Re: Question about nested columnar validity

2023-06-28 Thread Weston Pace
I agree with Antoine but I get easily confused by "valid, as in
structurally correct" and "valid, as in not null" so I want to make sure I
understand:

> The child of a nested
> array should be valid itself, independently of the parent's validity
bitmap.

A child must be "structurally correct" (e.g. validate correctly) but it can
have null values.

For example, given:
  * we have an array of type struct<{list}>
  * the third element of the parent struct array is null
Then
 * The third element in the list array could be null
 * The third element in the list array could be valid (not-null)
 * In all cases the offsets must exist for the third element in the list
array

> If it's a BinaryArray, when its parent is not valid, would a validity
> member point to an undefined address?
>
> And if it's ListArray[3], when its parent is not valid, should its offset
> and size be valid?

When a binary array or a list array element is null, the cleanest thing to
do is to set the offsets to be the same.  So, for example, given a list
array with 5 elements, if the second item is null, the offsets could be 0, 8,
8, 12, 20, 50.

Question for Antoine: is it correct for the offsets of a null element to be
different? For the above example, could the offsets be 0, 8, 10, 12, 20,
50?  I think the answer is "yes, this is correct".  However, if this is the
case, then the list's values array must still have 50 items.
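
A quick pyarrow check of the same idea (nothing view-specific here, just a
plain list array with a null slot):

```
import pyarrow as pa

arr = pa.array([[1, 2], None, [3, 4, 5]], type=pa.list_(pa.int64()))

arr.validate(full=True)  # passes: offsets are well formed even for the null slot
print(arr.is_valid())    # [true, false, true]
print(arr.values)        # [1, 2, 3, 4, 5] -- the child array is valid on its own
```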

On Wed, Jun 28, 2023 at 8:18 AM Antoine Pitrou  wrote:

>
> Hi!
>
> Le 28/06/2023 à 17:03, wish maple a écrit :
> > Hi,
> >
> > By looking at the arrow standard, when it comes to nested structures, like
> > StructArray[1] or FixedListArray[2], when the parent is not valid, the
> > corresponding child is left "undefined".
> >
> > If it's a BinaryArray, when its parent is not valid, would a validity
> > member point to an undefined address?
> >
> > And if it's ListArray[3], when its parent is not valid, should its offset
> > and size be valid?
>
> Offsets and sizes always need to be valid. The child of a nested
> array should be valid itself, independently of the parent's validity
> bitmap.
>
> You can also check this with ValidateFull(): it should catch such problems.
>
> Regards
>
> Antoine.
>


Re: Enabling apache/arrow GitHub dependency graph with vcpkg

2023-06-28 Thread Weston Pace
Thanks for reaching out.  This sounds like a useful tool and I'm happy to
hear about more development around establishing supply chain awareness.
However, Arrow is an Apache Software Project and, as such, we don't manage
all of the details of our Github repository.  Some of these (including, I
believe, selection of integrations) are managed by the ASF infrastructure
team[1].

We can contact them to request this integration, but if your interest is
primarily in getting feedback around the setup and configuration of
integration, then I'm not sure we'd be very helpful as the process would be
pretty opaque to us.  You may instead want to contact the Infra team
directly.

[1] https://infra.apache.org/



On Tue, Jun 27, 2023 at 1:57 PM Michael Price
 wrote:

> Hello Apache Arrow project,
>
>
>
> The Microsoft C++ team has been working with our partners at GitHub to
> improve the C and C++ user experience on their platform. As a part of that
> effort, we have added vcpkg support for the GitHub dependency graph
> feature. We are looking for feedback from GitHub repositories, like
> apache/arrow, that are using vcpkg so we can identify improvements to this
> new feature.
>
>
>
> Enabling this feature for your repositories brings a number of benefits,
> now and in the future:
>
>
>
>   *   Visibility - Users can easily see which packages you depend on and
> their versions. This includes transitive dependencies not listed in your
> vcpkg.json manifest file.
>   *   Compliance - Generate an SBOM from GitHub that includes C and C++
> dependencies as well as other supported ecosystems.
>   *   Networking - A fully functional dependency graph allows you to not
> only see your dependencies, but also other GitHub projects that depend on
> you, letting you get an idea of how many people depend on your efforts. We
> want to hear from you if we should prioritize enabling this.
>   *   Security - The intention is to enable GitHub's secure supply chain
> features<
> https://docs.github.com/code-security/supply-chain-security/understanding-your-software-supply-chain/about-supply-chain-security>.
> Those features are not available yet, but when they are, you'll already be
> ready to use them on day one.
>
>
>
> What's Involved?
>
>
>
> If you decide to help us out, here's how that would look:
>
>   *   Enable the integration following our documentation. See GitHub
> integrations - The GitHub dependency graph<
> https://aka.ms/vcpkg-dependency-graph> more information.
>   *   Send us a follow-up email letting us know if the documentation
> worked and was clear, and what missing functionality is most important to
> you.
>   *   If you have problem enabling the integration, we'll work directly
> with you to resolve your issue.
>   *   We will schedule a brief follow-up call (15-20) with you after the
> feature is enabled to discuss your feedback.
>   *   When we make improvements, we'd like you to try them out to let us
> know if we are solving the important problems.
>   *   Eventually, we'd like to get a "thumbs up" or "thumbs down" on
> whether or not you think the feature is complete enough to no longer be an
> experiment.
>   *   We'll credit you for your help when we make the move out of
> experimental and blog about the transition to fully supported.
>
>
>
> If you are interested in collaborating with us, let us know by replying to
> this email.
>
>
>
> Thanks,
>
>
>
> Michael Price
> Product Manager, Microsoft C++ Team
>
>
>


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Weston Pace
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.

In my mind there is a distinction between the "compute domain" (e.g. a
pandas dataframe or something like ibis or SQL) and the "data domain" (e.g.
pyarrow datasets).  I think, in a perfect world, you could push any and all
compute up and down the chain as far as possible.  However, in practice, I
think there is a healthy set of tools and libraries that say "simple column
projection and filtering is good enough".  I would argue that there is room
for both APIs and while the temptation is always present to "shove as much
compute as you can" I think pyarrow datasets seem to have found a balance
between the two that users like.

So I would argue that this protocol may never become a general-purpose
unmaterialized dataframe and that isn't necessarily a bad thing.
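
For what it's worth, the "projection and filtering is good enough" style
looks roughly like this with today's pyarrow datasets (the path and column
names are made up):

```
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("data/events", format="parquet", partitioning="hive")

# only column pruning and row filtering are pushed down; anything fancier
# happens after the scan in whatever compute layer you prefer
table = dataset.scanner(
    columns=["user_id", "amount"],
    filter=pc.field("amount") > 100,
).to_table()
```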

> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.

Just to clarify, the proposal currently only requires the fragments to be
serializable correct?

On Fri, Jun 23, 2023 at 11:48 AM Will Jones  wrote:

> Thanks Ian for your extensive feedback.
>
> I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions.
> >
>
> I would agree with this point, if we were starting from scratch. But one of
> my goals is for this protocol to be descriptive of the existing dataset
> integrations in the ecosystem, which all currently rely on PyArrow
> expressions. For example, you'll notice in the PR that there are unit tests
> to verify the current PyArrow Dataset classes conform to this protocol,
> without changes.
>
> I think there's three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious migration paths for existing producers and consumers.
> 2. We keep the overall dataset API, but don't introduce the filter and
> projection arguments until we have Substrait support. I'm not sure what the
> migration path looks like for producers and consumers, but I think this
> just implicitly becomes the same as (1), but with worse documentation.
> 3. We write a protocol completely from scratch, that doesn't try to
> describe the existing dataset API. Producers and consumers would then
> migrate to use the new protocol and deprecate their existing dataset
> integrations. We could introduce a dunder method in that API (sort of like
> __arrow_array__) that would make the migration seamless from the end-user
> perspective.
>
> *Which do you all think is the best path forward?*
>
> Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset.
> >
>
> This is a good point. I can add a section describing the differences. The
> main ones I can think of are that: (1) Datasets are "pruneable": one can
> select a subset of columns and apply a filter on rows to avoid IO and (2)
> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.
>
> Best,
>
> Will Jones
>
> On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:
>
> > Thanks Will for this proposal!
> >
> > For anyone familiar with PyArrow, this idea has a clear intuitive
> > logic to it. It provides an expedient solution to the current lack of
> > a practical means for interchanging "unmaterialized dataframes"
> > between different Python libraries.
> >
> > To elaborate on that

Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Weston Pace
Congrats Dewey!

On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:

>
> Welcome to the PMC Dewey!
>
>
> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> > Congrats Dewey!
> >
> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
> >  wrote:
> >>
> >> Well deserved! Congratulations Dewey!
> >>
> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
> >>
> >>> Congratulations Dewey!
> >>>
> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> >>> wrote:
> 
>  Congrats Dewey!!
> 
>  On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin
> 
>  wrote:
> 
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane 
> wrote:
> >
> >> Well-deserved Dewey, congratulations!
> >>
> >> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon  >
> >> wrote:
> >>
> >>> Congratulations Dewey!
> >>>
> >>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> >>> ale...@voltrondata.com
> >>> .invalid>
> >>> wrote:
> >>>
>  Congratulations Dewey!! 🎉
> 
>  On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> > raulcumpl...@gmail.com
> >>>
>  wrote:
> 
> > Congratulations Dewey!
> >
> > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> >>> escribió:
> >
> >> The Project Management Committee (PMC) for Apache Arrow has
> > invited
> >> Dewey Dunnington (paleolimbot) to become a PMC member and we
> >>> are
>  pleased
> > to
> >> announce
> >> that Dewey Dunnington has accepted.
> >>
> >> Congratulations and welcome!
> >>
> >
> 
> >>>
> >>
> >
> >>>
>


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Weston Pace
One small difference seems to be that Close is idempotent and Cancel is not.

> void cancel()
>  throws SQLException
>
> Cancels this Statement object if both the DBMS and driver support
aborting an SQL statement. This method can be used by one thread to cancel
a statement that is being executed by another thread.
>
> Throws:
> SQLException - if a database access error occurs or this method is
called on a closed Statement

In other words, with cancel, you can display an error to the user if the
statement is already finished (and thus was not able to be canceled).
However, I don't know if that is significant at all.

On Fri, Jun 23, 2023 at 12:17 AM Sutou Kouhei  wrote:

> Hi,
>
> Thanks for sharing your thoughts.
>
> OK. I'll change the current specifications/implementations
> to the followings:
>
> * Remove CloseFlightInfo (if nobody objects it)
> * RefreshFlightEndpoint ->
>   RenewFlightEndpoint
> * RenewFlightEndpoint(FlightEndpoint) ->
>   RenewFlightEndpoint(RenewFlightEndpointRequest)
> * CancelFlightInfo(FlightInfo) ->
>   CancelFlightInfo(CancelFlightInfoRequest)
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22
> Jun 2023 12:51:55 -0400,
>   Matt Topol  wrote:
>
> >> That said, I think it's reasonable to only have Cancel at the protocol
> > level.
> >
> > I'd be in favor of only having Cancel too. In theory calling Cancel on
> > something that has already completed should just be equivalent to calling
> > Close anyways rather than requiring a client to guess and call Close if
> > Cancel errors or something.
> >
> >> So this may not be needed for now. How about accepting a
> >> specific request message instead of FlightEndpoint directly
> >> as "PersistFlightEndpoint" input?
> >
> > I'm also in favor of this.
> >
> >> I think Refresh was fine, but if there's confusion, I like Kou's
> > suggestion of Renew the best.
> >
> > I'm in the same boat as David here, I think Refresh was fine but like the
> > suggestion of Renew best if we want to avoid any confusion.
> >
> >
> >
> > On Thu, Jun 22, 2023 at 2:55 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Doesn't protobuf ensure forwards compatibility? Why would it break?
> >>
> >> At worse, you can include the changes necessary for it to compile
> >> cleanly, without adding support for the new fields/methods?
> >>
> >>
> >> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
> >> > Hi,
> >> >
> >> > The following part in the original e-mail is the one:
> >> >
> >> >> https://github.com/apache/arrow/pull/36009 is an
> >> >> implementation of this proposal. The pull requests has the
> >> >> followings:
> >> >>
> >> >> 1. Format changes:
> >> >> * format/Flight.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
> >> >> * format/FlightSql.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
> >> >>
> >> >> 2. Documentation changes:
> >> >> docs/source/format/Flight.rst
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
> >> >
> >> > We can split the part to a separated pull request. But if we
> >> > split the part and merge the pull requests for format
> >> > related changes and implementation related changes
> >> > separately, our CI will be broken temporary. Because our
> >> > implementations use auto-generated sources that are based on
> >> > *.proto.
> >> >
> >> >
> >> > Thanks,
> >>
>


Re: Question about `minibatch`

2023-06-20 Thread Weston Pace
Those goals are somewhat compatible.  Sasha can probably correct me if I
get this wrong but my understanding is that the minibatch is just large
enough to ensure reliable vectorized execution.  It is used in some
innermost critical sections to both keep the working set small (fit in L1)
and avoid allocation.

In addition to ensuring things fit in L1 there is also, I believe, a side
benefit of using small loops to increase the chances of encountering
special cases (e.g. all values null or no values null) which can sometimes
save you from more complex logic.

On Tue, Jun 20, 2023 at 7:32 PM Ruoxi Sun  wrote:

> Hi,
>
> By looking at acero code, I'm curious about the concept `minibatch` being
> used in swiss join and grouper.
> I wonder if its purpose is to proactively limit the memory size of the
> working set? Or is it the consequence of that the temp vector should be
> fix-sized (to avoid costly memory allocation)? Additionally, what's the
> impact of choosing the size of the minibatch?
>
> Really appreciate if someone can help me to clear this.
>
> Thanks.
>
> *Rossi*
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-20 Thread Weston Pace
Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements.

Compared to some of the other layouts we've discussed recently (REE, list
veiw) I do think this layout is more unique and fundamentally different.
Perhaps most fundamentally different:

 * This is the first layout where the number of buffers depends on the data
and not the schema.  I think this is the most architecturally significant
fact.  It does require a (backwards compatible) change to the IPC format
itself, beyond just adding new type codes.  It also poses challenges in
places where we've assumed there will be at most 3 buffers (e.g. in
ArraySpan, though, as you have shown, we can work around this using a raw
pointers representation internally in those spots).

I think you've done some great work to integrate this well with Arrow-C++
and I'm convinced it can work.

I would be interested in hearing some input from the Rust community.

Ben, at one point there was some discussion that this might be a c-data
only type.  However, I believe that was based on the raw pointers
representation.  What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)
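
(A note for later readers: once an implementation ships the type this is
easy to observe. The sketch below assumes a pyarrow new enough to have
`string_view` and a string -> string_view cast, which did not exist when
this thread was written.)

```
import pyarrow as pa

classic = pa.array(["tiny", "x" * 100])        # regular string array
views = classic.cast(pa.string_view())         # string-view array

# a classic string array always has the same buffer count (validity,
# offsets, data); for string views the count depends on the data itself
print(len(classic.buffers()), len(views.buffers()))
```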

On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman 
wrote:

> Hi Gang,
>
> I'm not sure what you mean, sorry if my answers are off base:
>
> Parquet's ByteArray will be unaffected by the addition of the string view
> type;
> all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
> with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> during serialization to parquet [1].
>
> If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
> as encoding of equivalent arrow::Type::STRING, that's something I haven't
> benchmarked so I can't answer definitively. I would expect it to be faster
> than
> first converting STRING_VIEW->STRING then encoding to parquet; direct
> encoding avoids allocating and populating temporary buffers. Of course this
> only applies to cases where you need to encode an array of STRING_VIEW to
> parquet- encoding of STRING to parquet will be unaffected.
>
> Sincerely,
> Ben
>
> [1]
>
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
>
> On Thu, Jun 15, 2023 at 10:34 PM Gang Wu  wrote:
>
> > Hi Ben,
> >
> > The posted benchmark [1] looks pretty good to me. However, I want to
> > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> > uses a customized parquet::ByteArray type [2] for string/binary, I would
> > expect some regression of conversions between parquet reader/writer
> > and the proposed string view array, especially when some strings use
> > short form and others use long form.
> >
> > [1]
> >
> >
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 16, 2023 at 3:58 AM Will Jones 
> > wrote:
> >
> > > Cool. Thanks for doing that!
> > >
> > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman 
> > > wrote:
> > >
> > > > I've added https://github.com/apache/arrow/issues/36112 to track
> > > > deduplication of buffers on write.
> > > > I don't think it would require modification of the IPC format.
> > > >
> > > > Ben
> > > >
> > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol 
> > > wrote:
> > > >
> > > > > Based on my understanding, in theory a buffer *could* be shared
> > within
> > > a
> > > > > batch since the flatbuffers message just uses an offset and length
> to
> > > > > identify the buffers.
> > > > >
> > > > > That said, I don't believe any current implementation actually does
> > > this
> > > > or
> > > > > takes advantage of this in any meaningful way.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones <
> will.jones...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > It's exciting to see this move along.
> > > > > >
> > > > > > The buffers will be duplicated. If buffer duplication is becomes
> a
> > > > > concern,
> > > > > > > I'd prefer to handle
> > > > > > > that in the ipc writer. Then buffers which are duplicated could
> > be
> > > > > > detected
> > > > > > > by checking
> > > > > > > pointer identity and written only once.
> > > > > >
> > > > > >
> > > > > > Question: to be able to write buffer only once and reference in
> > > > multiple
> > > > > > arrays, does that require a change to the IPC format? Or is
> sharing
> > > > > buffers
> > > > > > within the same batch already allowed in the IPC format?
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Will Jones
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benja

Re: [ANNOUNCE] New Arrow PMC member: Ben Baumgold,

2023-06-20 Thread Weston Pace
Congratulations Ben!

On Tue, Jun 20, 2023 at 7:38 AM Jacob Quinn  wrote:

> Yay! Congrats Ben! Love to see more Julia folks here!
>
> -Jacob
>
> On Tue, Jun 20, 2023 at 4:15 AM Andrew Lamb  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Ben Baumgold, to become a PMC member and we are pleased to announce
> > that Ben Baumgold has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: pyarrow Table.from_pylist doesn;t release memory

2023-06-15 Thread Weston Pace
Note that you can ask pyarrow how much memory it thinks it is using with
the pyarrow.total_allocated_bytes[1] function.  This can be very useful for
tracking memory leaks.

I see that memory-profiler now has support for different backends. Sadly,
it doesn't look like you can register a custom backend.  Might be a fun
project if someone wanted to add a pyarrow backend for it :)

[1]
https://arrow.apache.org/docs/python/generated/pyarrow.total_allocated_bytes.html
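
For example (a small sketch; the exact numbers depend on the allocator and
buffer padding):

```
import pyarrow as pa

before = pa.total_allocated_bytes()
table = pa.table({"x": list(range(1_000_000))})
print(pa.total_allocated_bytes() - before)  # roughly 8 MB of int64 values

del table
print(pa.total_allocated_bytes() - before)  # drops back to ~0 once released
```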

On Thu, Jun 15, 2023 at 9:16 AM Antoine Pitrou  wrote:

>
> Hi Alex,
>
> I think you're misinterpreting the results. Yes, the RSS memory (as
> reported by memory_profiler) doesn't seem to decrease. No, it doesn't
> mean that Arrow doesn't release memory. It's actually common for memory
> allocators (such as jemalloc, or the system allocator) to keep
> deallocated pages around, because asking the kernel to recycle them is
> expensive.
>
> Unless your system is running low on memory, you shouldn't care about
> this. Trying to return memory to the kernel can actually make
> performance worse if you ask Arrow to allocate memory soon after.
>
> That said, you can try to call MemoryPool.release_unused() if these
> numbers are important to you:
>
> https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
>
> Regards
>
> Antoine.
>
>
>
> Le 15/06/2023 à 17:39, Jerald Alex a écrit :
> > Hi Experts,
> >
> > I have come across the memory pool configurations using an environment
> > variable *ARROW_DEFAULT_MEMORY_POOL* and I tried to make use of them and
> > test it.
> >
> > I could observe improvements on macOS with the *system* memory pool but
> no
> > change on linux os. I have captured more details on GH issue
> > https://github.com/apache/arrow/issues/36100... If anyone can highlight or
> > suggest a way to overcome this problem, that would be helpful. Appreciate
> > your help!
> >
> > Regards,
> > Alex
> >
> > On Wed, Jun 14, 2023 at 9:35 PM Jerald Alex  wrote:
> >
> >> Hi Experts,
> >>
> >> Pyarrow *Table.from_pylist* does not release memory until the program
> >> terminates. I created a sample script to highlight the issue. I have
> also
> >> tried setting up `pa.jemalloc_set_decay_ms(0)` but it didn't help much.
> >> Could you please check this and let me know if there are potential
> issues /
> >> any workaround to resolve this?
> >>
> > pyarrow.__version__
> >> '12.0.0'
> >>
> >> OS Details:
> >> OS: macOS 13.4 (22F66)
> >> Kernel Version: Darwin 22.5.0
> >>
> >>
> >>
> >> Sample code to reproduce. (it needs memory_profiler)
> >>
> >> # file_name: test_exec.py
> >> import pyarrow as pa
> >> import time
> >> import random
> >> import string
> >>
> >> from memory_profiler import profile
> >>
> >> def get_sample_data():
> >>     record1 = {}
> >>     for col_id in range(15):
> >>         record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
> >>     return [record1]
> >>
> >> def construct_data(data):
> >>     count = 1
> >>     while count < 10:
> >>         pa.Table.from_pylist(data * 10)
> >>         count += 1
> >>     return True
> >>
> >> @profile
> >> def main():
> >>     data = get_sample_data()
> >>     construct_data(data)
> >>     print("construct data completed!")
> >>
> >> if __name__ == "__main__":
> >>     main()
> >>     time.sleep(600)
> >>
> >>
> >> memory_profiler output:
> >>
> >> Filename: test_exec.py
> >>
> >> Line #    Mem usage    Increment  Occurrences   Line Contents
> >> =============================================================
> >>     41     65.6 MiB     65.6 MiB           1   @profile
> >>     42                                         def main():
> >>     43     65.6 MiB      0.0 MiB           1       data = get_sample_data()
> >>     44    203.8 MiB    138.2 MiB           1       construct_data(data)
> >>     45    203.8 MiB      0.0 MiB           1       print("construct data completed!")
> >>
> >> Regards,
> >> Alex
> >>
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
> Can't implementations add support as needed? I assume that the "depending
on what support [it] aspires to" implies this, but if a feature isn't used
in a community then it can leave it unimplemented. On the flip side, if it
is used in a community (e.g. C++) is there no way to upstream it without
the support of every community?

I think that is something that is more tolerable for something like REE or
dictionary support which is purely additive (e.g. JS and C# don't support
unions yet and can get around to it when it is important).

The challenge for this kind of "alternative layout" is that you start to
get a situation where some implementations choose "option A" and others
choose "option B" and it's not clearly a case of "this is a feature we
haven't added support for yet".

On Wed, Jun 14, 2023 at 2:01 PM Antoine Pitrou  wrote:

>
> So each community would have its own version of the Arrow format?
>
>
> Le 14/06/2023 à 22:47, Aldrin a écrit :
> >  > Arrow has at least 7 native "official" implementations... 5 bindings
> > on C++... and likely other implementations (like arrow2 in rust)
> >
> >> I think it is worth remembering that depending on what level of support
> > ListView aspires to, such an addition could require non trivial changes
> to
> > many / all of those implementations (and the APIs they expose).
> >
> > Can't implementations add support as needed? I assume that the
> > "depending on what support [it] aspires to" implies this, but if a
> > feature isn't used in a community then it can leave it unimplemented. On
> > the flip side, if it is used in a community (e.g. C++) is there no way
> > to upstream it without the support of every community?
> >
> >
> >
> > Sent from Proton Mail for iOS
> >
> >
> > On Wed, Jun 14, 2023 at 13:06, Raphael Taylor-Davies
> > mailto:On Wed, Jun 14, 2023 at
> > 13:06, Raphael Taylor-Davies <> wrote:
> >> Even something relatively straightforward becomes a huge implementation
> >> effort when multiplied by a large number of codebases, users and
> >> datasets. Parquet is a great source of historical examples of the
> >> challenges of incremental changes that don't meaningfully unlock new
> >> use-cases. To take just one, Int96 was deprecated almost a decade ago,
> >> in favour of some additional metadata over an existing physical layout,
> >> and yet Int96 is still to the best of my knowledge used by Spark by
> >> default.
> >>
> >> That's not to say that I think the arrow specification should ossify and
> >> we should never change it, but I'm not hugely enthusiastic about adding
> >> encodings that are only incremental improvements over existing
> encodings.
> >>
> >> I therefore wonder if there are some new use-cases I am missing that
> >> would be unlocked by this change, and that wouldn't be supported by the
> >> dictionary proposal? Perhaps you could elaborate here? Whilst I do agree
> >> using dictionaries as proposed is perhaps a less elegant solution, I
> >> don't see anything inherently wrong with it, and if it ain't broke we
> >> really shouldn't be trying to fix it.
> >>
> >> Kind Regards,
> >>
> >> Raphael Taylor-Davies
> >>
> >> On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho
> >>  wrote:
> >>
> >> General approach to alternative formats aside, in the specific case
> >> of ListView, I think the implementation complexity is being
> >> overestimated in these discussions. The C++ Arrow implementation
> >> shares a lot of code between List and LargeList. And with some
> >> tweaks, I'm able to share that common infrastructure for ListView as
> >> well. [1] ListView is similar to list: it doesn't require offsets to
> >> be sorted and adds an extra buffer containing sizes. For symmetry
> >> with the List and LargeList types (FixedSizeList not included), I'm
> >> going to propose we add a LargeListView. That is not part of the
> >> draft implementation yet, but seems like an obvious thing to have
> >> now that I implemented the `if_else` specialization. [2] David Li
> >> asked about this above and I can confirm now that 64-bit version of
> >> ListView (LargeListView) is in the plans. Trying to avoid
> >> re-implementing some kernels is not a good goal to chase, IMO,
> >> because kernels need tweaks to take advantage of the format. [1]
> >> https://github.

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
; It seems to me that most projects that are implementing Arrow today
> >> are not aiming to provide complete coverage of Arrow; rather they are
> >> adopting Arrow because of its role as a standard and they are
> >> implementing only as much of the Arrow standard as they require to
> >> achieve some goal. I believe that such projects are important Arrow
> >> stakeholders, and I believe that this proposed notion of canonical
> >> alternative layouts will serve them well and will create efficiencies
> >> by standardizing implementations around a shared set of alternatives.
> >>
> >> However I think that the documentation for canonical alternative
> >> layouts should strongly encourage implementers to default to using the
> >> primary layouts defined in the core spec and only use alternative
> >> layouts in cases where the primary layouts do not meet their needs.
> >>
> >>
> >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield 
> >> wrote:
> >>>
> >>> This sounds reasonable to me but my main concern is, I'm not sure there
> >> is
> >>> a great mechanism to enforce canonical layouts don't somehow become
> >> default
> >>> (or the only implementation).
> >>>
> >>> Even for these new layouts, I think it might be worth rethinking
> binding
> >> a
> >>> layout into the schema versus having a different concept of encoding
> (and
> >>> changing some of the corresponding data structures).
> >>>
> >>>
> >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace 
> >> wrote:
> >>>
> >>>> Trying to settle on one option is a fruitless endeavor.  Each type has
> >> pros
> >>>> and cons.  I would also predict that the largest existing usage of
> >> Arrow is
> >>>> shuttling data from one system to another.  The newly proposed format
> >>>> doesn't appear to have any significant advantage for that use case (if
> >>>> anything, the existing format is arguably better as it is more
> >> compact).
> >>>>
> >>>> I am very biased towards historical precedent and avoiding breaking
> >>>> changes.
> >>>>
> >>>> We have "canonical extension types", perhaps it is time for "canonical
> >>>> alternative layouts".  We could define it as such:
> >>>>
> >>>>   * There are one or more primary layouts
> >>>> * Existing layouts are automatically considered primary layouts,
> >> even if
> >>>> they wouldn't
> >>>>   have been primary layouts initially (e.g. large list)
> >>>>   * A new layout, if it is semantically equivalent to another, is
> >> considered
> >>>> an alternative layout
> >>>>   * An alternative layout still has the same requirements for adoption
> >> (two
> >>>> implementations and a vote)
> >>>> * An implementation should not feel pressured to rush and
> implement
> >> the
> >>>> new layout.
> >>>>   It would be good if they contribute in the discussion and
> >> consider the
> >>>> layout and vote
> >>>>   if they feel it would be an acceptable design.
> >>>>   * We can define and vote and approve as many canonical alternative
> >> layouts
> >>>> as we want:
> >>>> * A canonical alternative layout should, at a minimum, have some
> >>>>   reasonable justification, such as improved performance for
> >> algorithm X
> >>>>   * Arrow implementations MUST support the primary layouts
> >>>>   * An Arrow implementation MAY support a canonical alternative,
> >> however:
> >>>> * An Arrow implementation MUST first support the primary layout
> >>>> * An Arrow implementation MUST support conversion to/from the
> >> primary
> >>>> and canonical layout
> >>>> * An Arrow implementation's APIs MUST only provide data in the
> >>>> alternative
> >>>>   layout if it is explicitly asked for (e.g. schema inference
> should
> >>>> prefer the primary layout).
> >>>>   * We can still vote for new primary layouts (e.g. promoting a
> >> canonical
> >>>> alternative) but, in these
> >>>>  votes we don't o

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Weston Pace
Are you looking for something in C++ or python?  We have a thing called the
"grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which
(if memory serves) is the heart of the functionality in C++.  It would be
nice to add some python bindings for this functionality as this ask comes
up from pyarrow users pretty regularly.

The grouper is used in src/arrow/dataset/partition.h to partition a record
batch into groups of batches.  This is how the dataset writer writes a
partitioned dataset.  It's a good example of how you would use the grouper
for a "one batch in, one batch per group out" use case.

The grouper can also be used in a streaming situation (many batches in, one
batch per group out).  In fact, the grouper is what is used by the group by
node.  I know you recently added [1] and I'm maybe a little uncertain what
the difference is between this ask and the capabilities added in [1].

[1] https://github.com/apache/arrow/pull/35514
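
If the C++ grouper is more than you need and "many batches in, one table
per group out" is enough, a plain pyarrow sketch works too (this is not the
Grouper API, just per-group filtering, and it assumes the last column is a
dense group id in 0..k-1):

```
import pyarrow as pa
import pyarrow.compute as pc

def group_rows(batches, k):
    """Collect the rows of each group id (last column) into one table per group."""
    batches = list(batches)
    schema = batches[0].schema
    per_group = [[] for _ in range(k)]
    for batch in batches:
        group_ids = batch.column(batch.num_columns - 1)
        # k filter passes per batch; fine for small k, the C++ grouper
        # does the same partitioning in a single pass
        for i in range(k):
            per_group[i].append(batch.filter(pc.equal(group_ids, i)))
    return [pa.Table.from_batches(parts, schema=schema) for parts in per_group]
```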

On Tue, Jun 13, 2023 at 8:23 AM Li Jin  wrote:

> Hi,
>
> I am trying to write a function that takes a stream of record batches
> (where the last column is group id), and produces k record batches, where
> record batches k_i contain all the rows with group id == i.
>
> Pseudocode is sth like:
>
> def group_rows(batches, k) -> array[RecordBatch] {
>   builders = array[RecordBatchBuilder](k)
>   for batch in batches:
># Assuming last column is the group id
>group_ids = batch.column(-1)
>for i in batch.num_rows():
> k_i = group_ids[i]
> builders[k_i].append(batch[i])
>
>batches = array[RecordBatch](k)
>for i in range(k):
>batches[i] = builders[i].build()
>return batches
> }
>
> I wonder if there is some existing code that does something like this?
> (Specially I didn't find code that can append row/rows to a
> RecordBatchBuilder (either one row given an row index, or multiple rows
> given a list of row indices)
>


Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread Weston Pace
Congratulations

On Tue, Jun 13, 2023, 1:28 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Congratulations!
>
> On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido 
> wrote:
> >
> > Congratulations Jie!!!
> >
> > El lun, 12 jun 2023, 20:35, Matt Topol 
> escribió:
> >
> > > Congrats Jie!
> > >
> > > On Sun, Jun 11, 2023 at 9:20 AM Andrew Lamb 
> wrote:
> > >
> > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > Jie Wen to become a PMC member and we are pleased to announce
> > > > that Jie Wen has accepted.
> > > >
> > > > Congratulations and welcome!
> > > >
> > >
>


Re: [Python] Dataset scanner fragment skip options.

2023-06-12 Thread Weston Pace
> I would like to know if it is possible to skip the specific set of
batches,
> for example, the first 10 batches and read from the 11th Batch.

This sort of API does not exist today.  You can skip files by making a
smaller dataset with fewer files (and I think, with parquet, there may even
be a way to skip row groups by creating a fragment per row group with
ParquetFileFragment).  However, there is no existing datasets API for
skipping batches or rows.

> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?

fragment_scan_options is the spot for configuring format-specific scan
options.  For example, with parquet, you often don't need to bother with
this and can just use the defaults (I can't remember if nullptr is fine or
if you need to set this to FileFormat::default_fragment_scan_options but I
would hope it's ok to just use nullptr).

On the other hand, formats like CSV tend to need more configuration and
tuning.  For example, setting the delimiter, skipping some header rows,
etc.  Parquet is pretty self-describing and you would only need to use the
fragment_scan_options if, for example, you need decryption or custom
control over which columns are encoded as dictionary, etc.
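
A small sketch of both points (the file name, delimiter and options below
are made up): CSV settings travel through fragment_scan_options, and
skipping the first N batches can be done on the Python side because
to_batches() is just an iterator:

```
import itertools
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.compute as pc

fmt = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter=";"))
dataset = ds.dataset("data/table_csv", format=fmt)

scanner = dataset.scanner(
    filter=pc.field("a") != 3,
    batch_size=1000,
    fragment_scan_options=ds.CsvFragmentScanOptions(
        convert_options=csv.ConvertOptions(strings_can_be_null=True)
    ),
)

# no built-in "skip the first N batches", but the reader is a plain iterator
for batch in itertools.islice(scanner.to_batches(), 10, None):
    ...
```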

On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex  wrote:

> Hi Experts,
>
> I have been using dataset.scanner to read the data with specific filter
> conditions and batch_size of 1000 to read the data.
>
> ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
>
> I would like to know if it is possible to skip the specific set of batches,
> for example, the first 10 batches and read from the 11th Batch.
>
>
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?
>
> Really appreciate any input. thanks!
>
> Regards,
> Alex
>


Re: [ANNOUNCE] New Arrow committer: Mehmet Ozan Kabak

2023-06-08 Thread Weston Pace
Congratulations!

On Thu, Jun 8, 2023, 5:36 PM Mehmet Ozan Kabak  wrote:

> Thanks everybody. Looking to collaborate further!
>
> > On Jun 8, 2023, at 9:52 AM, Matt Topol  wrote:
> >
> > Congrats! Welcome Ozan!
> >
> > On Thu, Jun 8, 2023 at 8:53 AM Raúl Cumplido 
> wrote:
> >
> >> Congratulations and welcome!
> >>
> >> El jue, 8 jun 2023 a las 14:45, Metehan Yıldırım
> >> () escribió:
> >>>
> >>> Congrats Ozan!
> >>>
> >>> On Thu, Jun 8, 2023 at 1:09 PM Andrew Lamb 
> wrote:
> >>>
>  On behalf of the Arrow PMC, I'm happy to announce that  Mehmet Ozan
> >> Kabak
>  has accepted an invitation to become a committer on Apache
>  Arrow. Welcome, and thank you for your contributions!
> 
>  Andrew
> 
> >>
>
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> This implies that each canonical alternative layout would codify a
> primary layout as its "fallback."

Yes, that was part of my proposal:

>  * A new layout, if it is semantically equivalent to another, is
considered an alternative layout

Or, to phrase it another way.  If there is not a "fallback" then it is not
an alternative layout.  It's a brand new primary layout.  I'd expect this
to be quite rare.  I can't really even hypothesize any examples.  I think
the only truly atomic layouts are fixed-width, list, and struct.

> This seems reasonable but it opens
> up some cans of worms, such as how two components communicating
> through an Arrow interface would negotiate which layout is supported

Most APIs that I'm aware of already do this.  For example,
pyarrow.parquet.read_table has a "read_dictionary" property that can be
used to control whether or not a column is returned with the dictionary
encoding.  There is no way (that I'm aware of) to get a column in REE
encoding today without explicitly requesting it.  In fact, this could be as
simple as a boolean "use_advanced_features" flag although I would
discourage something so simplistic.  The point is that arrow-compatible
software should, by default, emit types that are supported by all arrow
implementations.

Of course, there is no way to enforce this, it's just a guideline / strong
recommendation on how software should behave if it wants to state "arrow
compatible" as a feature.

On Tue, Jun 6, 2023 at 3:33 PM Ian Cook  wrote:

> Thanks Weston. That all sounds reasonable to me.
>
> >  with the caveat that the primary layout must be emitted if the user
> does not specifically request the alternative layout
>
> This implies that each canonical alternative layout would codify a
> primary layout as its "fallback." This seems reasonable but it opens
> up some cans of worms, such as how two components communicating
> through an Arrow interface would negotiate which layout is supported.
> I suppose such details should be discussed in a separate thread, but I
> raise this here just to point out that it implies an expansion in the
> scope of what Arrow interfaces can do.
>
> On Tue, Jun 6, 2023 at 6:17 PM Weston Pace  wrote:
> >
> > From Micah:
> >
> > > This sounds reasonable to me but my main concern is, I'm not sure
> there is
> > > a great mechanism to enforce canonical layouts don't somehow become
> > default
> > > (or the only implementation).
> >
> > I'm not sure I understand.  Is the concern that an alternative layout is
> > eventually
> > used more and more by implementations until it is used more often than
> the
> > primary
> > layouts?  In that case I think that is ok and we can promote the
> alternative
> > to a primary layout.
> >
> > Or is the concern that some applications will only support the
> alternative
> > layouts and
> > not the primary layout?  In that case I would argue the application is
> not
> > "arrow compatible".
> > I don't know that we prevent or enforce this today either.  An author can
> > always falsely
> > claim they support Arrow even if they are using their own bespoke format.
> >
> > From Ian:
> >
> > > It seems to me that most projects that are implementing Arrow today
> > > are not aiming to provide complete coverage of Arrow; rather they are
> > > adopting Arrow because of its role as a standard and they are
> > > implementing only as much of the Arrow standard as they require to
> > > achieve some goal. I believe that such projects are important Arrow
> > > stakeholders, and I believe that this proposed notion of canonical
> > > alternative layouts will serve them well and will create efficiencies
> > > by standardizing implementations around a shared set of alternatives.
> > >
> > > However I think that the documentation for canonical alternative
> > > layouts should strongly encourage implementers to default to using the
> > > primary layouts defined in the core spec and only use alternative
> > > layouts in cases where the primary layouts do not meet their needs.
> >
> > I'd maybe take a slightly harsher stance.  I don't think an application
> > needs to
> > support all types.  For example, an Arrow-native string processing
> library
> > might
> > not worry about the integer types.  That would be fine.  I think it would
> > still
> > be fair to call it an "arrow compatible string processing library".
> >
> > However, an application must support primary layouts in ad

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
onical layouts don't somehow become
> default
> > (or the only implementation).
> >
> > Even for these new layouts, I think it might be worth rethinking binding
> a
> > layout into the schema versus having a different concept of encoding (and
> > changing some of the corresponding data structures).
> >
> >
> > On Mon, May 22, 2023 at 10:37 AM Weston Pace  wrote:
> >
> > > Trying to settle on one option is a fruitless endeavor.  Each type has pros
> > > and cons.  I would also predict that the largest existing usage of Arrow is
> > > shuttling data from one system to another.  The newly proposed format
> > > doesn't appear to have any significant advantage for that use case (if
> > > anything, the existing format is arguably better as it is more compact).
> > >
> > > I am very biased towards historical precedent and avoiding breaking
> > > changes.
> > >
> > > We have "canonical extension types", perhaps it is time for "canonical
> > > alternative layouts".  We could define it as such:
> > >
> > >  * There are one or more primary layouts
> > >    * Existing layouts are automatically considered primary layouts, even if
> > >      they wouldn't have been primary layouts initially (e.g. large list)
> > >  * A new layout, if it is semantically equivalent to another, is considered
> > >    an alternative layout
> > >  * An alternative layout still has the same requirements for adoption (two
> > >    implementations and a vote)
> > >    * An implementation should not feel pressured to rush and implement the
> > >      new layout.  It would be good if they contribute in the discussion and
> > >      consider the layout and vote if they feel it would be an acceptable
> > >      design.
> > >  * We can define and vote and approve as many canonical alternative layouts
> > >    as we want:
> > >    * A canonical alternative layout should, at a minimum, have some
> > >      reasonable justification, such as improved performance for algorithm X
> > >  * Arrow implementations MUST support the primary layouts
> > >  * An Arrow implementation MAY support a canonical alternative, however:
> > >    * An Arrow implementation MUST first support the primary layout
> > >    * An Arrow implementation MUST support conversion to/from the primary
> > >      and canonical layout
> > >    * An Arrow implementation's APIs MUST only provide data in the
> > >      alternative layout if it is explicitly asked for (e.g. schema inference
> > >      should prefer the primary layout); see the sketch after this list
> > >  * We can still vote for new primary layouts (e.g. promoting a canonical
> > >    alternative) but, in these votes we don't only consider the value
> > >    (e.g. performance) of the layout but also the interoperability.
> > >    In other words, a layout can only become a primary layout if there is
> > >    significant evidence that most implementations plan to adopt it.
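The opt-in and conversion requirements above can be illustrated with today's
pyarrow, again using large_string purely as a stand-in for a canonical
alternative layout (it is of course an existing layout; only the shape of the
behavior matters here):

    import pyarrow as pa

    # large_string stands in below for a canonical alternative layout.

    # Schema inference prefers the primary layout...
    inferred = pa.array(["arrow", "layouts"])
    assert inferred.type == pa.string()

    # ...the alternative is only produced when explicitly requested...
    explicit = pa.array(["arrow", "layouts"], type=pa.large_string())
    assert explicit.type == pa.large_string()

    # ...and conversion to/from the primary layout is available both ways.
    assert explicit.cast(pa.string()).equals(inferred)
    assert inferred.cast(pa.large_string()).equals(explicit)
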
> > >
> > > This lets us evolve support for new layouts more naturally.  We can
> > > generally assume that users will not, initially, be aware of these
> > > alternative layouts.  However, everything will just work.  They may start
> > > to see a performance penalty stemming from a lack of support for these
> > > layouts.  If this performance penalty becomes significant then they will
> > > discover it and become aware of the problem.  They can then ask whatever
> > > library they are using to add support for the alternative layout.  As
> > > enough users find a need for it then libraries will add support.
> > > Eventually, enough libraries will support it that we can adopt it as a
> > > primary layout.
> > >
> > > Also, it allows libraries to adopt alternative layouts more aggressively
> > > if they would like while still hopefully ensuring that we eventually all
> > > converge on the same implementation of the alternative layout.
> > >
> > > On Mon, May 22, 2023 at 9:35 AM Will Jones 
> > > wrote:
> > >
> > > > Hello Arrow devs,
> > > >
> > > > > I don't understand why we would start deprecating features in the Arrow
> > > > > format. Even starting this talk might already be a bad idea PR-wise.
> > > > >
> > > >
> > > > I
