Re: Register custom ExecNode factories

2022-10-14 Thread Benjamin Kietzman
I've opened a JIRA to track potential improvements to address this
https://issues.apache.org/jira/browse/ARROW-18063

On Wed, Sep 28, 2022 at 1:42 PM Li Jin  wrote:

> Thanks Yaron - I have figured something out. Currently I have created an
> internal c++ codebase that exposes the "Initialize" method and have an
> internal Python codebase that invokes it via Python/C++ bindings similar to
> how dataset does it.
>
> On Wed, Sep 28, 2022 at 1:02 PM Yaron Gvili  wrote:
>
> > I agree with Weston about dynamically loading a shared object with
> > initialization code for registering node factories. For custom node
> > factories, I think this loading would best be done from a separate Python
> > module, different than "_exec_plan.pyx", that the user would need to import
> > for triggering (once) the registration. This would avoid merging custom
> > code into "_exec_plan.pyx" and maintaining it. You would likely want to
> > code up files for your module that are analogous to
> > "python/pyarrow/includes/libarrow_dataset.pxd",
> > "python/pyarrow/_dataset.pxd", and "python/pyarrow/dataset.py". You would
> > need to modify the files "python/setup.py" and "python/CMakeLists.txt" in
> > order to build your module within PyArrow's build, or alternatively to roll
> > your own version of these files to build your Python module separately.
> > This is where you would add a build flag for pulling in C++ header files
> > for your Python module, under "python/pyarrow/include", and for making it.
> >
> >
> > Yaron.
> > 
> > From: Li Jin 
> > Sent: Wednesday, September 21, 2022 3:51 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: Register custom ExecNode factories
> >
> > Thanks Weston - I have not written a Python/C++ bridge before, so this is
> > also new to me; I am hoping to get some information from people that know
> > how to do this.
> >
> > I will leave this open for other people to offer help :) and will ask some
> > internal folks as well.
> >
> > Will circle back on this.
> >
> > On Tue, Sep 20, 2022 at 8:50 PM Weston Pace wrote:
> >
> > > I'm not great at this build stuff but I think the basic idea is that
> > > you will need to package your custom nodes into a shared object.
> > > You'll need to then somehow trigger that shared object to load from
> > > python.  This seems like a good place to invoke the initialize method.
> > >
> > > Currently pyarrow has to do this because the datasets module
> > > (libarrow_dataset.so) adds some custom nodes (scan node, dataset write
> > > node).  The datasets module defines the Initialize method.  This
> > > method is called in _exec_plan.pyx when the python module is loaded.
> > > I don't know cython well enough to know how exactly it triggers the
> > > datasets shared object to load.
> > >
> > > On Tue, Sep 20, 2022 at 11:01 AM Li Jin  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Recently I am working on adding a custom data source node to Acero and
> > > > was pointed to a few examples in the dataset code.
> > > >
> > > > If I understand this correctly, the registering of the dataset exec
> > > > node is currently happening when this is loaded:
> > > >
> > > > https://github.com/apache/arrow/blob/master/python/pyarrow/_exec_plan.pyx#L36
> > > >
> > > > I wonder, if I have a custom "Initialize" method that registers
> > > > additional ExecNodes, where is the right place to invoke such
> > > > initialization?
> > > > Eventually I want to execute my query via ibis-substrait and Acero
> > > > substrait consumer Python API.
> > > >
> > > > Thanks,
> > > > Li
> > >
> >
>
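The registration flow discussed in this thread — an `Initialize()` entry point, invoked once when the shared object is loaded, that adds factories to a process-wide registry — can be sketched in a self-contained form. The names below are illustrative stand-ins, not Acero's actual API; Arrow's real registry has the same shape but richer factory signatures.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for Acero's ExecPlan/ExecNode machinery.
struct ExecNode {
  virtual ~ExecNode() = default;
};
struct MyCustomSourceNode : ExecNode {};

using NodeFactory = std::function<std::unique_ptr<ExecNode>()>;

// Process-wide registry of named node factories.
std::map<std::string, NodeFactory>& factory_registry() {
  static std::map<std::string, NodeFactory> registry;
  return registry;
}

// The "Initialize" entry point discussed in the thread: invoked once when
// the shared object carrying the custom nodes is loaded (e.g. from a Python
// extension module's import), and harmless if invoked again.
void Initialize() {
  static bool done = false;
  if (done) return;
  factory_registry().emplace("my_custom_source", [] {
    return std::make_unique<MyCustomSourceNode>();
  });
  done = true;
}

// Plan construction then looks factories up by name.
std::unique_ptr<ExecNode> MakeNode(const std::string& name) {
  auto it = factory_registry().find(name);
  if (it == factory_registry().end()) {
    throw std::runtime_error("unknown exec node: " + name);
  }
  return it->second();
}
```

A Python binding would call `Initialize()` once at import time, which is effectively what `_exec_plan.pyx` does for the dataset nodes.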


[DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Benjamin Kietzman
Hello all,

As previously discussed on this list [1], an UmbraDB/DuckDB/Velox-compatible
"string view" type could bring several performance benefits to access and
authoring of string data in the arrow format [2]. Additionally, better
interoperability could be established with engines already using this format.

PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation and to
the IPC format. For the purposes of IPC, raw pointers are not used. Instead,
each view contains a pair of 32-bit unsigned integers which encode the index
of a character buffer (string view arrays may consist of a variable number of
such buffers) and the offset of a view's data within that buffer,
respectively. Benefits of this substitution include:
- This makes explicit the guarantee that the lifetime of all character data
  is equal to that of the array which views it, which is critical for
  confident consumption across an interface boundary.
- As with other types in the arrow format, such arrays are serializable and
  venue agnostic; directly usable in shared memory without modification.
- Indices and offsets are easily validated.

Accessing the data requires some trivial pointer arithmetic, but in
benchmarking this had negligible impact on sequential access and only minor
impact on random access.
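The view encoding described above can be sketched in a self-contained form. This follows the Umbra-style 16-byte layout the proposal references (strings of 12 bytes or fewer inlined, longer strings carried as a 4-byte prefix plus buffer index and offset); the type and function names here are illustrative, not the C++ implementation's.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Long-string reference: which character buffer, and where within it.
struct Ref {
  char prefix[4];        // first four bytes, enabling fast comparisons
  int32_t buffer_index;  // index of a character buffer
  int32_t offset;        // offset of the view's data within that buffer
};

// One fixed-size 16-byte view: strings of 12 bytes or fewer are inlined,
// longer strings are referenced through (buffer_index, offset).
struct StringView {
  int32_t length;
  union {
    char inlined[12];
    Ref ref;
  };
};
static_assert(sizeof(StringView) == 16, "views are fixed 16-byte structs");

// Resolve a view against the array's character buffers.
std::string GetString(const StringView& v, const std::vector<std::string>& buffers) {
  if (v.length <= 12) return std::string(v.inlined, v.length);
  return buffers[v.ref.buffer_index].substr(v.ref.offset, v.length);
}

// Build a view, appending long strings to a single shared character buffer.
StringView MakeView(const std::string& s, std::vector<std::string>& buffers) {
  StringView v{};
  v.length = static_cast<int32_t>(s.size());
  if (s.size() <= 12) {
    std::memcpy(v.inlined, s.data(), s.size());
  } else {
    std::memcpy(v.ref.prefix, s.data(), 4);
    v.ref.buffer_index = 0;  // sketch: one character buffer
    v.ref.offset = static_cast<int32_t>(buffers[0].size());
    buffers[0] += s;
  }
  return v;
}
```

The "trivial pointer arithmetic" mentioned above is the lookup in `GetString`: one branch on the length, then an index into the buffer list.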

In the C++ implementation, raw pointer string views are supported as an
extended case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
Branching on this access-pattern bit at the data type level has negligible
impact on performance since the branch resides outside any hot loops. Utility
functions are provided for efficient (potentially in-place) conversion
between raw pointer and index/offset views. For example, the C++
implementation could zero-copy a raw pointer array from Velox, filter it,
then convert to index/offset for serialization. Other implementations may
choose to accommodate or eschew raw pointer views as their communities
direct.
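The raw pointer to index/offset conversion can be sketched as follows: locate the character buffer that contains the pointed-to range, then express the view as buffer index + offset. This is a simplified illustration with assumed names; in practice both forms occupy the same 16 bytes, which is what permits in-place conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Raw-pointer form (as used by Velox/DuckDB/Umbra) vs the IPC-friendly
// form, kept as distinct types here for clarity.
struct RawView   { const char* data; int32_t length; };
struct IndexView { int32_t buffer_index; int32_t offset; int32_t length; };

// Convert a raw-pointer view by finding which character buffer contains the
// pointed-to range. Integer comparisons via uintptr_t avoid comparing
// pointers into unrelated objects.
IndexView ToIndexOffset(const RawView& raw, const std::vector<std::string>& buffers) {
  auto p = reinterpret_cast<uintptr_t>(raw.data);
  for (int32_t i = 0; i < static_cast<int32_t>(buffers.size()); ++i) {
    auto begin = reinterpret_cast<uintptr_t>(buffers[i].data());
    auto end = begin + buffers[i].size();
    if (p >= begin && p + raw.length <= end) {
      return {i, static_cast<int32_t>(p - begin), raw.length};
    }
  }
  return {-1, -1, raw.length};  // pointer does not view any known buffer
}
```

The reverse direction (index/offset to raw pointer) is just `buffers[buffer_index].data() + offset`, which is why validation of raw-pointer arrays reduces to the containment check above.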

Where desired, in a rigorously controlled context this still enables
construction and safe consumption of string view arrays which reference
memory not directly bound to the lifetime of the array. I'm not sure when or
if we would find it useful to have arrays like this; I do not introduce any
in [3]. I mention this possibility to highlight that if benchmarking
demonstrates that such an approach brings a significant performance benefit
to some operation, the only barrier to its adoption would be code review.
Likewise, if more intensive benchmarking determines that raw pointer views
critically outperform index/offset views for real-world analytics tasks,
prioritizing raw pointer string views for usage within the C++ implementation
will be straightforward.

See also the proposal to Velox that their string view vector be refactored
in a similar vein [4].

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
[3] https://github.com/apache/arrow/pull/35628
[4] https://github.com/facebookincubator/velox/discussions/4362


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-17 Thread Benjamin Kietzman
@Jacob
> You mention benchmarks multiple times, are these results published
somewhere?

I benchmarked the performance of raw pointer vs index/offset views in my PR
to Velox; I do intend to port those benchmarks to my arrow PR but I haven't
gotten there yet. Furthermore, it seemed less urgent to me since coexistence
of the two types in the C++ implementation defers the question of how
aggressively one should be preferred over the other.

@Dewey
> I don't see the C Data interface in the PR

I have not addressed the C ABI in this PR. As you mention, it may be useful
to transmit arrays with raw pointer views between implementations which
allow them. I can address this in a follow-up PR.

@Will
> If I understand correctly, multiple arrays can reference the same buffers
> in memory, but once they are written to IPC their data buffers will be
> duplicated. Is that right?
The buffers will be duplicated. If buffer duplication becomes a concern, I'd
prefer to handle that in the IPC writer. Then buffers which are duplicated
could be detected by checking pointer identity and written only once.
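The pointer-identity deduplication suggested here can be sketched as follows (illustrative only, not the actual IPC writer): each distinct buffer, identified by its address, is written once, and later references reuse its recorded offset.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Write-side deduplication sketch. The sink remembers which buffers it has
// already written, keyed by buffer address.
struct DedupingSink {
  std::string bytes;                                 // stands in for the IPC stream
  std::unordered_map<const void*, int64_t> written;  // buffer address -> offset

  int64_t WriteBuffer(const std::string* buffer) {
    auto it = written.find(buffer);
    if (it != written.end()) return it->second;  // duplicate: reference, don't rewrite
    int64_t offset = static_cast<int64_t>(bytes.size());
    bytes += *buffer;
    written.emplace(buffer, offset);
    return offset;
  }
};
```

Since the flatbuffers metadata identifies buffers by offset and length, two arrays referencing the same buffer would simply record the same offset.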


On Wed, May 17, 2023 at 12:07 AM Will Jones  wrote:

> Hello Ben,
>
> Thanks for your work on this. I think this will be an excellent addition to
> the format.
>
> If I understand correctly, multiple arrays can reference the same buffers
> in memory, but once they are written to IPC their data buffers will be
> duplicated. Is that right?
>
> Dictionary types have a special message so they can be reused across
> batches and even fields. Did we consider adding a similar message for
> string view buffers?
>
> One relevant use case I'm curious about is substring extraction. For
> example, if I have a column of URIs and I create columns where I've
> extracted substrings like the hostname, path, and a list column of query
> parameters, I'd like for those latter columns to be views into the URI
> buffers, rather than full copies.
>
> However, I've never touched the IPC read code paths, so it's quite possible
> I'm overlooking something obvious.
>
> Best,
>
> Will Jones
>
>
> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
>  wrote:
>
> > Very cool!
> >
> > In addition to performance mentioned above, I could see this being
> > useful for the R bindings - we already have a global string pool and a
> > mechanism for keeping a vector of them alive.
> >
> > I don't see the C Data interface in the PR although I may have missed
> > it - is that a part of the proposal? It seems like it would be
> > possible to use raw pointers as long as they can be guaranteed to be
> > valid until the release callback is called?
> >
> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
> >  wrote:
> > >
> > > Hello Everyone,
> > > I think keeping interoperability with the large ecosystem is a very
> > > important goal for arrow so I am overall in favor of this proposal!
> > >
> > > You mention benchmarks multiple times, are these results published
> > > somewhere?
> > >
> > > Thanks
> > >
> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
> bengil...@gmail.com>
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> > > > compatible
> > > > "string view" type could bring several performance benefits to access
> > and
> > > > authoring of string data in the arrow format [2]. Additionally better
> > > > interoperability with engines already using this format could be
> > > > established.
> > > >
> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++
> implementation
> > and
> > > > to
> > > > the IPC format. For the purposes of IPC raw pointers are not used.
> > Instead,
> > > > each view contains a pair of 32 bit unsigned integers which encode
> the
> > > > index of
> > > > a character buffer (string view arrays may consist of a variable
> > number of
> > > > such buffers) and the offset of a view's data within that buffer
> > > > respectively.
> > > > Benefits of this substitution include:
> > > > - This makes explicit the guarantee that lifetime of all character
> > data is
> > > > equal
> > > >   to that of the array which views it, which is critical for
> confident
> > > >   consumption across an interface boundary.
> > > > - As with other types in the arrow format, such arrays are
> > serializa

Re: [VOTE][Format] Add experimental ArrowDeviceArray to C-Data API

2023-05-26 Thread Benjamin Kietzman
+1, thanks for all your work on this!

On Fri, May 26, 2023 at 11:09 AM Matt Topol  wrote:

> That makes one binding and one non-binding +1; as 3 binding votes are
> necessary, I'm sending this to hopefully request more eyes here and get some
> more votes.
>
> Thanks all!
>
> On Thu, May 25, 2023 at 11:38 AM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > +1 for me.
> >
> > The C structs are clean and leave good room for extension.
> >
> > --
> > Felipe
> >
> > On Thu, May 25, 2023 at 12:04 PM David Li  wrote:
> >
> > > +1 for me.
> > >
> > > (Heads up: on the PR, there was some discussion since the last email
> and
> > > the meaning of 'experimental' was clarified.)
> > >
> > > On Tue, May 23, 2023, at 16:56, Matt Topol wrote:
> > > > To clarify:
> > > >
> > > >> Depends on what we're voting on?
> > > >
> > > > Voting on adopting the spec and adding it (while still leaving it
> > labeled
> > > > as "experimental" in the docs) to the format.
> > > >
> > > > --Matt
> > > >
> > > > On Tue, May 23, 2023 at 3:29 PM Matthew Topol
> > > 
> > > > wrote:
> > > >
> > > >> @Antoine: I've updated the PR with a prose description of the C
> Device
> > > Data
> > > >> interface. Sorry for the lack of that in the first place.
> > > >>
> > > >> --Matt
> > > >>
> > > >> On Tue, May 23, 2023 at 10:34 AM Antoine Pitrou wrote:
> > > >>
> > > >> >
> > > >> > Also, I forgot to say, but thanks a lot for doing this! We can
> hope
> > > this
> > > >> > will drastically improve interoperability between non-CPU data
> > > >> > frameworks and libraries.
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > Antoine.
> > > >> >
> > > >> >
> > > >> > Le 23/05/2023 à 16:32, Antoine Pitrou a écrit :
> > > >> > >
> > > >> > > Depends on what we're voting on?
> > > >> > >
> > > >> > > The C declarations seem fine to me (I'm a bit lukewarm on the
> > > reserved
> > > >> > > bits, but I understand the motivation), however I've posted
> > > comments as
> > > >> > > to how to document the interface. The current PR entirely lacks
> a
> > > prose
> > > >> > > description of the C Device Data Interface.
> > > >> > >
> > > >> > > Regards
> > > >> > >
> > > >> > > Antoine.
> > > >> > >
> > > >> > >
> > > >> > > Le 22/05/2023 à 18:02, Matt Topol a écrit :
> > > >> > >> Hello,
> > > >> > >>
> > > >> > >> Now that there's a rough consensus and a toy example POC[1], I
> > > would
> > > >> > like
> > > >> > >> to propose an official enhancement to the Arrow C-Data API
> > > >> > specification as
> > > >> > >> described in the PR[2]. The new
> > > >> ArrowDeviceArray/ArrowDeviceArrayStream
> > > >> > >> structs would be considered "experimental" and the
> documentation
> > > would
> > > >> > >> label them as such for the time being.
> > > >> > >>
> > > >> > >> Please comment, ask questions, and look at the PR and toy
> example
> > > POC
> > > >> as
> > > >> > >> needed.
> > > >> > >>
> > > >> > >> The vote will be open for at least 72 hours.
> > > >> > >>
> > > >> > >> [ ] +1 Add this to the C-Data API
> > > >> > >> [ ] +0
> > > >> > >> [ ] -1 Do not add this to the C-Data API because...
> > > >> > >>
> > > >> > >> Thank you very much everyone!
> > > >> > >> -- Matt
> > > >> > >>
> > > >> > >> [1]: https://github.com/zeroshade/arrow-non-cpu
> > > >> > >> [2]: https://github.com/apache/arrow/pull/34972
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
Hello again all,

The PR [1] to add string view to the format and the C++ implementation is
hovering around passing CI and has been undrafted. Furthermore, there is
now also a PR [2] to add string view to the Go implementation. Code review
is underway for each PR and I'd like to move toward a vote for acceptance;
are there any other preliminaries which I've neglected?

To reiterate the answers to some past questions:
- Benchmarks are added in the C++ PR [1] to demonstrate the performance of
  conversion between the various string formats. In addition, there are
  some benchmarks which demonstrate the performance gains available with
  the new format [3].
- Adding string view to the C ABI is a natural follow up, but should be
  handled independently. An issue has been added to track that
  enhancement [4].

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/35628
[2] https://github.com/apache/arrow/pull/35769
[3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
[4] https://github.com/apache/arrow/issues/36099

On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman 
wrote:

> @Jacob
> > You mention benchmarks multiple times, are these results published
> somewhere?
>
> I benchmarked the performance of raw pointer vs index offset views in my
> PR to velox,
> I do intend to port them to my arrow PR but I haven't gotten there yet.
> Furthermore, it
> seemed less urgent to me since coexistence of the two types in the c++
> implementation
> defers the question of how aggressively one should be preferred over the
> other.
>
> @Dewey
> > I don't see the C Data interface in the PR
>
> I have not addressed the C ABI in this PR. As you mention, it may be
> useful to transmit
> arrays with raw pointer views between implementations which allow them. I
> can address
> this in a follow up PR.
>
> @Will
> > If I understand correctly, multiple arrays can reference the same buffers
> > in memory, but once they are written to IPC their data buffers will be
> > duplicated. Is that right?
> The buffers will be duplicated. If buffer duplication is becomes a
> concern, I'd prefer to handle
> that in the ipc writer. Then buffers which are duplicated could be
> detected by checking
> pointer identity and written only once.
>
>
> On Wed, May 17, 2023 at 12:07 AM Will Jones 
> wrote:
>
>> Hello Ben,
>>
>> Thanks for your work on this. I think this will be an excellent addition
>> to
>> the format.
>>
>> If I understand correctly, multiple arrays can reference the same buffers
>> in memory, but once they are written to IPC their data buffers will be
>> duplicated. Is that right?
>>
>> Dictionary types have a special message so they can be reused across
>> batches and even fields. Did we consider adding a similar message for
>> string view buffers?
>>
>> One relevant use case I'm curious about is substring extraction. For
>> example, if I have a column of URIs and I create columns where I've
>> extracted substrings like the hostname, path, and a list column of query
>> parameters, I'd like for those latter columns to be views into the URI
>> buffers, rather than full copies.
>>
>> However, I've never touched the IPC read code paths, so it's quite
>> possible
>> I'm overlooking something obvious.
>>
>> Best,
>>
>> Will Jones
>>
>>
>> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
>>  wrote:
>>
>> > Very cool!
>> >
>> > In addition to performance mentioned above, I could see this being
>> > useful for the R bindings - we already have a global string pool and a
>> > mechanism for keeping a vector of them alive.
>> >
>> > I don't see the C Data interface in the PR although I may have missed
>> > it - is that a part of the proposal? It seems like it would be
>> > possible to use raw pointers as long as they can be guaranteed to be
>> > valid until the release callback is called?
>> >
>> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
>> >  wrote:
>> > >
>> > > Hello Everyone,
>> > > I think keeping interoperability with the large ecosystem is a very
>> > > important goal for arrow so I am overall in favor of this proposal!
>> > >
>> > > You mention benchmarks multiple times, are these results published
>> > > somewhere?
>> > >
>> > > Thanks
>> > >
>> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
>> bengil...@gmail.com>
>> > > wrote:
>> > >

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
I've added https://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write.
I don't think it would require modification of the IPC format.

Ben

On Thu, Jun 15, 2023 at 1:30 PM Matt Topol  wrote:

> Based on my understanding, in theory a buffer *could* be shared within a
> batch since the flatbuffers message just uses an offset and length to
> identify the buffers.
>
> That said, I don't believe any current implementation actually does this or
> takes advantage of this in any meaningful way.
>
> --Matt
>
> On Thu, Jun 15, 2023 at 1:00 PM Will Jones 
> wrote:
>
> > Hi Ben,
> >
> > It's exciting to see this move along.
> >
> > The buffers will be duplicated. If buffer duplication is becomes a
> concern,
> > > I'd prefer to handle
> > > that in the ipc writer. Then buffers which are duplicated could be
> > detected
> > > by checking
> > > pointer identity and written only once.
> >
> >
> > Question: to be able to write buffer only once and reference in multiple
> > arrays, does that require a change to the IPC format? Or is sharing
> buffers
> > within the same batch already allowed in the IPC format?
> >
> > Best,
> >
> > Will Jones
> >
> > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman 
> > wrote:
> >
> > > Hello again all,
> > >
> > > The PR [1] to add string view to the format and the C++ implementation
> is
> > > hovering around passing CI and has been undrafted. Furthermore, there
> is
> > > now also a PR [2] to add string view to the Go implementation. Code
> > review
> > > is underway for each PR and I'd like to move toward a vote for
> > acceptance-
> > > are there any other preliminaries which I've neglected?
> > >
> > > To reiterate the answers to some past questions:
> > > - Benchmarks are added in the C++ PR [1] to demonstrate the performance
> > of
> > >   conversion between the various string formats. In addition, there are
> > >   some benchmarks which demonstrate the performance gains available
> with
> > >   the new format [3].
> > > - Adding string view to the C ABI is a natural follow up, but should be
> > >   handled independently. An issue has been added to track that
> > >   enhancement [4].
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://github.com/apache/arrow/pull/35628
> > > [2] https://github.com/apache/arrow/pull/35769
> > > [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > [4] https://github.com/apache/arrow/issues/36099
> > >
> > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <
> bengil...@gmail.com>
> > > wrote:
> > >
> > > > @Jacob
> > > > > You mention benchmarks multiple times, are these results published
> > > > somewhere?
> > > >
> > > > I benchmarked the performance of raw pointer vs index offset views in
> > my
> > > > PR to velox,
> > > > I do intend to port them to my arrow PR but I haven't gotten there
> yet.
> > > > Furthermore, it
> > > > seemed less urgent to me since coexistence of the two types in the
> c++
> > > > implementation
> > > > defers the question of how aggressively one should be preferred over
> > the
> > > > other.
> > > >
> > > > @Dewey
> > > > > I don't see the C Data interface in the PR
> > > >
> > > > I have not addressed the C ABI in this PR. As you mention, it may be
> > > > useful to transmit
> > > > arrays with raw pointer views between implementations which allow
> > them. I
> > > > can address
> > > > this in a follow up PR.
> > > >
> > > > @Will
> > > > > If I understand correctly, multiple arrays can reference the same
> > > buffers
> > > > > in memory, but once they are written to IPC their data buffers will
> > be
> > > > > duplicated. Is that right?
> > > > The buffers will be duplicated. If buffer duplication is becomes a
> > > > concern, I'd prefer to handle
> > > > that in the ipc writer. Then buffers which are duplicated could be
> > > > detected by checking
> > > > pointer identity and written only once.
> > > >
> > > >
> > > > On Wed, May 17, 2023 at 12:07 AM Will Jones  >

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-19 Thread Benjamin Kietzman
Hi Gang,

I'm not sure what you mean, sorry if my answers are off base:

Parquet's ByteArray will be unaffected by the addition of the string view
type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays during
serialization to parquet [1].

If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast as
encoding of equivalent arrow::Type::STRING, that's something I haven't
benchmarked so I can't answer definitively. I would expect it to be faster
than first converting STRING_VIEW->STRING then encoding to parquet; direct
encoding avoids allocating and populating temporary buffers. Of course this
only applies to cases where you need to encode an array of STRING_VIEW to
parquet; encoding of STRING to parquet will be unaffected.
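To illustrate why direct encoding avoids temporary buffers: each view can be appended straight into a Parquet-style plain byte-array encoding (4-byte length prefix followed by the bytes), read directly out of its character buffer, without first materializing a contiguous offsets+data array. The names below are illustrative, not parquet-cpp's.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// An index/offset view into a set of character buffers.
struct View { int32_t buffer_index; int32_t offset; int32_t length; };

// Encode views directly into a length-prefixed byte stream. No intermediate
// STRING array is built, so no temporary buffers are allocated or populated.
std::string EncodePlain(const std::vector<View>& views,
                        const std::vector<std::string>& buffers) {
  std::string out;
  for (const View& v : views) {
    char len[4];
    std::memcpy(len, &v.length, sizeof(len));  // native byte order, as a sketch
    out.append(len, sizeof(len));
    out.append(buffers[v.buffer_index], v.offset, v.length);
  }
  return out;
}
```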

Sincerely,
Ben

[1]
https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179

On Thu, Jun 15, 2023 at 10:34 PM Gang Wu  wrote:

> Hi Ben,
>
> The posted benchmark [1] looks pretty good to me. However, I want to
> raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> uses a customized parquet::ByteArray type [2] for string/binary, I would
> expect some regression of conversions between parquet reader/writer
> and the proposed string view array, especially when some strings use
> short form and others use long form.
>
> [1] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> [2]
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
>
> Best,
> Gang
>
> On Fri, Jun 16, 2023 at 3:58 AM Will Jones 
> wrote:
>
> > Cool. Thanks for doing that!
> >
> > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman 
> > wrote:
> >
> > > I've added https://github.com/apache/arrow/issues/36112 to track
> > > deduplication of buffers on write.
> > > I don't think it would require modification of the IPC format.
> > >
> > > Ben
> > >
> > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol 
> > wrote:
> > >
> > > > Based on my understanding, in theory a buffer *could* be shared
> within
> > a
> > > > batch since the flatbuffers message just uses an offset and length to
> > > > identify the buffers.
> > > >
> > > > That said, I don't believe any current implementation actually does
> > this
> > > or
> > > > takes advantage of this in any meaningful way.
> > > >
> > > > --Matt
> > > >
> > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones 
> > > > wrote:
> > > >
> > > > > Hi Ben,
> > > > >
> > > > > It's exciting to see this move along.
> > > > >
> > > > > The buffers will be duplicated. If buffer duplication is becomes a
> > > > concern,
> > > > > > I'd prefer to handle
> > > > > > that in the ipc writer. Then buffers which are duplicated could
> be
> > > > > detected
> > > > > > by checking
> > > > > > pointer identity and written only once.
> > > > >
> > > > >
> > > > > Question: to be able to write buffer only once and reference in
> > > multiple
> > > > > arrays, does that require a change to the IPC format? Or is sharing
> > > > buffers
> > > > > within the same batch already allowed in the IPC format?
> > > > >
> > > > > Best,
> > > > >
> > > > > Will Jones
> > > > >
> > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
> > bengil...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hello again all,
> > > > > >
> > > > > > The PR [1] to add string view to the format and the C++
> > > implementation
> > > > is
> > > > > > hovering around passing CI and has been undrafted. Furthermore,
> > there
> > > > is
> > > > > > now also a PR [2] to add string view to the Go implementation.
> Code
> > > > > review
> > > > > > is underway for each PR and I'd like to move toward a vote for
> > > > > acceptance-
> > > > > > are there any other preliminaries which I've neglected?
> > > > > >
> > > > > > To reiterate the answer

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-21 Thread Benjamin Kietzman
> Ben, at one point there was some discussion that this might be a c-data
> only type.  However, I believe that was based on the raw pointers
> representation.  What you've proposed here, if I understand correctly, is
> an index + offsets representation and it is suitable for IPC correct?
> (e.g. I see that you have changes and examples in the IPC reader/writer)

Yes, what I'm proposing here is an addition to the IPC format which
necessarily does not reference raw pointers. This is also what's implemented
in the corresponding Go PR [1].

Since other engines which implement this data type *do* use raw pointers,
the C++ implementation can accept raw pointer string view arrays for
purposes of validation and conversion to the IPC-compatible format.

[1] https://github.com/apache/arrow/pull/35769

On Tue, Jun 20, 2023 at 11:33 AM Weston Pace  wrote:

> Before I say anything else I'll say that I am in favor of this new layout.
> There is some existing literature on the idea (e.g. umbra) and your
> benchmarks show some nice improvements.
>
> Compared to some of the other layouts we've discussed recently (REE, list
> view) I do think this layout is more unique and fundamentally different.
> Perhaps most fundamentally different:
>
>  * This is the first layout where the number of buffers depends on the data
> and not the schema.  I think this is the most architecturally significant
> fact.  It does require a (backwards compatible) change to the IPC format
> itself, beyond just adding new type codes.  It also poses challenges in
> places where we've assumed there will be at most 3 buffers (e.g. in
> ArraySpan, though, as you have shown, we can work around this using a raw
> pointers representation internally in those spots).
>
> I think you've done some great work to integrate this well with Arrow-C++
> and I'm convinced it can work.
>
> I would be interested in hearing some input from the Rust community.
>
> Ben, at one point there was some discussion that this might be a c-data
> only type.  However, I believe that was based on the raw pointers
> representation.  What you've proposed here, if I understand correctly, is
> an index + offsets representation and it is suitable for IPC correct?
> (e.g. I see that you have changes and examples in the IPC reader/writer)
>
> On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman 
> wrote:
>
> > Hi Gang,
> >
> > I'm not sure what you mean, sorry if my answers are off base:
> >
> > Parquet's ByteArray will be unaffected by the addition of the string view
> > type;
> > all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
> > with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> > during serialization to parquet [1].
> >
> > If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
> > as encoding of equivalent arrow::Type::STRING, that's something I haven't
> > benchmarked so I can't answer definitively. I would expect it to be
> faster
> > than
> > first converting STRING_VIEW->STRING then encoding to parquet; direct
> > encoding avoids allocating and populating temporary buffers. Of course
> this
> > only applies to cases where you need to encode an array of STRING_VIEW to
> > parquet- encoding of STRING to parquet will be unaffected.
> >
> > Sincerely,
> > Ben
> >
> > [1]
> >
> >
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
> >
> > On Thu, Jun 15, 2023 at 10:34 PM Gang Wu  wrote:
> >
> > > Hi Ben,
> > >
> > > The posted benchmark [1] looks pretty good to me. However, I want to
> > > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> > > uses a customized parquet::ByteArray type [2] for string/binary, I
> would
> > > expect some regression of conversions between parquet reader/writer
> > > and the proposed string view array, especially when some strings use
> > > short form and others use long form.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones 
> > > wrote:
> > >
> > > > Cool. Thanks for doing that!
> > > >
> > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman  >
> > > > wrote:

[VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-06-28 Thread Benjamin Kietzman
Hello,

I'd like to propose adding Utf8View arrays to the arrow format.
Previous discussion in [1], columnar format description in [2],
flatbuffers changes in [3].

There are implementations available in both C++[4] and Go[5] which
exercise the new type over IPC. Utf8View format demonstrates[6]
significant performance benefits over Utf8 in common tasks.

The vote will be open for at least 72 hours.

[ ] +1 add the proposed Utf8View type to the Apache Arrow format
[ ] -1 do not add the proposed Utf8View type to the Apache Arrow format
because...

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
[2]
https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
[3]
https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
[4] https://github.com/apache/arrow/pull/35628
[5] https://github.com/apache/arrow/pull/35769
[6] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-06 Thread Benjamin Kietzman
> can be done very efficiently by first verifying the entire buffer, and then
> verifying the offsets correspond to the start of a UTF-8 codepoint. This
> same approach would not be possible for StringView, which would need to
> verify individual values and would therefore be significantly more
> expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
> validation is perhaps more important for Rust than for other languages.
>
> I presume that StringView will behave similarly to dictionaries in that
> the selection kernels will not recompute the underlying value buffers. I
> think this is fine, but it is perhaps worth noting this has caused
> confusion in the past, as people somewhat reasonably expect an array
> post-selection to have memory usage reflecting the smaller selection. This
> is then especially noticeable if the data is written out to IPC, and still
> contains data that was supposedly filtered out. My 2 cents is that explicit
> selection vectors are a less surprising way to defer selection than baking
> it into the array, but I also don't have any workloads where this is the
> major bottleneck so can't speak authoritatively here.
>
> Which leads on to my major concern with this proposal, that it adds
> complexity and cognitive load to the specification and implementations,
> whilst not meaningfully improving the performance of the operators that I
> commonly encounter as performance bottlenecks, which are multi-column sorts
> and aggregations, or the expensive string operations such as matching or
> parsing. If we didn't already have a string representation I would be more
> onboard, but as it stands I'm definitely on the fence, especially given
> selection performance can be improved in less intrusive ways using
> dictionaries or selection vectors.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 02/07/2023 11:46, Andrew Lamb wrote:
>
> > * This is the first layout where the number of buffers depends on the data
> > and not the schema. I think this is the most architecturally significant
> > fact.
>
> I have spent some time reading the initial proposal -- thank you for
> that. I now understand what Weston was saying about the "variable numbers
> of buffers". I wonder if you considered restricting such arrays to a single
> buffer (so as to make them more similar to other arrow array types that
> have a fixed number of buffers)?
>
> On Tue, Jun 20, 2023 at 11:33 AM Weston Pace wrote:
>
> Before I say anything else I'll say that I am in favor of this new layout.
> There is some existing literature on the idea (e.g. Umbra) and your
> benchmarks show some nice improvements. Compared to some of the other
> layouts we've discussed recently (REE, list view) I do think this layout is
> more unique and fundamentally different. Perhaps most fundamentally
> different:
>
> * This is the first layout where the number of buffers depends on the data
> and not the schema. I think this is the most architecturally significant
> fact. It does require a (backwards compatible) change to the IPC format
> itself, beyond just adding new type codes. It also poses challenges in
> places where we've assumed there will be at most 3 buffers (e.g. in
> ArraySpan, though, as you have shown, we can work around this using a raw
> pointers representation internally in those spots).
>
> I think you've done some great work to integrate this well with Arrow-C++
> and I'm convinced it can work. I would be interested in hearing some input
> from the Rust community.
>
> Ben, at one point there was some discussion that this might be a c-data
> only type. However, I believe that was based on the raw pointers
> representation. What you've proposed here, if I understand correctly, is an
> index + offsets representation and it is suitable for IPC, correct? (e.g. I
> see that you have changes and examples in the IPC reader/writer)
>
> On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman wrote:
>
> Hi Gang, I'm not sure what you mean, sorry if my answers are off base:
>
> Parquet's ByteArray will be unaffected by the addition of the string view
> type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING,
> and with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> during serialization to parquet [1].
>
> If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
> as encoding of equivalent arrow::Type::STRING, that's something I haven't
> benchmarked so I can't answer definitively. I would expect it to be faster
> than first converting STRING_VIEW->STRING then enc

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Benjamin Kietzman
Canonical alternative layouts sounds like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate if and which alternative layout is being
used
  - or do we add AltSchema to wrap a schema and indicate which of its
fields have alternate layouts
  - ...
- do we extend RecordBatch to support canonical alternative layouts
  - or do we add AltRecordBatch for that purpose (which iiuc would
complicate dictionary batches containing any column of an alternate layout)
  - ...

To add context, one of the reasons we could not just use extension types
for Utf8View is that these are required to be backed by a known layout, and
no primary layout in the format has a variable number of buffers. In order
to accommodate Utf8View as an alternative layout, the minimal change which
I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin  wrote:

> Thanks Neal and Weston!
>
> I prepared a diagram to solidify my own understanding of the context,
> which can be found at [1].
>
> I think alternative layouts sounds like a nice first approach to allowing
> new layouts that can be supported lazily (implemented when it is
> beneficial) by various implementations of the Arrow Columnar Format. But, I
> do think that it's just a (practical) formalization of saying what layouts
> are required and which ones are optional.
>
> From the making of the diagram, I also decided that the discussion isn't
> limited to performance, since there are several reasons new physical
> layouts may be proposed (or, at least, there are many aspects of
> performance). Even if it's not "canonical alternative layouts," I think it
> is important that there be some process for developers that use Arrow to
> propose extensions to the columnar format without having to prove out the
> benefits for libraries that use a different tech stack (e.g. rust vs C++ vs
> go).
>
>
> [1]:
> https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing
>
>
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> --- Original Message ---
> On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
>  wrote:
>
>
> > I am in favor of this proposal. IMO the Arrow project is the right place
> to
> > standardize both the interoperability and operability of columnar data
> > layouts. Data engines are a core component of the Arrow ecosystem and the
> > project should be able to grow with these data engines as they converge
> on
> > new layouts. Since columnar data is ubiquitous in analytical workloads,
> we
> > are seeing a natural progression into optimizing those workloads. This
> > includes new lossless compression schemes for columnar data that allows
> > engines to operate directly on the compressed data (e.g. RLE). If we
> can't
> > reliably support the growing needs of the broader data engine ecosystem
> in
> > a timely manner, then I also fear Arrow might lose relevancy over time.
> >
>
> > On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:
> >
>
> > > Thank you Weston for proposing this solution and Neal for describing
> > > its context and implications. I agree with the other replies here—this
> > > seems like an elegant solution to a growing need that could, if left
> > > unaddressed, increase the fragmentation of the ecosystem and reduce
> > > the centrality of the Arrow format.
> > >
>
> > > Greater diversity of layouts is happening. Whether it happens inside
> > > of Arrow or outside of Arrow is up to us. I think we all would like to
> > > see it happen inside of Arrow. This proposal allows for that, while
> > > striking a balance as Raphael describes.
> > >
>
> > > However I think there is still some ambiguity about exactly how an
> > > Arrow implementation that is consuming/producing data would negotiate
> > > with an Arrow implementation or other component that is
> > > producing/consuming data to determine whether an alternative lay

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-08-03 Thread Benjamin Kietzman
Hello all,

I think that guarantees on masked values are worthwhile to define for more
than a
single type in isolation. In particular, requiring this exclusively for
Utf8View
will leave Utf8 and LargeUtf8 as arrays which *may* legally have non-utf8
masked
values but cannot be consumed by arrow-rs. Furthermore, this has been a
pain point
for the arrow format in the past since consumers would frequently benefit
from such
guarantees but have no way to negotiate them with producers. For example, R
uses a
specific NaN [1] to represent missing floating point values, and before 2.0
pandas
did something similar.

I think the simplest way to express this in an arrow schema would be a new
canonical
metadata field, `ARROW:masked_value_guarantee` or so: if the KeyValue is
present and
the value is "safe", then masked values are safe to access as if they were
not masked
and satisfy all type constraints that an unmasked value would. This field
can be set
on a Schema (or maybe on a Field?) by a producer to certify that the safety
guarantee
holds for all data in a file/stream, and optimizations such as arrow-rs'
whole-buffer
utf-8 validation may be performed safely. I'd tentatively suggest that
another useful
value to associate with this metadata key would be "zero"- IE masked
integers and
floats are all zero, strings under null bits are empty, etc.

Relatedly, in the C++ implementation prior to ARROW-2790 [2] masked values
in new arrays were
not initialized... *technically* fine if we never access those values but
if they are zeroed
then arithmetic can be performed independent of the null bitmap without
causing sanitizers
like valgrind to report access of uninitialized values. I think the addition
of this guarantee
on masked values would be a useful formalization of performance/correctness
decisions
made in different ways by multiple implementations.

Supporting this should not require overly invasive changes to the c++
implementation;
I *think* it will only require specialization of the binary->utf8 cast.

Re Utf8View: thanks for your implementation @Raphael! I'll review it soon.
In the interest
of completeness, I also plan to write Utf8View as a canonical extension
type as described
by @Weston to see how that looks.

Sincerely,
Ben Kietzman

[1]:
https://github.com/wch/r-source/blob/e8b7fcc/src/main/arithmetic.c#L90-L98
(R_ValueOfNA)
[2]: https://issues.apache.org/jira/browse/ARROW-2790


Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-28 Thread Benjamin Kietzman
The vote passes with 7 binding +1 votes, 5 non-binding +1 votes, and no
-1/0 votes.

Thanks everyone!

On Thu, Aug 24, 2023 at 10:10 AM Antoine Pitrou  wrote:

>
> +1 on the format additions
>
> The implementations will probably need a bit more review back-and-forth.
>
> Regards
>
> Antoine.
>
>
> Le 28/06/2023 à 21:34, Benjamin Kietzman a écrit :
> > Hello,
> >
> > I'd like to propose adding Utf8View arrays to the arrow format.
> > Previous discussion in [1], columnar format description in [2],
> > flatbuffers changes in [3].
> >
> > There are implementations available in both C++[4] and Go[5] which
> > exercise the new type over IPC. Utf8View format demonstrates[6]
> > significant performance benefits over Utf8 in common tasks.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 add the proposed Utf8View type to the Apache Arrow format
> > [ ] -1 do not add the proposed Utf8View type to the Apache Arrow format
> > because...
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > [2]
> >
> https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
> > [3]
> >
> https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
> > [4] https://github.com/apache/arrow/pull/35628
> > [5] https://github.com/apache/arrow/pull/35769
> > [6] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> >
>


[DISCUSS] Unsigned integers in Utf8View

2023-09-12 Thread Benjamin Kietzman
Hello all,

Utf8View was recently accepted [1] and I've opened a PR to add the
spec/schema changes [2]. In review [3], it was requested that signed 32 bit
integers be used for the fields of view structs instead of 32 bit unsigned.

This divergence has been discussed on the ML previously [4], but in light
of my reviewer's request for a change it should be raised again for focused
discussion. (At this stage, I don't *think* the change would require
another vote.) I'll enumerate the motivations for signed and unsigned as I
understand them.

Signed:
- signed integers are conventional in the arrow format
- unsigned integers may cause some difficulty of implementation in
languages which don't natively support them

Unsigned:
- unsigned integers are used by engines which already implement Utf8View

My own bias is toward compatibility with existing implementers, but using
signed integers will only affect the case of arrays which include data
buffers larger than 2GB. For reference, the default buffer size in velox is
32KB so such a massive data buffer would only occur when a single slot of a
string array has 2.1GB of characters. This seems sufficiently unlikely that
I wouldn't consider it a blocker.

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
[2] https://github.com/apache/arrow/pull/37526
[3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
[4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
[5]
https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50


Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

2023-09-19 Thread Benjamin Kietzman
Hello again all,

It seems there hasn't been much interest in this point so I'm leaning
toward keeping unsigned integers. If anyone has a concern please respond
here and/or on the PR [1].

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022

On Thu, Sep 14, 2023 at 9:31 AM David Li  wrote:

> I think Java was usually raised as the odd child out when this has come up
> before. Since Java 8 there are standard library methods to manipulate
> signed integers as if they were unsigned, so in principle Java shouldn't be
> a blocker anymore.
>
> That said, ByteBuffer is still indexed by int so in practice Java wouldn't
> be able to handle more than 2 GB in a single buffer, at least until we can
> use the Java 21+ APIs (MemorySegment is finally indexed by (signed) long).
>
> On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > Hello all,
> >
> > Utf8View was recently accepted [1] and I've opened a PR to add the
> > spec/schema changes [2]. In review [3], it was requested that signed 32
> bit
> > integers be used for the fields of view structs instead of 32 bit
> unsigned.
> >
> > This divergence has been discussed on the ML previously [4], but in light
> > of my reviewer's request for a change it should be raised again for
> focused
> > discussion. (At this stage, I don't *think* the change would require
> > another vote.) I'll enumerate the motivations for signed and unsigned as
> I
> > understand them.
> >
> > Signed:
> > - signed integers are conventional in the arrow format
> > - unsigned integers may cause some difficulty of implementation in
> > languages which don't natively support them
> >
> > Unsigned:
> > - unsigned integers are used by engines which already implement Utf8View
> >
> > My own bias is toward compatibility with existing implementers, but using
> > signed integers will only affect the case of arrays which include data
> > buffers larger than 2GB. For reference, the default buffer size in velox
> is
> > 32KB so such a massive data buffer would only occur when a single slot
> of a
> > string array has 2.1GB of characters. This seems sufficiently unlikely
> that
> > I wouldn't consider it a blocker.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > [2] https://github.com/apache/arrow/pull/37526
> > [3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > [5]
> >
> https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
>


Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

2023-09-20 Thread Benjamin Kietzman
Hello all,

Thanks for the input!

@Will

> Could we name which ones it would be compatible with?

UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for size.

> As I understood it originally, the current implementations use raw
pointers

Yes, these implementations use raw pointers exclusively and for
transmission via IPC
those will require conversion. The compatibility we'd achieve by using
unsigned
integers is mostly one of documentation and validation; the above engines
would
accept a string as long as `2**32 - 1`, but with signed integers arrow
would reject
it. This is a distant edge case so I'd say it's acceptable to exclude it.
(For completeness it should be noted that conversion between signed and
unsigned is
typically a trivial operation and shouldn't be considered as adding to any
runtime cost.)

@Dewey
> ... the spec specifically discourages their use.
> If the times have changed and this is no longer the case perhaps there
> should be a wider effort to support unsigned values in other places

I'm not sure which signed integers you're thinking of. At the risk of
constructing a
strawman, I would say that for example we shouldn't extend the offsets in
the List type
to support unsigned integers. This would require adding a new type named
UnsignedOffsetsList or so, and I don't think that would offer a useful
difference from
the existing List type. I would recommend waiting for implementers to
motivate such a
change by proposing new cases where unsigned integers would be useful to
support
(as with unsigned dictionary indices).

> I would lean towards signed integers because we don't use unsigned
> integers anywhere in the existing specification

It's true this is the convention in the spec. Since you and @pitrou have
both mentioned
it (and we haven't heard strong opinions in support of unsigned sizes), I
think that's
sufficient grounds to avoid the exception.

Sincerely,
Ben Kietzman

[1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section 3.1,
second paragraph)
[2]
https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255
[3]
https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210

On Tue, Sep 19, 2023 at 4:50 PM Will Jones  wrote:

> Hi Ben,
>
> I'm open to the idea of using unsigned if it allows compatibility with an
> existing implementation or two. Could we name which ones it would be
> compatible with? Links to implementation code would be very welcome, if
> available.
>
> As I understood it originally, the current implementations use raw pointers
> rather than offsets into buffers, so these values already weren't
> compatible to begin with. For example, it seems like Velox uses raw
> pointers into buffers [1]. So unless I'm missing something, it seems like
> these implementations will have to map the pointers to offsets anyways, so
> maybe it's not much more trouble to convert to signed integers on the way.
>
> Best,
>
> Will Jones
>
> [1]
>
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78
>
> On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman 
> wrote:
>
> > Hello again all,
> >
> > It seems there hasn't been much interest in this point so I'm leaning
> > toward keeping unsigned integers. If anyone has a concern please respond
> > here and/or on the PR [1].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> >
> > On Thu, Sep 14, 2023 at 9:31 AM David Li  wrote:
> >
> > > I think Java was usually raised as the odd child out when this has come
> > up
> > > before. Since Java 8 there are standard library methods to manipulate
> > > signed integers as if they were unsigned, so in principle Java
> shouldn't
> > be
> > > a blocker anymore.
> > >
> > > That said, ByteBuffer is still indexed by int so in practice Java
> > wouldn't
> > > be able to handle more than 2 GB in a single buffer, at least until we
> > can
> > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed)
> > long).
> > >
> > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > > Hello all,
> > > >
> > > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > > spec/schema changes [2]. In review [3], it was requested that signed
> 32
> > > bit
> > > > integers be used for the fields of view structs instead of 32 bit
> > > > unsigned.

Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

2023-09-20 Thread Benjamin Kietzman
Thanks for the feedback, everyone!

I'm calling this in favor of signed integers. That's the last change
requested on the format PR, so I'll probably merge shortly.


On Wed, Sep 20, 2023, 17:43 Matt Topol  wrote:

> Just to chime in (and add yet another voice into the mix here), I'd have a
> preference for it being signed integers for the same reasons as most
> everyone else: consistency with everything else in the spec. Since we use
> signed integers everywhere, I'd prefer to keep it consistent rather than
> introduce an exception.
>
> --Matt
>
> On Wed, Sep 20, 2023 at 11:19 AM Benjamin Kietzman 
> wrote:
>
> > Hello all,
> >
> > Thanks for the input!
> >
> > @Will
> >
> > > Could we name which ones it would be compatible with?
> >
> > UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for
> size.
> >
> > > As I understood it originally, the current implementations use raw
> > pointers
> >
> > Yes, these implementations use raw pointers exclusively and for
> > transmission via IPC
> > those will require conversion. The compatibility we'd achieve by using
> > unsigned
> > integers is mostly one of documentation and validation; the above engines
> > would
> > accept a string as long as `2**32 - 1`, but with signed integers arrow
> > would reject
> > it. This is a distant edge case so I'd say it's acceptable to exclude it.
> > (For completeness it should be noted that conversion between signed and
> > unsigned is
> > typically a trivial operation and shouldn't be considered as adding to
> any
> > runtime cost.)
> >
> > @Dewey
> > > ... the spec specifically discourages their use.
> > > If the times have changed and this is no longer the case perhaps there
> > > should be a wider effort to support unsigned values in other places
> >
> > I'm not sure which signed integers you're thinking of. At the risk of
> > constructing a
> > strawman, I would say that for example we shouldn't extend the offsets in
> > the List type
> > to support unsigned integers. This would require adding a new type named
> > UnsignedOffsetsList or so, and I don't think that would offer a useful
> > difference from
> > the existing List type. I would recommend waiting for implementers to
> > motivate such a
> > change by proposing new cases where unsigned integers would be useful to
> > support
> > (as with unsigned dictionary indices).
> >
> > > I would lean towards signed integers because we don't use unsigned
> > > integers anywhere in the existing specification
> >
> > It's true this is the convention in the spec. Since you and @pitrou have
> > both mentioned
> > it (and we haven't heard strong opinions in support of unsigned sizes), I
> > think that's
> > sufficient grounds to avoid the exception.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section
> 3.1,
> > second paragraph)
> > [2]
> >
> >
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255
> > [3]
> >
> >
> https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210
> >
> > On Tue, Sep 19, 2023 at 4:50 PM Will Jones 
> > wrote:
> >
> > > Hi Ben,
> > >
> > > I'm open to the idea of using unsigned if it allows compatibility with
> an
> > > existing implementation or two. Could we name which ones it would be
> > > compatible with? Links to implementation code would be very welcome, if
> > > available.
> > >
> > > As I understood it originally, the current implementations use raw
> > pointers
> > > rather than offsets into buffers, so these values already weren't
> > > compatible to begin with. For example, it seems like Velox uses raw
> > > pointers into buffers [1]. So unless I'm missing something, it seems
> like
> > > these implementations will have to map the pointers to offsets anyways,
> > so
> > > maybe it's not much more trouble to convert to signed integers on the
> > way.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L4

[DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Benjamin Kietzman
Hello all,

In the PR to add support for Utf8View to the c++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4


Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Benjamin Kietzman
Hello all,

@Gang
> Could you please simply describe the layout of DuckDB and Velox

Arrow represents long (>12 bytes) strings with a view which includes
a buffer index (used to look up one of the variadic data buffers)
and an offset (used to find the start of a string's bytes within the
indicated buffer). DuckDB and Velox by contrast have a raw pointer
directly to the start of the string's bytes. Since these occupy the
same 8 bytes of a view, it's possible and fairly efficient to convert
from one representation to the other by modifying those 8 bytes in place.

@Raphael
> Is the motivation here to avoid DuckDB and Velox having to duplicate the
conversion logic from pointer-based to offset-based, or to allow
arrow-cpp to operate directly on pointer-based arrays?

It's more the latter; arrow C++ is intended to be useful as more than an IPC
serializer/deserializer, so it is beneficial to be able to import arrays
and also operate on them with no conversion cost. However it's also worth
noting that the raw pointer representation is more efficient on access,
albeit more expensive to validate along with a number of other tradeoffs.
In order to progress this work, I took this hybrid approach in part to defer
the question of which representation is preferred in which context. I would
like to allow the C++ library freedom to extract as much performance from
this type as possible, internally as well as when communicating with other
engines.

@Antoine
> What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.

We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property. I don't think
it's illegal or unreasonable for an implementation to diverge in their
internal handling of arrow data (whether to achieve performance,
consistency, or convenience).

> I'm not sure how DuckDB and Velox data could be exposed, but it could be
for example an extension type with a fixed_size_binary<16> storage type.

This wouldn't allow for the transmission of the variadic data buffers
which (even in the presence of raw pointer views) are necessary to
guarantee the lifetime of string data in the vector. Alternatively we
could use Utf8View with the high and low bits of the raw pointer
packed into the index and offset, but I don't think this would be any less
tantamount to an unofficial arrow format.

Sincerely,
Ben Kietzman


On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou  wrote:

>
> Hello,
>
> What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were. Most users will probably not read the
> official format spec, but will simply trust the official Arrow
> implementations. So the official Arrow implementations have an
> obligation to faithfully represent the Arrow format and not breed
> confusion.
>
> So I'm -1 on the way the PR presents things currently.
>
> I'm not sure how DuckDB and Velox data could be exposed, but it could be
> for example an extension type with a fixed_size_binary<16> storage type.
>
> Regards
>
> Antoine.
>
>
>
> Le 26/09/2023 à 22:34, Benjamin Kietzman a écrit :
> > Hello all,
> >
> > In the PR to add support for Utf8View to the c++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]:
> >
> https://github.com/apac

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Benjamin Kietzman
und in the C++ library). The raw pointer version also cannot be
> > >validated, but I see validation as less of a requirement and more of a
> > >"nice to have" (I realize others see validation as more of a
> requirement).
> > >
> > >* I see the raw-pointer type as having more net utility (going back to
> > the
> > >original motivation), but I also see how it is problematic for some
> > non-C++
> > >implementations.
> > >* The index-offset version has intrinsic value over the existing "dense"
> > >varbinary layout (per some of the benefits above) but does not satisfy
> the
> > >external interoperability goal with systems that are becoming more
> popular
> > >month over month
> > >* Incoming data from external systems that use the raw pointer model
> have
> > >to be serialized (and perhaps repacked) to the index-offset model. This
> > >isn't ideal — going the other way (from index-offset to raw pointer) is
> > >just a pointer swizzle, comparatively inexpensive.
> > >
> > >So it seems like we have several paths available, none of them wholly
> > >satisfactory:
> > >
> > >1. Essentially what's in the existing PR — the raw pointer variant which
> > is
> > >"non-standard"
> > >2. Pick one and only one for in memory — I think the raw pointer version
> > is
> > >more useful given that swizzling from index-offset is pretty cheap. But
> > the
> > >raw pointer version can't be validated safely and is problematic for
> e.g.
> > >Rust. Picking the index-offset version means that the external ecosystem
> > of
> > >columnar engines won't be that much closer aligned to Arrow than they
> are
> > >now.
> > >3. Implement the raw pointer variant as an extension type in C++ / C
> ABI.
> > >This seems potentially useful but given that it would likely be
> disfavored
> > >for data originating from Arrow-land, there would be fewer scenarios
> where
> > >zero-copy interop for strings is achieved
> > >
> > >This is difficult and I don't know what the best answer is, but
> personally
> > >my inclination has been toward choices that are utilitarian and help
> with
> > >alignment and cohesion in the open source ecosystem.
> > >
> > >- Wes
> > >
> > >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> To make things clear, any of the factory functions listed below
> create a
> > >> type that maps exactly onto an Arrow columnar layout:
> > >>
> >
> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
> > >>
> > >> For example, calling `arrow::dictionary` creates a dictionary type
> that
> > >> exactly represents the dictionary layout specified in
> > >>
> > >>
> >
> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
> > >>
> > >> Similarly, if you use any of the builders listed below, what you will
> > >> get at the end is data that complies with the Arrow columnar
> > specification:
> > >> https://arrow.apache.org/docs/dev/cpp/api/builder.html
> > >>
> > >> All the core Arrow C++ APIs create and process data which complies
> with
> > >> the Arrow specification, and which is interoperable with other Arrow
> > >> implementations.
> > >>
> > >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> > >> etc.) goes through dedicated converters. There is no ambiguity.
> > >>
> > >>
> > >> Creating top-level utilities that create non-Arrow data introduces
> > >> confusion and ambiguity as to what Arrow is. Users who haven't studied
> > >> the spec in detail - which is probably most users of Arrow
> > >> implementations - will call `arrow::string_view(raw_pointers=true)`
> and
> > >> might later discover that their data cannot be shared with other
> > >> implementations (or, if it can, there will be an unsuspected
> conversion
> > >> cost at the edge).
> > >>
> > >> It also creates a risk of introducing a parallel Arrow-like ecosystem
> > >> based on the superset of data layouts understood by Arrow C++. People
> > >> may feel encouraged to code for that ecosystem, pessimizing
> > >> interoperability with non-C++ runtimes.
> > >>

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread Benjamin Kietzman
+1

On Fri, Sep 29, 2023 at 10:51 AM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Yes, ListView is an implementation of Velox's ArrayVector [1] ("vector of
> arrays"). In Arrow we would naturally refer to them as "array of lists",
> but `ListArray` is taken by the existing offset-only list formats.
> Following the pattern adopted by other types in Arrow that use offsets and
> sizes, we adopt the suffix -View to differentiate list-views from lists.
>
> Velox doesn't offer the 64-bit variation, but since Arrow has both List and
> LargeList, it was natural to pair them with ListView and LargeListView.
>
> [2] is a link to the point of a talk by Mark Raasveldt where he describes
> the DuckDB list representation. Early in the talk, one of the slides [3]
> mentions how these formats were "co-designed together with Velox team".
>
> --
> Felipe
>
> [1]
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> [2] https://youtu.be/bZOvAKGkzpQ?si=wgSwew3Ck8utteOI&t=1569
> [3] https://15721.courses.cs.cmu.edu/spring2023/slides/22-duckdb.pdf
>
> On Fri, Sep 29, 2023 at 9:32 AM Raphael Taylor-Davies
>  wrote:
>
> > Hi Felipe,
> >
> > Can I confirm that DuckDB and Velox use the same encoding for these
> > types, and so we aren't going to run into similar issues as [1]?
> >
> > Kind Regards,
> >
> > Raphael Taylor-Davies
> >
> > [1]: https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
> >
> > On 29/09/2023 13:09, Felipe Oliveira Carvalho wrote:
> > > Hello,
> > >
> > > I'd like to propose adding ListView and LargeListView arrays to the
> Arrow
> > > format.
> > > Previous discussion in [1][2], columnar format description and
> > flatbuffers
> > > changes in [3].
> > >
> > > There are implementations available in both C++ [4] and Go [5]. I'm
> > working
> > > on the integration tests which I will push to one of the PR branches
> > before
> > > they are merged. I've made a graph illustrating how this addition
> > affects,
> > > in a backwards compatible way, the type predicates and inheritance
> chain
> > on
> > > the C++ implementation. [6]
> > >
> > > The vote will be open for at least 72 hours not counting the weekend.
> > >
> > > [ ] +1 add the proposed ListView and LargeListView types to the Apache
> > > Arrow format
> > > [ ] -1 do not add the proposed ListView and LargeListView types to the
> > > Apache Arrow format
> > > because...
> > >
> > > Sincerely,
> > > Felipe
> > >
> > > [1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
> > > [2] https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
> > > [3] https://github.com/apache/arrow/pull/37877
> > > [4] https://github.com/apache/arrow/pull/35345
> > > [5] https://github.com/apache/arrow/pull/37468
> > > [6] https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
> > >
> >
>


Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Benjamin Kietzman
@Antoine
> By construction, a standard cannot continuously chase the performance
state of
> art, it has to weigh the benefits of performance improvements against the
> increased cost for the ecosystem.

> We have extension types which could reasonably be used for non-standard
> data types, especially the kind that are motivated by leading-edge
> performance research and innovation and come with unusual constraints
> (such as requiring trusting and dereferencing raw pointers embedded in
> data buffers). There could even be an argument for making some of them
> canonical extension types if there's enough anteriority in favor.

I agree that the standard becoming unfocused and à la carte is to be avoided.
However, I would argue that the addition of raw pointer views as a C ABI
extension type presents little of that danger. The addition is entirely
bounded, since it is a binary choice (with or without raw pointers), and it is
semantically identical to the index/offset representation save for how views
are dereferenced.

This last point makes it more nuanced than the cases covered by canonical
extension types. If raw pointer views were a canonical extension using
index/offset views as their storage, they could not be naively validated by
considering the storage alone, since raw pointers could not be validly
reinterpreted as an index/offset pair.
In light of this, and in the context of the C ABI, I'd advocate for a
dedicated data type descriptor [1] for raw pointer views to make the
distinction more obvious, but otherwise I think it is reasonable to consider
them an extension type.

I am currently working on a draft PR adding this type to the C ABI spec by
way of
a proposal for everyone's consideration.

@Raphael
> Do you have any benchmarks comparing kernels with native pointer array
support,
> compared to those that must first convert to the offset representation? I
think
> this would help ground this discussion empirically.

Conversion cost can end up being a bottleneck for rapid operations, much
more so
than the overhead of accessing either representation (hence the advantage of
operating in place on whatever representation we have). For a simple
comparison,
I benchmarked checking random views in each representation for alphanumeric
characters. Raw pointer views have a 6% advantage.

```
Run on (16 X 5300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
---
Benchmark Time CPU
Iterations UserCounters...
---
IsAlnum 24086566 ns 24084367 ns
  29 bytes_per_second=664.276M/s items_per_second=43.5376M/s
IsAlnum  22613218 ns 22612044 ns
  31 bytes_per_second=707.529M/s items_per_second=46.3725M/s
```

Below is a table of benchmarks for conversion between representations.
To summarize those:

- Conversion cost is strongly dependent on string lengths.
  - I would tentatively focus on `kUsuallyInlineable` as most representative
- Conversion from views to dense strings is slowest since we must copy each
  view's characters into a single character buffer.
  - comparable to parsing 64 bit floats
  - preallocation is not perfect, so the character buffer must be checked for
resizing inside the hot loop
- Conversion from dense strings to views is 2-4x faster
  - comparable to parsing 64 bit integers
  - we only need to allocate once before conversion
  - we still need to access the character buffer in order to copy inline
contents
or cached prefixes into the headers
- Conversion from index/offset to raw pointer views is fairly quick
  - comparable to integer->integer conversion with safe overflow and high
null percentage
- Conversion from raw pointer to index/offset views is 2/3 as fast

```
--
Benchmark
 Time CPU   Iterations UserCounters...
--
ConvertViews<(from type), (to type), (length category)>
--
ConvertViews
   42877057 ns 42875885 ns   16 items_per_second=32.6081M/s
  kUsuallyInlineable>
  34079672 ns 34075604 ns   21 items_per_second=30.772M/s
  kShortButNeverInlineable>
  16044043 ns 16043702 ns   43 items_per_second=34.8573M/s
  kLongAndSeldomInlineable>
   1717984 ns  1717955 ns  376 items_per_second=38.1477M/s
   

Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Benjamin Kietzman
+1

On Fri, Oct 6, 2023, 17:27 Felipe Oliveira Carvalho 
wrote:

> Hello,
>
> I'm writing to propose "+vl" and "+vL" as format strings for list-view and
> large list-view arrays passing through the Arrow C data interface [1].
>
> The previous proposal was considered a bad idea because existing parsers of
> these format strings might be looking at only the first `l` (or `L`) after
> the `+` and assuming the classic list format from that alone, so now I'm
> proposing we start with a `+v` as this prefix is not shared with any other
> existing type so far.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --
> Felipe
>
> [1] https://arrow.apache.org/docs/format/CDataInterface.html
>


[VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Benjamin Kietzman
Hello all,

I propose "vu" and "vz" as format strings for the Utf8View and
BinaryView types in the Arrow C data interface [1].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of these new C data format strings
[ ] +0
[ ] -1 - I'm against adding these new format strings because

Ben Kietzman

[1] https://arrow.apache.org/docs/format/CDataInterface.html


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-23 Thread Benjamin Kietzman
Hello,

We have 6 +1 votes and no -1 votes, so the vote passes.

Thanks, everyone!
Ben Kietzman


On Thu, Oct 19, 2023 at 3:05 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> +1
>
> On Wed, 18 Oct 2023 at 23:33, Jonathan Keane  wrote:
> >
> > +1
> >
> > -Jon
> >
> >
> > On Wed, Oct 18, 2023 at 2:26 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> > > +1
> > >
> > > On Wed, Oct 18, 2023 at 2:49 PM Dewey Dunnington
> > >  wrote:
> > >
> > > > +1!
> > > >
> > > > On Wed, Oct 18, 2023 at 2:14 PM Matt Topol 
> > > wrote:
> > > > >
> > > > > +1
> > > > >
> > > > > On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou  >
> > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Le 18/10/2023 à 19:02, Benjamin Kietzman a écrit :
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > > > > > BinaryView types in the Arrow C data interface [1].
> > > > > > >
> > > > > > > The vote will be open for at least 72 hours.
> > > > > > >
> > > > > > > [ ] +1 - I'm in favor of these new C data format strings
> > > > > > > [ ] +0
> > > > > > > [ ] -1 - I'm against adding these new format strings
> because
> > > > > > >
> > > > > > > Ben Kietzman
> > > > > > >
> > > > > > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > > > > > >
> > > > > >
> > > >
> > >
>


[DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Benjamin Kietzman
Hello all,

The C ABI does not store buffer lengths explicitly, which presents a
problem for Utf8View since buffer lengths are not trivially extractable
from other data in the array. A potential solution is to store the lengths
in an extra buffer after the variadic data buffers. I've adopted this
approach in my (currently draft) PR [1] to add c++ library import/export
for Utf8View, but I thought this warranted raising on the ML in case anyone
has a better idea.

Sincerely,
Ben Kietzman

[1]
https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Benjamin Kietzman
Worth noting: the c data interface explicitly forbids adding new members
[1] to its structs, so simply adding ArrowArray::buffer_sizes is not viable.

[1]
https://github.com/bkietz/arrow/blob/0afb739a16672483b69894c6fe3f5ece5cfc79d8/docs/source/format/CDataInterface.rst?plain=1#L984-L986

On Wed, Oct 25, 2023 at 12:36 PM Benjamin Kietzman 
wrote:

> Hello all,
>
> The C ABI does not store buffer lengths explicitly, which presents a
> problem for Utf8View since buffer lengths are not trivially extractable
> from other data in the array. A potential solution is to store the lengths
> in an extra buffer after the variadic data buffers. I've adopted this
> approach in my (currently draft) PR [1] to add c++ library import/export
> for Utf8View, but I thought this warranted raising on the ML in case anyone
> has a better idea.
>
> Sincerely,
> Ben Kietzman
>
> [1]
> https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-26 Thread Benjamin Kietzman
Congratulations!

On Thu, Oct 26, 2023 at 10:35 AM Dane Pitkin 
wrote:

> Congratulations, Xuwei!
>
> On Thu, Oct 26, 2023 at 9:34 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Congrats!
> >
> > On Wed, 25 Oct 2023 at 08:23, Ian Joiner  wrote:
> > >
> > > Congrats!
> > >
> > > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei 
> wrote:
> > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > > > has accepted an invitation to become a committer on Apache
> > > > Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > --
> > > > kou
> > > >
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Benjamin Kietzman
> Is this buffer lengths buffer only present if the array type is Utf8View?

IIUC, the proposal would add the buffer lengths buffer for all types if the
schema's
flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
the special case and that `n_buffers` would continue to be consistent with
IPC.

On Thu, Oct 26, 2023 at 1:35 PM Weston Pace  wrote:

> Is this buffer lengths buffer only present if the array type is Utf8View?
> Or are you suggesting that other types might want to adopt this as well?
>
> On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington
>  wrote:
>
> > > I expect C code to not be much longer than this :-)
> >
> > nanoarrow's buffer-length-calculation and validation concepts are
> > (perhaps inadvisably) intertwined...even with both it is not that much
> > code (perhaps I was remembering how much time it took me to figure out
> > which 35 lines to write :-))
> >
> > > That sounds a bit hackish to me.
> >
> > Including only *some* buffer sizes in array->buffers[array->n_buffers]
> > special-cased for only two types (or altering the number of buffers
> > required by the IPC format vs. the number of buffers required by the C
> > Data interface) seem equally hackish to me (not that I'm opposed to
> > either necessarily...the alternatives really are very bad).
> >
> > > How can you *not* care about buffer sizes, if you for example need to
> > send the buffers over IPC?
> >
> > I think IPC is the *only* operation that requires that information?
> > (Other than perhaps copying to another device?) I don't think there's
> > any barrier to accessing the content of all the array elements but I
> > could be mistaken.
> >
> > On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou 
> wrote:
> > >
> > >
> > > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > > > The lack of buffer sizes is something that has come up for me a few
> > > > times working with nanoarrow (which dedicates a significant amount of
> > > > code to calculating buffer sizes, which it uses to do validation and
> > > > more efficient copying).
> > >
> > > By the way, this is a bit surprising since it's really 35 lines of code
> > > in C++ currently:
> > >
> > >
> >
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
> > >
> > > I expect C code to not be much longer than this :-)
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-27 Thread Benjamin Kietzman
> This begs the question of what happens if a consumer receives an unknown
> flag value.

It seems to me that ignoring unknown flags is the primary case to consider at
this point: since consumers may ignore flags they do not recognize, adding any
flag which would break such a consumer would be tantamount to an ABI breakage.
I don't think this can be averted unless all consumers are required to error
out on unknown flag values.

In the specific case of Utf8View it seems certain that consumers would add
support for the buffer sizes flag simultaneously with adding support for the
new type (since Utf8View is difficult to import otherwise), so any consumer
which would error out on the new flag would already be erroring out on an
unsupported data type.

> I might be the only person who has implemented
> a deep copy of an ArrowSchema in C, but it does blindly pass along a
> schema's flag value

I think passing along a schema's flag value including unknown flags is an
error. The ABI defines moving structures but does not define deep copying. To
copy deeply in terms of operations which *are* specified, we import and then
re-export the schema. Since this includes an export step, the copy should not
include flags which are not supported by the exporter.

On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou  wrote:

>
> Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> >> Is this buffer lengths buffer only present if the array type is
> Utf8View?
> >
> > IIUC, the proposal would add the buffer lengths buffer for all types if
> the
> > schema's
> > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > the special case and that `n_buffers` would continue to be consistent
> with
> > IPC.
>
> This begs the question of what happens if a consumer receives an unknown
> flag value. We haven't specified that unknown flag values should be
> ignored, so a consumer could judiciously choose to error out instead of
> potentially misinterpreting the data.
>
> All in all, personally I'd rather we make a special case for Utf8View
> instead of adding a flag that can lead to worse interoperability.
>
> Regards
>
> Antoine.
>


[ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread Benjamin Kietzman
On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira
Carvalho
has accepted an invitation to become a committer on Apache
Arrow. Welcome, and thank you for your contributions!

Ben Kietzman


[DISCUSS] Semantics of extension types

2023-12-13 Thread Benjamin Kietzman
Hello all,

Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
any extension type to its storage type in acero. This raises questions
about the validity of applying operations to an extension array's storage.
For example, some extension type authors may intend different ordering for
arrays of their new type than would be applied to the array's storage or
may not intend for the type to participate in arithmetic even though its
storage could.

Suggestions/observations from discussion on that PR included:
- Extension types could provide general semantic description of storage
type equivalence [2], so that a flag on the extension type enables ordering
by storage but disables arithmetic on it
- Compute functions or kernels could be augmented with a filter declaring
which extension types are supported [3].
- Currently arrow-rs considers extension types metadata only [4], so all
kernels treat extension arrays equivalently to their storage.
- Currently arrow c++ only supports explicitly casting from an extension
type to its storage (and from storage to ext), so any operation can be
performed on an extension array's storage but it requires opting in.

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/39200
[2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
[3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
[4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651


Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Benjamin Kietzman
The main problem I see with adding properties to ExtensionType is I'm not
sure where that information would reside. Allowing type authors to declare
in which ways the type is equivalent (or not) to its storage is appealing,
but it seems to need an official extension field like
ARROW:extension:semantics. Otherwise I think each extension type's
semantics would need to be maintained within every implementation as well
as in a central reference (probably in Columnar.rst), which seems
unreasonable to expect of extension type authors. I'm also skeptical that
useful information could be packed into an ARROW:extension:semantics field;
even if the type can declare that ordering-as-with-storage is invalid I
don't think it'd be feasible to specify the correct ordering.

If we cannot attach this information to extension types, the question
becomes which defaults are most reasonable for engines and how can the
engine most usefully be configured outside those defaults. My own
preference would be to refuse operations other than selection or
casting-to-storage, with a runtime extensible registry of allowed implicit
casts. This will allow users of the engine to configure their extension
types as they need, and the error message raised when an implicit
cast-to-storage is not allowed could include the suggestion to register the
implicit cast. For applications built against a specific engine, this
approach would allow recovering much of the advantage of attaching
properties to an ExtensionType by including registration of implicit casts
in the ExtensionType's initialization.

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
wrote:

> Hello all,
>
> Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> any extension type to its storage type in acero. This raises questions
> about the validity of applying operations to an extension array's storage.
> For example, some extension type authors may intend different ordering for
> arrays of their new type than would be applied to the array's storage or
> may not intend for the type to participate in arithmetic even though its
> storage could.
>
> Suggestions/observations from discussion on that PR included:
> - Extension types could provide general semantic description of storage
> type equivalence [2], so that a flag on the extension type enables ordering
> by storage but disables arithmetic on it
> - Compute functions or kernels could be augmented with a filter declaring
> which extension types are supported [3].
> - Currently arrow-rs considers extension types metadata only [4], so all
> kernels treat extension arrays equivalently to their storage.
> - Currently arrow c++ only supports explicitly casting from an extension
> type to its storage (and from storage to ext), so any operation can be
> performed on an extension array's storage but it requires opting in.
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/39200
> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
>


Re: [DISCUSS] Proposal to expand Arrow Communications

2024-02-13 Thread Benjamin Kietzman
Thanks for working on this!

I think the flexibility of independent metadata and data storage and
transfer is quite promising. I've added some comments

On Mon, Feb 12, 2024 at 8:45 PM Matt Topol  wrote:

> Just pinging on this thread to hopefully encourage more comments and
> engagement on the document. I still have to respond to a few of Antoine's
> open comments, but so far there's only been one individual who has given
> feedback.
>
> I've added a large "background context" section at the top of the document
> to hopefully make it easier for individuals to understand the motivation
> and comment on the document.
>
> If there's no objections by the end of the week I'd like to propose a vote
> to officially adopt this and start putting a PR together for updating the
> docs. But personally I'd rather see more comments than have it adopted by
> default because no one objected.
>
> --Matt
>
> On Wed, Feb 7, 2024, 11:50 AM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Matt
> >
> > Thanks for sharing. It looks interesting and I like the idea.
> >
> > Let me review the document and eventually add comments.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Sat, Feb 3, 2024 at 12:22 AM Matt Topol 
> wrote:
> > >
> > > Hey all,
> > >
> > > In my current work I've been experimenting and playing around with
> > > utilizing Arrow and non-cpu memory data. While the creation of the
> > > ArrowDeviceArray struct and the enhancements to the Arrow library
> Device
> > > abstractions were necessary, there is also a need to extend the
> > > communications specs we utilize, i.e. Flight.
> > >
> > > Currently there is no real way to utilize Arrow Flight with shared
> memory
> > > or with non-CPU memory (without an expensive Device -> Host copy
> first).
> > To
> > > this end I've done a bunch of research and toying around and came up
> > with a
> > > protocol to propose and a reference implementation using UCX[1].
> Attached
> > > to the proposal is also a couple extensions for Flight itself to make
> it
> > > easier for users to still use Flight for metadata / dataset information
> > and
> > > then point consumers elsewhere to actually retrieve the data. The idea
> > here
> > > is that this would be a new specification for how to transport Arrow
> data
> > > across these high-performance transports such as UCX / libfabric /
> shared
> > > memory / etc. We wouldn't necessarily expose / directly add
> > implementations
> > > of the spec to the Arrow libraries, just provide reference/example
> > > implementations.
> > >
> > > I've written the proposal up on a google doc[2] that everyone should be
> > > able to comment on. Once we get some community discussion on there, if
> > > everyone is okay with it I'd like to eventually do a vote on adopting this
> > > spec and if we do, I'll then make a PR to start adding it to the Arrow
> > > documentation, etc.
> > >
> > > Anyways, thank you everyone in advance for your feedback and comments!
> > >
> > > --Matt
> > >
> > > [1]: https://github.com/openucx/ucx/
> > > [2]:
> > >
> >
> https://docs.google.com/document/d/1zHbnyK1r6KHpMOtEdIg1EZKNzHx-MVgUMOzB87GuXyk/edit?usp=sharing
> >
>


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-05 Thread Benjamin Kietzman
+1 (binding)

On Tue, Mar 5, 2024, 13:03 Peter Toth  wrote:

> +1 (non-binding)
>
> Parth Chandra  wrote (on Tuesday, March 5, 2024 at 18:15):
>
> > +1 (non-binding)
> >
> > On Sun, Mar 3, 2024 at 8:04 PM Mehmet Ozan Kabak 
> wrote:
> >
> > > +1 (non-binding)
> > > --
> > > *Mehmet Ozan Kabak*
> > > Co-founder & CEO @ Synnada, Inc.
> > >
> > >
> > > On Sun, Mar 3, 2024 at 7:33 PM Jacob Wujciak-Jens
> > >  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > On Mon, Mar 4, 2024 at 3:39 AM Yang Jiang 
> > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On 2024/03/01 18:08:26 Daniël Heres wrote:
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Fri, Mar 1, 2024, 19:05 Chao Sun  wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > On Fri, Mar 1, 2024 at 9:59 AM QP Hou 
> wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > exciting milestone :)
> > > > > > > >
> > > > > > > > On Fri, Mar 1, 2024 at 9:49 AM David Li  >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > On Fri, Mar 1, 2024, at 12:06, Jorge Cardoso Leitão wrote:
> > > > > > > > > > +1 - great work!!!
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 1, 2024 at 5:49 PM Micah Kornfield <
> > > > > > > emkornfi...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> +1 (binding)
> > > > > > > > > >>
> > > > > > > > > >> On Friday, March 1, 2024, Uwe L. Korn  >
> > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > +1 (binding)
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Mar 1, 2024, at 2:37 PM, Andy Grove wrote:
> > > > > > > > > >> > > +1 (binding)
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Fri, Mar 1, 2024 at 6:20 AM Weston Pace <
> > > > > > > weston.p...@gmail.com
> > > > > > > > >
> > > > > > > > > >> > wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > >> +1 (binding)
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb <
> > > > > > > al...@influxdata.com
> > > > > > > > >
> > > > > > > > > >> > wrote:
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> > Hello,
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > As we have discussed[1][2] I would like to vote
> on
> > > the
> > > > > > > > proposal to
> > > > > > > > > >> > >> > create a new Apache Top Level Project for
> > DataFusion.
> > > > The
> > > > > > > text
> > > > > > > > of
> > > > > > > > > >> the
> > > > > > > > > >> > >> > proposed resolution and background document is
> > > > > copy/pasted
> > > > > > > > below
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > If the community is in favor of this, we plan to
> > > submit
> > > > > the
> > > > > > > > > >> resolution
> > > > > > > > > >> > >> > to the ASF board for approval with the next Arrow
> > > > report
> > > > > (for
> > > > > > > > the
> > > > > > > > > >> > >> > April 2024 board meeting).
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > The vote will be open for at least 7 days.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > [ ] +1 Accept this Proposal
> > > > > > > > > >> > >> > [ ] +0
> > > > > > > > > >> > >> > [ ] -1 Do not accept this proposal because...
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > Andrew
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > [1]
> > > > > > > > > >>
> > > > > https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> > > > > > > > > >> > >> > [2]
> > > > > > > > https://github.com/apache/arrow-datafusion/discussions/6475
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > -- Proposed Resolution -
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > Resolution to Create the Apache DataFusion
> Project
> > > from
> > > > > the
> > > > > > > > Apache
> > > > > > > > > >> > >> > Arrow DataFusion Sub Project
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> >
> > > > > =
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > X. Establish the Apache DataFusion Project
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > WHEREAS, the Board of Directors deems it to be in
> > the
> > > > > best
> > > > > > > > > >> > >> > interests of the Foundation and consistent with
> the
> > > > > > > > > >> > >> > Foundation's purpose to establish a Project
> > > Management
> > > > > > > > > >> > >> > Committee charged with the creation and
> maintenance
> > > of
> > > > > > > > > >> > >> > open-source software related to an extensible
> query
> > > > > engine
> > > > > > > > > >> > >> > for distribution at no charge to the public.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > NOW, THEREFORE, BE IT RESOLVED, that a Project
> > > > Management
> > > > > > > > > >> > >> > Committee (PMC), to be known as the "Apache
> > > DataFusion
> > > > > > > > Project",
> > > > > > > > > >> > >> > be and hereby is established pursuant to Bylaws
> of
> > > the
> > >

[DISCUSS][C++] Help needed to refactor Skyhook

2024-03-13 Thread Benjamin Kietzman
Skyhook [1] enables efficient predicate and projection pushdown from
Arrow Dataset to a Ceph storage cluster. This is very cool
functionality, but it's tightly coupled to the Arrow C++ Dataset
implementation in a way which blocks refactoring. In the Arrow C++
codebase today, Acero is designed specifically to handle projection
and filtration in a more modular fashion, and to accept configuration
from standardized plan/expression formats like Substrait. In light of
improvements to Dataset which are not possible while maintaining
Skyhook in its current form, we need volunteers to update Skyhook.
Please reply to let us know if you are actively using Skyhook or if
you are interested in helping to refactor Skyhook.

Sincerely,
Ben Kietzman

[1]
https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/


Re: [DISCUSS][C++] Help needed to refactor Skyhook

2024-03-18 Thread Benjamin Kietzman
I'll put recommendations for the design on the issue. Thanks!

On Fri, Mar 15, 2024 at 2:03 PM Aldrin  wrote:

> I created a new issue [1] to track the refactoring. Could you clarify the
> request (here or in the issue)?
>
> My understanding is that the Skyhook file format code [2] should be
> refactored to use a higher-level interface rather than using
> dataset::FileFormat and dataset::FragmentScanOptions directly [3].
>
> I am assuming the reference to Acero and Substrait to be only for context
> and not necessarily a preferred direction. If that is the preferred
> direction, there is something much more general in progress that we can
> perhaps specialize as a replacement for the Skyhook file format, but I'm
> not sure that's what's actually being requested.
>
> Thank you!
>
>
> [1]: https://github.com/apache/arrow/issues/40583
> [2]: https://github.com/apache/arrow/tree/main/cpp/src/skyhook
> [3]:
> https://github.com/apache/arrow/blob/main/cpp/src/skyhook/cls/cls_skyhook.cc#L153-L156
>
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> On Thursday, March 14th, 2024 at 09:10, Jayjeet Chakraborty <
> jayjeetchakrabort...@gmail.com> wrote:
>
> > Hi Ben, I am willing to help out with the refactor too !
> >
>
> > On Wed, Mar 13, 2024 at 9:25 PM Aldrin octalene....@pm.me.invalid wrote:
> >
>
> > > I am interested in helping to refactor!
> > >
>
> > > -Aldrin
> > >
>
> > > On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman  > > >
> wrote:
> > >
>
> > > Skyhook [1] enables efficient predicate and projection pushdown from
> > > Arrow Dataset to a Ceph storage cluster. This is very cool
> > > functionality, but it's tightly coupled to the Arrow C++ Dataset
> > > implementation in a way which blocks refactoring. In the Arrow C++
> > > codebase today, Acero is designed specifically to handle projection
> > > and filtration in a more modular fashion, and to accept configuration
> > > from standardized plan/expression formats like Substrait. In light of
> > > improvements to Dataset which are not possible while maintaining
> > > Skyhook in its current form, we need volunteers to update Skyhook.
> > > Please reply to let us know if you are actively using Skyhook or if
> > > you are interested in helping to refactor Skyhook.
> > >
>
> > > Sincerely,
> > > Ben Kietzman
> > >
>
> > > [1]
> > >
>
> > >
> https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/
> >
>
> >
>
> > --
> > Jayjeet Chakraborty
> > CS PhD student
> > UC Santa Cruz
> > California, USA


Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-03-27 Thread Benjamin Kietzman
+1

On Tue, Mar 26, 2024, 18:36 Matt Topol  wrote:

> Should I start a new thread for a new vote? Or repeat the original vote
> email here?
>
> Just asking since there hasn't been any responses so far.
>
> --Matt
>
> On Thu, Mar 21, 2024 at 11:46 AM Matt Topol 
> wrote:
>
> > Absolutely, it will be marked experimental until we see some people using
> > it and can get more real-world feedback.
> >
> > There's also already a couple things that will be followed-up on after
> the
> > initial adoption for expansion which were discussed in the comments.
> >
> > On Thu, Mar 21, 2024, 11:42 AM David Li  wrote:
> >
> >> I think let's try again. Would it be reasonable to declare this
> >> 'experimental' for the time being, just as we did with Flight/Flight
> >> SQL/etc?
> >>
> >> On Tue, Mar 19, 2024, at 15:24, Matt Topol wrote:
> >> > Hey All, It's been another month and we've gotten a whole bunch of
> >> feedback
> >> > and engagement on the document from a variety of individuals. Myself
> >> and a
> >> > few others have proactively attempted to reach out to as many third
> >> parties
> >> > as we could, hoping to pull more engagement also. While it would be
> >> great
> >> > to get even more feedback, the comments have slowed down and we
> haven't
> >> > gotten anything in a few days at this point.
> >> >
> >> > If there's no objections, I'd like to try to open up for voting again
> to
> >> > officially adopt this as a protocol to add to our docs.
> >> >
> >> > Thanks all!
> >> >
> >> > --Matt
> >> >
> >> > On Sat, Mar 2, 2024 at 6:43 PM Paul Whalen 
> wrote:
> >> >
> >> >> Agreed that it makes sense not to focus on in-place updating for this
> >> >> proposal.  I’m not even sure it’s a great fit as a “general purpose”
> >> Arrow
> >> >> protocol, because of all the assumptions and restrictions required as
> >> you
> >> >> noted.
> >> >>
> >> >> I took another look at the proposal and don’t think there’s anything
> >> >> preventing in-place updating in the future - ultimately the data body
> >> could
> >> >> just be in the same location for subsequent messages.
> >> >>
> >> >> Thanks!
> >> >> Paul
> >> >>
> >> >> On Fri, Mar 1, 2024 at 5:28 PM Matt Topol 
> >> wrote:
> >> >>
> >> >> > > @pgwhalen: As a potential "end user developer," (and aspiring
> >> >> > contributor) this
> >> >> > immediately excited me when I first saw it.
> >> >> >
> >> >> > Yay! Good to hear that!
> >> >> >
> >> >> > > @pgwhalen: And it wasn't clear to me whether updating batches in
> >> >> > place (and the producer/consumer coordination that comes with that)
> >> was
> >> >> > supported or encouraged as part of the proposal.
> >> >> >
> >> >> > So, updating batches in place was not a particular use-case we were
> >> >> > targeting with this approach. Instead using shared memory to
> produce
> >> and
> >> >> > consume the buffers/batches without having to physically copy the
> >> data.
> >> >> > Trying to update a batch in place is a dangerous prospect for a
> >> number of
> >> >> > reasons, but as you've mentioned it can technically be made safe if
> >> the
> >> >> > shape is staying the same and you're only modifying fixed-width
> data
> >> >> types
> >> >> > (i.e. not only is the *shape* unchanged but the sizes of the
> >> underlying
> >> >> > data buffers are also remaining unchanged). The producer/consumer
> >> >> > coordination that would be needed for updating batches in place is
> >> not
> >> >> part
> >> >> > of this proposal but is definitely something we can look into as a
> >> >> > follow-up to this for extending it. There's a number of discussions
> >> that
> >> >> > would need to be had around that so I don't want to add on another
> >> >> > complexity to this already complex proposal.
> >> >> >
> >> >> > That said, if you or anyone see something in this proposal that
> would
> >> >> > hinder or prevent being able to use it for your use case please let
> >> me
> >> >> know
> >> >> > so we can address it. Even though the proposal as it currently
> exists
> >> >> > doesn't fully support the in-place updating of batches, I don't
> want
> >> to
> >> >> > make things harder for us in such a follow-up where we'd end up
> >> requiring
> >> >> > an entirely new protocol to support that.
> >> >> >
> >> >> > > @octalene.dev: I know of a third party that is interested in
> >> Arrow for
> >> >> > HPC environments that could be interested in the proposal and I can
> >> see
> >> >> if
> >> >> > they're interested in providing feedback.
> >> >> >
> >> >> > Awesome! Thanks much!
> >> >> >
> >> >> >
> >> >> > For reference to anyone who hasn't looked at the document in a
> while,
> >> >> since
> >> >> > the original discussion thread on this I have added a full
> >> "Background
> >> >> > Context" page to the beginning of the proposal to help anyone who
> >> isn't
> >> >> > already familiar with the issues this protocol is trying to solve
> or
> >> >> isn't
> >> >> > already familiar with ucx or libfabric transports to better
> >> understand
> 

Re: [C++][Discuss] Switch to C++14

2021-05-27 Thread Benjamin Kietzman
I'm definitely in favor of going to c++14.

While we're at it: which platforms prevent us from using c++17?

On Thu, May 27, 2021, 04:03 Antoine Pitrou  wrote:

>
> Hello,
>
> It seems the only two platforms that constrained us to C++11 will not be
> supported anymore (those platforms are RTools 3.5 for R packages, and
> manylinux1 for Python packages).
>
> It would be beneficial to bump our C++ requirement to C++14.  There is
> an issue open listing benefits:
> https://issues.apache.org/jira/browse/ARROW-12816
>
> An additional benefit is that some useful third-party libraries for us
> may or will require C++14, including in their headers.
>
> Is anyone opposed to doing the switch?  Please speak up.
>
> Best regards
>
> Antoine.
>


Re: C++ Migrate from Arrow 0.16.0

2021-05-27 Thread Benjamin Kietzman
unique_ptr is used to designate unique ownership of the buffer
just created. It's fairly compatible with shared_ptr since
unique_ptr can convert implicitly to shared_ptr.

One other refactoring in play here: we've been moving from
Status-returning-out-argument functions to the more ergonomic
Result. I'd recommend you write a new macro for dealing with
Results, like:

#define ASSIGN_OR_THROW_CONCAT(x, y) x##y
#define ASSIGN_OR_THROW_IMPL(result_name, lhs, rexpr) \
  auto&& result_name = (rexpr);                       \
  THROW_NOT_OK((result_name).status());               \
  lhs = std::move(result_name).ValueUnsafe();
#define ASSIGN_OR_THROW(lhs, rexpr) \
  ASSIGN_OR_THROW_IMPL(ASSIGN_OR_THROW_CONCAT(_maybe, __COUNTER__), lhs, rexpr)

Then lines such as
https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L196
can be rewritten as:

ASSIGN_OR_THROW(buffer, arrow::AllocateBuffer(length));

Does that help?

On Thu, May 27, 2021 at 3:47 PM Rares Vernica  wrote:

> Hello,
>
> We are trying to migrate from Arrow 0.16.0 to a newer version, hopefully up
> to 4.0.0. The Arrow 0.17.0 change in AllocateBuffer from taking a
> shared_ptr<Buffer> out-argument to returning a unique_ptr<Buffer> is making
> things very difficult. We wonder if there is a strong reason behind the
> change from shared_ptr to unique_ptr and if there is an easier path forward
> for us.
>
> In our code, we interchangeably use Buffer and ResizableBuffer. We pass
> around these pointers across a number of classes. They are allocated or
> resized here
> https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L191
> Moreover,
> we cast the ResizableBuffer instance to Buffer in order to have all our
> methods only deal with Buffer, here
> https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L151
>
> In Arrow 0.16.0 AllocateBuffer took a shared_ptr<Buffer> out-argument and
> this works fine. In Arrow 0.17.0 AllocateBuffer returns a
> unique_ptr<Buffer>. Our cast from ResizableBuffer to Buffer won't work on
> unique_ptr and we won't be able to pass the Buffer around so easily.
>
> I noticed that there is another AllocateBuffer in MemoryManager that
> returns a shared_ptr.
>
> https://arrow.apache.org/docs/cpp/api/memory.html?highlight=resizablebuffer#_CPPv4N5arrow13MemoryManager14AllocateBufferE7int64_t
> Is this a better alternative to allocate a buffer? Is there a similar
> method to allocate a resizable buffer?
>
> Thank you,
> Rares
>


Re: C++ Migrate from Arrow 0.16.0

2021-05-27 Thread Benjamin Kietzman
Yes this is an adaptation of ARROW_ASSIGN_OR_RAISE for
their bridge, which seems to throw exceptions instead of returning
Status/Result

On Thu, May 27, 2021 at 4:42 PM Micah Kornfield 
wrote:

> For the macro, I believe ARROW_ASSIGN_OR_RAISE already does this?
>
> On Thu, May 27, 2021 at 1:19 PM Benjamin Kietzman 
> wrote:
>
> > unique_ptr is used to designate unique ownership of the buffer
> > just created. It's fairly compatible with shared_ptr since
> > unique_ptr can convert implicitly to shared_ptr.
> >
> > One other refactoring in play here: we've been moving from
> > Status-returning-out-argument functions to the more ergonomic
> > Result. I'd recommend you write a new macro for dealing with
> > Results, like:
> >
> > #define ASSIGN_OR_THROW_IMPL(result_name, lhs, rexpr) \
> > auto&& result_name = (rexpr); \
> > THROW_NOT_OK((result_name).status()); \
> > lhs = std::move(result_name).ValueUnsafe();
> > #define ASSIGN_OR_THROW(lhs, rexpr) \
> > ASSIGN_OR_THROW_IMPL(_maybe ## __COUNTER__, lhs, rexpr)
> >
> > Then lines such as
> > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L196
> > can be rewritten as:
> >
> > ASSIGN_OR_THROW(buffer, arrow::AllocateBuffer(length));
> >
> > Does that help?
> >
> > On Thu, May 27, 2021 at 3:47 PM Rares Vernica 
> wrote:
> >
> > > Hello,
> > >
> > > We are trying to migrate from Arrow 0.16.0 to a newer version,
> hopefully
> > up
> > > to 4.0.0. The Arrow 0.17.0 change in AllocateBuffer from taking a
> > > shared_ptr to returning a unique_ptr is making things
> > very
> > > difficult. We wonder if there is a strong reason behind the change from
> > > shared_ptr to unique_ptr and if there is an easier path forward for us.
> > >
> > > In our code, we interchangeably use Buffer and ResizableBuffer. We pass
> > > around these pointers across a number of classes. They are allocated or
> > > resized here
> > > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L191
> > > Moreover,
> > > we cast the ResizableBuffer instance to Buffer in order to have all our
> > > methods only deal with Buffer, here
> > > https://github.com/Paradigm4/bridge/blob/master/src/Driver.h#L151
> > >
> > > In Arrow 0.16.0 AllocateBuffer took a shared_ptr and this works
> > > fine. In Arrow 0.17.0 AllocateBuffer returns a unique_ptr. Our
> > cast
> > > from ResizableBuffer to Buffer won't work on unique_ptr and we won't be
> > > able to pass the Buffer around so easily.
> > >
> > > I noticed that there is another AllocateBuffer in MemoryManger that
> > returns
> > > a shared_ptr.
> > >
> > >
> >
> https://arrow.apache.org/docs/cpp/api/memory.html?highlight=resizablebuffer#_CPPv4N5arrow13MemoryManager14AllocateBufferE7int64_t
> > > Is this a better alternative to allocate a buffer? Is there a similar
> > > method to allocate a resizable buffer?
> > >
> > > Thank you,
> > > Rares
> > >
> >
>


Re: [C++] [DISCUSS] Moving towards a consistent enum naming scheme

2021-06-04 Thread Benjamin Kietzman
My two cents:

kSomeConstant includes the prefix 'k' so that regardless of what expression
it appears in, it is unambiguously a global constant. I'm restating the
obvious here to clarify that the goal is making names obviously global.

Note that a prefix is redundant for any qualified name (appears to the
right of ::). For example, Type::Boolean unambiguously refers to a global-
static data member or a scoped enum value, since 'Type' is not a legal
namespace.

In light of this my preference would be: public enums must be scoped and
named in CamelCase. The style guide would require kCamelCase but it does so
for consistency with public _unscoped_ enums. IMO we'd be better served by
requiring scoped enums, which obviates the 'k' prefix.

Static data members may be referenced without qualification within the
class' scope, so to ensure they are obviously global even there the 'k'
prefix should still be applied.


Re: [C++][DISCUSS] Implementing interpreted (non-compiled) tests for compute functions

2021-06-08 Thread Benjamin Kietzman
I've added https://issues.apache.org/jira/browse/ARROW-13013
to track moving kernel unit tests to Python since that seems easily
doable and worthwhile

On Sun, May 16, 2021 at 3:35 PM Wes McKinney  wrote:

> I agree there are pros and cons here (up front investment hopefully
> yielding future productivity gains). If a test harness and format
> meeting the requirements could be created without too much pain (I'm
> thinking less than a week of full time effort, including refactoring
> some existing tests), it would save developers a lot of time going
> forward. I look for example at the large number of hand-coded
> functional tests found in this file as one example to use to guide the
> effort:
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
>
> On Sat, May 15, 2021 at 5:02 AM Antoine Pitrou  wrote:
> >
> >
> > I think people who think this would be beneficial should try to devise a
> > text format to represent compute test data.  As Eduardo pointed out,
> > there are various complications that need to be catered for.
> >
> > To me, it's not obvious that building the necessary infrastructure in
> > C++ to ingest that text format will be more pleasant than our current
> > way of writing tests.  As a data point, the JSON integration code in C++
> > is really annoying to maintain.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 15/05/2021 à 00:03, Wes McKinney a écrit :
> > > In C++, we have the "ArrayFromJSON" function which is an even simpler
> > > way of specifying input data compared with the integration tests.
> > > That's one possible starting point.
> > >
> > > The "interpreted tests" could be all specified and driven by minimal
> > > dependency Python code, as one possible way to approach things.
> > >
> > > On Fri, May 14, 2021 at 1:57 PM Jorge Cardoso Leitão
> > >  wrote:
> > >>
> > >> Hi,
> > >>
> > >> (this problem also exists in Rust, btw)
> > >>
> > >> Couldn't we use something like we do for our integration tests?
> Create a
> > >> separate binary that would allow us to call e.g.
> > >>
> > >> test-compute --method equal --json-file  --arg "column1" --arg
> > >> "column2" --expected "column3"
> > >>
> > >> (or simply pass the input via stdin)
> > >>
> > >> and then use Python to call the binary?
> > >>
> > >> The advantage I see here is that we would compile the binary with
> flags to
> > >> disable unnecessary code, use debug, etc, thereby reducing compile
> time if
> > >> the kernel needs
> > >> to be changed.
> > >>
> > >> IMO our "json integration" format is a reliable way of passing data
> across,
> > >> it is very easy to read and write, and all our implementations can
> already
> > >> read it for integration tests.
> > >>
> > >> wrt to the "cross implementation", the equality operation seems a
> natural
> > >> candidate for across implementations checks, as that one has important
> > >> implications in all our integration tests. filter, take, slice,
> boolean ops
> > >> may also be easy to agree upon. "add" and the like are a bit more
> difficult
> > >> due to how overflow should be handled (abort vs saturate vs None), but
> > >> nothing that we can't take. ^_^
> > >>
> > >> Best,
> > >> Jorge
> > >>
> > >> On Fri, May 14, 2021 at 8:25 PM David Li  wrote:
> > >>
> > >>> I think even if it's not (easily) generalizable across languages,
> it'd
> > >>> still be a win for C++ (and hopefully languages that bind to
> > >>> C++). Also, I don't think they're meant to completely replace
> > >>> language-specific tests, but rather complement them, and make it
> > >>> easier to add and maintain tests in the overwhelmingly common case.
> > >>>
> > >>> I do feel it's somewhat painful to write these kinds of tests in C++,
> > >>> largely because of the iteration time and the difficulty of repeating
> > >>> tests across various configurations. I also think this could be an
> > >>> opportunity to leverage things like Hypothesis/property-based testing
> > >>> or perhaps fuzzing to make the kernels even more robust.
> > >>>
> > >>> -David
> > >>>
> > >>> On 2021/05/14 18:09:45, Eduardo Ponce  wrote:
> >  Another aspect to keep in mind is that some tests require internal
> > >>> options
> >  to be changed before executing the compute functions (e.g., check
> > >>> overflow,
> >  allow NaN comparisons, change validity bits, etc.). Also, there are
> tests
> >  that take randomized inputs and others make use of the min/max
> values for
> >  each specific data type. Certainly, these details can be generalized
> > >>> across
> >  languages/testing frameworks but not without careful treatment.
> > 
> >  Moreover, each language implementation still needs to test
> >  language-specific or internal functions, so having a meta test
> framework
> >  will not necessarily get rid of language-specific tests.
> > 
> >  ~Eduardo
> > 
> >  On Fri, May 14, 2021 at 1:56 PM Weston Pace 
> > >>> wrote:
> > 
> > >

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Benjamin Kietzman
As a workaround, the "fill_null" compute function can be used to replace
nulls with nans:

>>> nan = pa.scalar(np.NaN, type=pa.float64())
>>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()

On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Li,
>
> It's correct that arrow uses "None" for null values when converting a
> string array to numpy / pandas.
> As far as I am aware, there is currently no option to control that
> (and to make it use np.nan instead), and I am not sure there would be
> much interest in adding such an option.
>
> Now, I know this doesn't give an exact roundtrip in this case, but
> pandas does treat both np.nan and None as missing values in object
> dtype columns, so behaviour-wise this shouldn't give any difference
> and the roundtrip is still faithful on that aspect.
>
> Best,
> Joris
>
> On Tue, 8 Jun 2021 at 21:59, Li Jin  wrote:
> >
> > Hello!
> >
> > Apologies if this has been brought before. I'd like to get devs' thoughts
> > on this potential inconsistency of "what are the python objects for null
> > values" between pandas and pyarrow.
> >
> > Demonstrated with the following example:
> >
> > (1)  pandas seems to use "np.NaN" to represent a missing value (with
> pandas
> > 1.2.4):
> >
> > In [32]: df
> > Out[32]:
> >            value
> > key
> > 1    some_strign
> >
> > In [33]: df2
> > Out[33]:
> >               value2
> > key
> > 2    some_other_string
> >
> > In [34]: df.join(df2)
> > Out[34]:
> >            value value2
> > key
> > 1    some_strign    NaN
> >
> >
> >
> > (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
> >
> > >>> s = pd.Series(["some_string", np.NaN])
> > >>> s
> > 0    some_string
> > 1            NaN
> > dtype: object
> >
> > >>> pa.Array.from_pandas(s).to_pandas()
> > 0    some_string
> > 1           None
> > dtype: object
> >
> >
> > I have looked around the pyarrow doc and didn't find an option to use
> > np.NaN for null values with to_pandas so it's a bit hard to get around
> trip
> > consistency.
> >
> >
> > I appreciate any thoughts on this as to how to achieve consistency here.
> >
> >
> > Thanks!
> >
> > Li
>


Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Benjamin Kietzman
One improvement in read/writability which might be my favorite is the
removal of SFINAE-controlled template instantiation in favor of compile
time branching with `if constexpr`. Here's an example of that in the draft
PR:

https://github.com/apache/arrow/pull/10414/files#diff-058e32693ee8820a3d8967404e05c76b37ef2f646245f794fa4acb6d26703668R241

This one at least is pretty easy to quantify ("how many instances of SFINAE
could we potentially replace?"):

$ rg enable_if | wc -l
612

On Wed, Jun 9, 2021 at 1:36 PM Antoine Pitrou  wrote:

>
> Le 09/06/2021 à 19:25, Eduardo Ponce a écrit :
> >
> > Measurable metrics:
> > * code size (source and binary) - measured in bytes
> [...]
> >
> > Qualitative metrics:
> > * code structure/maintainability - how would it improve development?
> > * code readability - ease of understanding details for new/current
> > contributors?
>
> These are a bit difficult to evaluate, because a pervasive upgrade to
> more modern C++ idioms would probably bring significant maintenance and
> readability improvements, but it's unlikely to be achieved *before* the
> upgrade is decided (as it would be quite a bit of work).  In the
> submitted PR, we've upgraded a couple places as a validation that C++17
> does improve ease of writing and reading code, but it is unknown how
> many other places could benefit.
>
> Regards
>
> Antoine.
>


Re: [VOTE] Arrow should state a convention for encoding instants as Timestamp with "UTC" as the time zone

2021-06-30 Thread Benjamin Kietzman
+1

On Wed, Jun 30, 2021, 08:54 David Li  wrote:

> +1
>
> On Wed, Jun 30, 2021, at 08:46, Wes McKinney wrote:
> > +1
> >
> > On Wed, Jun 30, 2021 at 12:03 PM Antoine Pitrou 
> wrote:
> > >
> > > +1
> > >
> > >
> > > Le 30/06/2021 à 11:52, Weston Pace a écrit :
> > > > This vote is a result of previous discussion[1][2].  This vote is
> also
> > > > a prerequisite for the PR in [5].
> > > >
> > > > ---
> > > > Some date & time libraries have three temporal concepts.  For the
> sake
> > > > of this document we will call them LocalDateTime, ZonedDateTime, and
> > > > Instant.  An Instant is a timestamp that has no meaningful reference
> > > > time zone (e.g. events that did not occur on Earth or columns of
> > > > timestamps spanning more than one time zone). For more extensive
> > > > definitions and a discussion of their semantics and uses see [3].
> > > > Currently Arrow describes how to encode two of these three concepts
> > > > into a Timestamp column and there is no guideline on how to store an
> > > > Instant.
> > > >
> > > >
> > > > This proposal states that Arrow should recommend that instants be
> encoded
> > > > into timestamp columns by setting the timezone string to "UTC".
> > > > ---
> > > >
> > > > For sample arguments (currently grouped as "for changing schema.fbs"
> > > > and "against changing schema.fbs") see [4].  For a detailed
> definition
> > > > of the terms LocalDateTime, ZonedDateTime, and Instant and a
> > > > discussion of their semantics see [3].  For a straw poll on
> > > > possible ways to handle instants see [2].
> > > >
> > > > This vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Update schema.fbs to state the above convention
> > > > [ ] +0
> > > > [ ] -1 Do not make any change
> > > >
> > > > [1]
> https://lists.apache.org/thread.html/r8216e5de3efd2935e3907ad9bd20ce07e430952f84de69b36337e5eb%40%3Cdev.arrow.apache.org%3E
> > > > [2]
> https://lists.apache.org/thread.html/r1bdffc76537ae9c12c37396880087fee9c0eec9000bf6ed4c9850c44%40%3Cdev.arrow.apache.org%3E
> > > > [3]
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > > [4]
> https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing
> > > > [5] https://github.com/apache/arrow/pull/10629
> > > >
> >
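[Editor's illustration] The three concepts and the proposed convention can be modeled with a toy struct — hypothetical names, not Arrow's actual C++ API — where a timestamp type is just an epoch offset plus a timezone string:

```cpp
#include <cassert>
#include <string>

// Toy model (illustrative only, not Arrow's TimestampType):
// the timezone string is the only metadata distinguishing the concepts.
struct TimestampTypeModel {
  std::string timezone;

  // "" encodes a LocalDateTime: zone-less wall-clock values.
  bool IsLocalDateTime() const { return timezone.empty(); }
  // Any non-empty zone encodes a ZonedDateTime.
  bool IsZonedDateTime() const { return !timezone.empty(); }
  // Under the proposed convention, "UTC" is the designated encoding for
  // an Instant (a timestamp with no meaningful reference time zone).
  bool IsInstantByConvention() const { return timezone == "UTC"; }
};
```

The point of the vote is only to bless the last predicate as a convention; no wire-format change is involved.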


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Benjamin Kietzman
Congrats!

On Fri, Jul 9, 2021, 08:48 Wes McKinney  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted
> an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
>
> Wes
>


Re: [C++] [DISCUSS] Moving towards a consistent enum naming scheme

2021-07-12 Thread Benjamin Kietzman
Relevant to this thread: https://github.com/apache/arrow/pull/10691
adds an enum replacement which includes compile-time to/from strings.
This should reduce repetitive boilerplate in the bindings and elsewhere
mapping to/from user provided string values.

On Mon, Jun 14, 2021 at 11:17 PM Micah Kornfield 
wrote:

> >
> > In light of this my preference would be: public enums must be scoped and
> > named in CamelCase. The style guide would require kCamelCase but it does
> so
> > for consistency with public _unscoped_ enums. IMO we'd be better served
> by
> > requiring scoped enums, which obviates the 'k' prefix.
>
>
> This seems to just add more rules to remember for not that much gain (and
> an additional exception to the style guide).
>
> On Fri, Jun 4, 2021 at 8:48 PM Benjamin Kietzman 
> wrote:
>
> > My two cents:
> >
> > kSomeConstant includes the prefix 'k' so that regardless of what
> expression
> > it appears in, it is unambiguously a global constant. I'm restating the
> > obvious here to clarify that the goal is making names obviously global.
> >
> > Note that a prefix is redundant for any qualified name (appears to the
> > right of ::). For example, Type::Boolean unambiguously refers to a
> global-
> > static data member or a scoped enum value, since 'Type' is not a legal
> > namespace.
> >
> > In light of this my preference would be: public enums must be scoped and
> > named in CamelCase. The style guide would require kCamelCase but it does
> so
> > for consistency with public _unscoped_ enums. IMO we'd be better served
> by
> > requiring scoped enums, which obviates the 'k' prefix.
> >
> > Static data members may be referenced without qualification within the
> > class' scope, so to ensure they are obviously global even there the 'k'
> > prefix should still be applied.
> >
>
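[Editor's illustration] A minimal sketch of the tradeoff being discussed (illustrative names only, not Arrow's actual declarations):

```cpp
#include <cassert>

// Unscoped enum: enumerators leak into the enclosing scope, so the
// style guide's 'k' prefix is what marks them as global constants.
enum UnscopedTypeId { kBoolean, kInt32 };

// Scoped enum: every use must be qualified, so Type::Boolean is already
// unambiguously global and the 'k' prefix becomes redundant.
enum class Type { Boolean, Int32 };

inline int WidthOf(Type t) {
  switch (t) {
    case Type::Boolean: return 1;
    case Type::Int32: return 32;
  }
  return 0;
}
```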


[C++][DISCUSS][STRAW POLL] Enum metaprogramming

2021-07-22 Thread Benjamin Kietzman
Enumerations are frequently useful constructs, but carry some overhead
when used with bindings. Specifically mapping the integral values to string
representations and back and validation of both integral values and strings
requires boilerplate, frequently replicated across multiple languages.

https://github.com/apache/arrow/pull/10712 would replace some of the enums
defined in arrow/compute/api* with a metaprogramming construct which
includes validation, stringification, and parsing from string.

The question at hand is: is it too confusing for users of the C++ API to
replace familiar enums with anything else?

***

For example, one replacement from that PR is

/// - `emit_null`: a null in any input results in a null in the output.
/// - `skip`: nulls in inputs are skipped.
/// - `replace`: nulls in inputs are replaced with the replacement string.
struct NullHandlingBehavior : public EnumType<NullHandlingBehavior> {
  using EnumType::EnumType;
  static constexpr EnumStrings<3> values() { return {"emit_null", "skip", "replace"}; }
  static constexpr const char* name() { return "NullHandlingBehavior"; }
};

which replaces (not shown: deleted boilerplate for stringification,
validation, etc in Python and C++):

enum NullHandlingBehavior {
  /// A null in any input results in a null in the output.
  EMIT_NULL,
  /// Nulls in inputs are skipped.
  SKIP,
  /// Nulls in inputs are replaced with the replacement string.
  REPLACE,
};

Values from an EnumType are constructed at compile time from a string,
so for example we don't lose the ability to write an expressive switch over
their values:

switch (*null_handling) {
  case *NullHandlingBehavior("emit_null"):
    // ...
}
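[Editor's illustration] A simplified, self-contained stand-in for the mechanism — hypothetical code, not the PR's actual EnumType — showing how one constexpr list of names yields validation, stringification, compile-time parsing, and switch labels:

```cpp
#include <array>
#include <cassert>
#include <string_view>

// Simplified sketch (not the real EnumType from PR 10712):
// the value is an index into a constexpr list of names.
struct NullHandling {
  static constexpr std::array<std::string_view, 3> names{"emit_null", "skip", "replace"};

  int index = -1;

  // Compile-time parsing from string: unknown strings yield index == -1.
  constexpr explicit NullHandling(std::string_view repr) {
    for (int i = 0; i < static_cast<int>(names.size()); ++i) {
      if (names[i] == repr) index = i;
    }
  }

  constexpr bool IsValid() const { return index >= 0; }                  // validation
  constexpr std::string_view ToString() const { return names[index]; }   // stringification
  constexpr int operator*() const { return index; }                      // switch support
};

// Since construction is constexpr, values remain usable as case labels:
constexpr int Describe(NullHandling h) {
  switch (*h) {
    case *NullHandling("emit_null"): return 0;
    case *NullHandling("skip"):      return 1;
    case *NullHandling("replace"):   return 2;
    default:                         return -1;
  }
}
```

Bindings would then loop over `names` to expose and validate the string values, rather than duplicating a string table per language.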


Re: [C++][DISCUSS][STRAW POLL] Enum metaprogramming

2021-07-23 Thread Benjamin Kietzman
Alright in light of this I'll rewrite that PR to a traits based approach.
Thanks all

On Fri, Jul 23, 2021 at 9:12 AM Wes McKinney  wrote:

> Like Antoine, I am not sure how comfortable I am with this the way it
> is now. On one hand I see the benefits in the reduction of
> boilerplate. On the other, making the code more "magical" through meta
> constructs likely makes it less accessible to contributors. It also
> makes it so that IDEs (Clion, VSCode, Visual Studio, or anything using
> clang-powered LSP) won't be able to autocomplete enums (please correct
> me if this is not true) — this alone to me is reason enough not to
> make these changes (particularly the new way of writing switch / if
> statements).
>
> On Thu, Jul 22, 2021 at 1:49 PM Felipe Aramburu 
> wrote:
> >
> > It feels like not using regular enums would be optimizing more for the
> > producers of the library than the users that will be consuming it.  It
> > could definitely reduce the burden of writing some boiler plate code but
> > this means that everyone that wants to use Arrow has another concept they
> > have to learn about in order to get started.
> >
> > ~felipe
> >
> >
> >
> > On Thu, Jul 22, 2021 at 11:52 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > As I said on the PR, I'm worried that it will raise eyebrows or even
> > > make people defiant about using our C++ APIs.  Basically, users of a
> > > language are accustomed to certain idioms and often wary of software
> > > packages that redefine their own variants of core language
> > > functionality.  Even I find it slightly unsettling to write switch/case
> > > statements based on pointer values of literal strings...
> > >
> > > This is a social problem (on the technical front, the change seems ok,
> > > and I trust Benjamin that the underlying infrastructure is sound).
> > >
> > > This is why I would favor a mechanism where the reflection abilities
> are
> > > defined "on the side" (using e.g. a EnumTraits<...>), while the public
> > > C++ APIs still use normal enums.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 22/07/2021 à 16:50, Benjamin Kietzman a écrit :
> > > > Enumerations are frequently useful constructs, but carry some
> overhead
> > > > when used with bindings. Specifically mapping the integral values to
> > > string
> > > > representations and back and validation of both integral values and
> > > strings
> > > > requires boilerplate, frequently replicated across multiple
> languages.
> > > >
> > > > https://github.com/apache/arrow/pull/10712 would replace some of the
> > > enums
> > > > defined in arrow/compute/api* with a metaprogramming construct which
> > > > includes validation, stringification, and parsing from string.
> > > >
> > > > The question at hand is: is it too confusing for users of the C++
> API to
> > > > replace familiar enums with anything else?
> > > >
> > > > ***
> > > >
> > > > For example, one replacement from that PR is
> > > >
> > > >  /// - `emit_null`: a null in any input results in a null in the output.
> > > >  /// - `skip`: nulls in inputs are skipped.
> > > >  /// - `replace`: nulls in inputs are replaced with the replacement string.
> > > >  struct NullHandlingBehavior : public EnumType<NullHandlingBehavior> {
> > > >    using EnumType::EnumType;
> > > >    static constexpr EnumStrings<3> values() { return {"emit_null", "skip", "replace"}; }
> > > >    static constexpr const char* name() { return "NullHandlingBehavior"; }
> > > >  };
> > > >
> > > > which replaces (not shown: deleted boilerplate for stringification,
> > > > validation, etc in Python and C++):
> > > >
> > > >  enum NullHandlingBehavior {
> > > >    /// A null in any input results in a null in the output.
> > > >    EMIT_NULL,
> > > >    /// Nulls in inputs are skipped.
> > > >    SKIP,
> > > >    /// Nulls in inputs are replaced with the replacement string.
> > > >    REPLACE,
> > > >  };
> > > >
> > > > Values from an EnumType are constructed at compile time from a
> string,
> > > > so for example we don't lose the ability to write an expressive
> switch
> > > over
> > > > their values:
> > > >
> > > >  switch (*null_handling) {
> > > >    case *NullHandlingBehavior("emit_null"):
> > > >      // ...
> > > >  }
> > > >
> > >
>


Re: [VOTE] Release Apache Arrow 5.0.0 - RC1

2021-07-24 Thread Benjamin Kietzman
+1

verified C++ and Python on Windows 10 with:
dev\release\verify-release-candidate.bat 5.0.0 1

On Thu, Jul 22, 2021 at 11:25 PM Krisztián Szűcs 
wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of Apache
> Arrow version 5.0.0. This is a release consisting of 551
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 4591d76fce2846a29dac33bf01e9ba0337b118e9 [2]
>
> The source release rc1 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9].
> The changelog is located at [10].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [11] for how to validate a release candidate.
>
> Note, please use [12] to verify the Amazon Linux and CentOS packages.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 5.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 5.0.0 because...
>
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%205.0.0
> [2]:
> https://github.com/apache/arrow/tree/4591d76fce2846a29dac33bf01e9ba0337b118e9
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-5.0.0-rc1
> [4]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/5.0.0-rc1
> [8]: https://apache.jfrog.io/artifactory/arrow/python-rc/5.0.0-rc1
> [9]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [10]:
> https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/CHANGELOG.md
> [11]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> [12]: https://github.com/apache/arrow/pull/10786
>


Re: [VOTE] Release Apache Arrow 5.0.0 - RC1

2021-07-25 Thread Benjamin Kietzman
Thanks for that clarification, Matt. Is there a Jira for getting tags right
for Go?

On Sun, Jul 25, 2021, 12:51 Matt Topol  wrote:

> Because the release process is still not tagging the releases appropriately
> for Go, I don't think it's necessary to wait because as soon as this is
> merged to master it will be available for consumption via go get -u
>
> For go, tags need to be added with the release in the form of
> "go/arrow/v5.0.0" to get it to be recognized since the go.mod isn't at the
> root
>
> On Sun, Jul 25, 2021, 12:31 PM Mauricio Vargas 
> wrote:
>
> > +1 I AGREE
> >
> > On Sun, Jul 25, 2021 at 10:04 AM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > wrote:
> >
> > > On Sat, Jul 24, 2021 at 6:46 PM Matt Topol 
> > wrote:
> > > >
> > > > So, looking at the error message in the Go integration tests and
> > looking
> > > at
> > > > the stack trace, I was able to confirm the bug. I'm not sure why it
> > > showed
> > > > up in that run but not subsequent / reproducible but the issue comes
> > down
> > > > to the following:
> > > >
> > > > In decimal128.go in the `FromBigInt` function you have this:
> > > >
> > > > b := v.Bits()
> > > > > n.lo = uint64(b[0])
> > > > >
> > > >
> > > > Turns out that if the passed in `v` is 0, then the length of b is
> also
> > 0,
> > > > so this ends up attempting to access index 0 of an empty slice.
> Simple
> > > fix
> > > > being to return early with 0 if v.BitLen() == 0.
> > > >
> > > > I've put up https://github.com/apache/arrow/pull/10796 to fix the
> > issue.
> > > Thanks Matt for taking a look at this and also for the patch.
> > >
> > > Shall we cut another RC, or could we live with this bug until the next
> > > patch release?
> > > >
> > > > On Sat, Jul 24, 2021 at 5:17 AM Yibo Cai  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Verified C++ and Python on Arm64 Linux (Ubuntu 20.04, aarch64).
> > > > >
> > > > > ARROW_CMAKE_OPTIONS="-DCMAKE_CXX_COMPILER=/usr/bin/clang++-10
> > > > > -DCMAKE_C_COMPILER=/usr/bin/clang-10" TEST_DEFAULT=0 TEST_SOURCE=1
> > > > > TEST_CPP=1 TEST_PYTHON=1 dev/release/verify-release-candidate.sh
> > source
> > > > > 5.0.0 1
> > > > >
> > > > > On 7/23/21 11:25 AM, Krisztián Szűcs wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I would like to propose the following release candidate (RC1) of
> > > Apache
> > > > > > Arrow version 5.0.0. This is a release consisting of 551
> > > > > > resolved JIRA issues[1].
> > > > > >
> > > > > > This release candidate is based on commit:
> > > > > > 4591d76fce2846a29dac33bf01e9ba0337b118e9 [2]
> > > > > >
> > > > > > The source release rc1 is hosted at [3].
> > > > > > The binary artifacts are hosted at [4][5][6][7][8][9].
> > > > > > The changelog is located at [10].
> > > > > >
> > > > > > Please download, verify checksums and signatures, run the unit
> > tests,
> > > > > > and vote on the release. See [11] for how to validate a release
> > > > > candidate.
> > > > > >
> > > > > > Note, please use [12] to verify the Amazon Linux and CentOS
> > packages.
> > > > > >
> > > > > > The vote will be open for at least 72 hours.
> > > > > >
> > > > > > [ ] +1 Release this as Apache Arrow 5.0.0
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not release this as Apache Arrow 5.0.0 because...
> > > > > >
> > > > > > [1]:
> > > > >
> > >
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%205.0.0
> > > > > > [2]:
> > > > >
> > >
> >
> https://github.com/apache/arrow/tree/4591d76fce2846a29dac33bf01e9ba0337b118e9
> > > > > > [3]:
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-5.0.0-rc1
> > > > > > [4]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > > > > > [5]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > > > > > [6]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > > > > [7]:
> https://apache.jfrog.io/artifactory/arrow/nuget-rc/5.0.0-rc1
> > > > > > [8]:
> https://apache.jfrog.io/artifactory/arrow/python-rc/5.0.0-rc1
> > > > > > [9]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > > > > [10]:
> > > > >
> > >
> >
> https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/CHANGELOG.md
> > > > > > [11]:
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > > > > > [12]: https://github.com/apache/arrow/pull/10786
> > > > > >
> > > > >
> > >
> >
> >
> > --
> > —
> > *Mauricio 'Pachá' Vargas Sepúlveda*
> > Site: pacha.dev
> > Blog: pacha.dev/blog
> >
>


Re: C++ Datum::move returns ArrayData not Array

2021-07-27 Thread Benjamin Kietzman
Sorry, that is a typo. I will open a JIRA to fix the doc.

In the meantime, incremented_datum.make_array() should work for you

On Tue, Jul 27, 2021, 11:57 Rares Vernica  wrote:

> Hi,
>
> I'm trying the example in the Compute Functions user guide
> https://arrow.apache.org/docs/cpp/compute.html#invoking-functions
>
> std::shared_ptr<arrow::Array> numbers_array = ...;
> std::shared_ptr<arrow::Scalar> increment = ...;
> arrow::Datum incremented_datum;
> ARROW_ASSIGN_OR_RAISE(incremented_datum,
>                       arrow::compute::CallFunction("add",
>                                                    {numbers_array, increment}));
> std::shared_ptr<arrow::Array> incremented_array =
>     std::move(incremented_datum).array();
>
> and I'm getting this compilation error:
>
> error: conversion from 'const std::shared_ptr<arrow::ArrayData>' to
> non-scalar type 'std::shared_ptr<arrow::Array>' requested
>   std::shared_ptr<arrow::Array> incremented_array =
>       std::move(incremented_datum).array();
>
> I'm using Arrow 3.0.0. Is there a conversion I can make from ArrayData to
> Array? I need the result to be an Array because I'm adding it as a new column
> to a RecordBatch using AddColumn.
>
> Thanks!
> Rares
>


Re: C++ Datum::move returns ArrayData not Array

2021-07-27 Thread Benjamin Kietzman
Opened https://issues.apache.org/jira/browse/ARROW-13462
to track correction of the doc's examples

On Tue, Jul 27, 2021 at 11:59 AM Benjamin Kietzman 
wrote:

> Sorry, that is a typo. I will open a JIRA to fix the doc.
>
> In the meantime, incremented_datum.make_array() should work for you
>
> On Tue, Jul 27, 2021, 11:57 Rares Vernica  wrote:
>
>> Hi,
>>
>> I'm trying the example in the Compute Functions user guide
>> https://arrow.apache.org/docs/cpp/compute.html#invoking-functions
>>
>> std::shared_ptr<arrow::Array> numbers_array = ...;
>> std::shared_ptr<arrow::Scalar> increment = ...;
>> arrow::Datum incremented_datum;
>> ARROW_ASSIGN_OR_RAISE(incremented_datum,
>>                       arrow::compute::CallFunction("add",
>>                                                    {numbers_array, increment}));
>> std::shared_ptr<arrow::Array> incremented_array =
>>     std::move(incremented_datum).array();
>>
>> and I'm getting this compilation error:
>>
>> error: conversion from 'const std::shared_ptr<arrow::ArrayData>' to
>> non-scalar type 'std::shared_ptr<arrow::Array>' requested
>>   std::shared_ptr<arrow::Array> incremented_array =
>>       std::move(incremented_datum).array();
>>
>> I'm using Arrow 3.0.0. Is there a conversion I can make from ArrayData to
>> Array? I need the result to be an Array because I'm adding it as a new column
>> to a RecordBatch using AddColumn.
>>
>> Thanks!
>> Rares
>>
>


Re: [C++][Discuss] Representing null union scalars

2021-07-29 Thread Benjamin Kietzman
Convention 2 seems more correct to me; if a UnionArray cannot
contain top level nulls then a UnionScalar should not be nullable.
Furthermore, I think that it's reasonable for
`MakeNullScalar(some_union_type)->is_valid` to be true, though
the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
should include explicit warnings of the special case which unions
represent.

It's worth noting that convention 1 doesn't round trip through the
scalar. Consider the type

t = dense_union({field("i", int8()), field("b", boolean())},
/*type_codes=*/{4, 8});

If we define an array of this type with a single element like so:

a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");

then the scalar returned by `a.GetScalar(0)` is one of:

conv_1 = { .is_valid = false, .value = nullptr }
conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }

... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
produce a correctly round tripped array with type_codes of 8. On the other
hand, broadcasting `conv_1` produces

ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");

(In general replacing the type code with whichever type code was
declared first.)
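[Editor's illustration] The round-trip difference can be modeled with toy types — assumed simplifications, not Arrow's actual classes:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model: a dense-union element always carries a type_code, and a
// "null" is a valid entry whose chosen child value is null.
struct UnionElement {
  int type_code;
  std::optional<int> value;  // std::nullopt stands in for a null child entry
};

// Convention 1: a null union scalar is simply invalid; the type code is lost.
struct Conv1Scalar {
  bool is_valid = false;
};

// Convention 2: the scalar stays valid and remembers which child was chosen.
struct Conv2Scalar {
  bool is_valid = true;
  int type_code = 0;
};

// Broadcasting back to an array element: convention 1 has no choice but to
// guess, conventionally the first declared type code.
inline UnionElement Broadcast(const Conv1Scalar&, const std::vector<int>& type_codes) {
  return {type_codes.front(), std::nullopt};
}
inline UnionElement Broadcast(const Conv2Scalar& s, const std::vector<int>&) {
  return {s.type_code, std::nullopt};
}
```

With type codes {4, 8}, a null entry in the second child (code 8) round-trips under convention 2 but comes back with code 4 under convention 1.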



On Thu, Jul 29, 2021 at 7:00 AM Antoine Pitrou  wrote:

>
> Hello,
>
> The Scalar base class has a `bool is_valid` member that is used to
> represent null scalars for all types (including the null type).
>
> A UnionScalar, since it inherits from Scalar, has the following de facto
> structure:
>
> {
>   // whether the scalar is null
>   bool is_valid;
>   // the underlying union value (only set if `is_valid` is true)
>   std::shared_ptr<Scalar> value;
> }
>
> However, union arrays don't have a top-level validity bitmap.  A null in
> an union array is always encoded as a "valid" top-level entry with a
> null entry in the chosen child array.
>
> Therefore, there are two possible conventions for representing null
> union scalars:
>
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
>
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
> one the union's child types
>
> Advantages / drawbacks of convention 1
> --
>
> + consistency: `MakeNullScalar(union type)` always returns a scalar with
> `is_valid` set to false
>
> - `union_array.GetScalar(i)` can return a null scalar even when
> `union_array.IsNull(i)` would return false
>
> Advantages / drawbacks of convention 2
> --
>
> + more congruent with the union datatype definition
>
> + `union_array.GetScalar(i)` always returns a valid scalar, just like
> `union_array.IsNull(i)` always returns false
>
> - `MakeNullScalar(union type)` returns a valid scalar (or should it
> return an error??)
>
> - breaks compatibility, since the current behaviour follows convention 1
> (but UnionScalar is probably not widely used...)
>
>
> This came in the context of https://github.com/apache/arrow/pull/10817,
> but is unrelated to the changes in this PR.
>
> Thoughts? Opinions?
>
> Regards
>
> Antoine.
>


Re: [C++][Discuss] Representing null union scalars

2021-07-30 Thread Benjamin Kietzman
> > It's worth noting that convention 1 doesn't round trip

> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.

We'd need to add a special case to MakeArrayFromScalar, which currently
uses MakeArrayOfNull if !Scalar::is_valid
https://github.com/apache/arrow/blob/db4f23f2179540a490789e9715031578b4b91e16/cpp/src/arrow/array/util.cc#L741

However that's doable and then we would indeed have correct round
tripping under convention 1.



On Thu, Jul 29, 2021 at 11:37 AM Antoine Pitrou  wrote:

>
> Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > Convention 2 seems more correct to me; if a UnionArray cannot
> > contain top level nulls then a UnionScalar should not be nullable.
> > Furthermore, I think that it's reasonable for
> > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > should include explicit warnings of the special case which unions
> > represent.
> >
> > It's worth noting that convention 1 doesn't round trip through the
> > scalar. Consider the type
> >
> >  t = dense_union({field("i", int8()), field("b", boolean())},
> > /*type_codes=*/{4, 8});
> >
> > If we define an array of this type with a single element like so:
> >
> >  a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> >
> > then the scalar returned by `a.GetScalar(0)` is one of:
> >
> >  conv_1 = { .is_valid = false, .value = nullptr }
> >  conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> >
> > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > produce a correctly round tripped array with type_codes of 8. On the
> other
> > hand, broadcasting `conv_1` produces
> >
> >  ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> >
> > (In general replacing the type code with whichever type code was
> > declared first.)
>
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
>
> Regards
>
> Antoine.
>


[DISCUSS] Dropping support for Visual Studio 2015

2021-08-09 Thread Benjamin Kietzman
MSVC 19.0 is buggy enough that I for one have spent multiple days
reworking code that is fine for all other compilers we test against.
Most recently in the context of https://github.com/apache/arrow/pull/10793
(ARROW-13482) I found that for some types T,
`std::is_convertible<T, T>::value` will be false. This necessitated the
following (very hacky) workaround:

https://github.com/apache/arrow/pull/10793/commits/c44be29686af6fab2132097aa3cbd430d6ac71fe

(Side note: if anybody has a better solution than that specific hack,
 please don't hesitate to comment on the PR.)

Would it be allowable for us to drop support for this compiler? IIUC
Microsoft is no longer accepting feedback/bug reports for VS2017, let
alone VS2015. Are there any users who depend on libarrow building
with that compiler?


Re: [DISCUSS] Dropping support for Visual Studio 2015

2021-08-13 Thread Benjamin Kietzman
Thanks for commenting, all. I'll open a JIRA/PR to remove support next week.

On Tue, Aug 10, 2021, 09:34 Wes McKinney  wrote:

> +1 for dropping it also.
>
> On Mon, Aug 9, 2021 at 7:03 PM Keith Kraus 
> wrote:
> >
> > +1 as well. Are there any build platforms that we're currently supporting
> > that still use vs2015?
> >
> > Conda-forge did its migration ~1.5 years ago:
> > https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/501.
> >
> > -Keith
> >
> > On Mon, Aug 9, 2021 at 12:01 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > +1 for requiring a more recent MSVC version.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 09/08/2021 à 17:38, Benjamin Kietzman a écrit :
> > > > MSVC 19.0 is buggy enough that I for one have spent multiple days
> > > > reworking code that is fine for all other compilers we test against.
> > > > Most recently in the context of
> > > https://github.com/apache/arrow/pull/10793
> > > > (ARROW-13482) I found that for some types T,
> > > > `std::is_convertible<T, T>::value` will be false. This necessitated
> > > > the following (very hacky) workaround:
> > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/10793/commits/c44be29686af6fab2132097aa3cbd430d6ac71fe
> > > >
> > > >  (Side note: if anybody has a better solution than that specific
> > > hack,
> > > >   please don't hesitate to comment on the PR.)
> > > >
> > > > Would it be allowable for us to drop support for this compiler? IIUC
> > > > Microsoft is no longer accepting feedback/bug reports for VS2017, let
> > > > alone VS2015. Are there any users who depend on libarrow building
> > > > with that compiler?
> > > >
> > >
>


Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-17 Thread Benjamin Kietzman
WRT out-of-band data: if encapsulation is the priority over reuse of
Buffer etc that's straightforward to accommodate by replacement
with an alternative to Buffer. I have made that change to my PR in
https://github.com/apache/arrow/pull/10934/commits/ebd4fc665579dd6bba29c5c4731c2350ea0fa70a

> as much built-in support for "canonical" operators/functions (...) as
possible

IMHO "as much as possible" is not very much; every consumer is going to
have to write extensive validation passes anyway to assert contracts
inexpressible in a flatbuffer schema, for example whether a FieldRef
points to a field which is not present in a Schema. Special casing
for joins can at most provide validation of their arity and
slightly more idiomatic generated accessors for that relation's inputs.

To cite MLIR yet again, it is easier to write the sophisticated pattern
matching required for optimization and lowering passes when the objects
being manipulated are as consistently structured as possible.

In light of these points I'd disagree that special-casing Relations and
Expressions is preferable.
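[Editor's illustration] For instance, a contract like "every field reference must resolve against the input schema" cannot be expressed in a flatbuffer schema, so a consumer-side validation pass — sketched here with hypothetical, simplified types, not the real IR — is needed regardless of how many relations are special-cased:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the IR's types (illustrative only).
struct Schema { std::vector<std::string> field_names; };
struct FieldRef { std::string name; };

// One of the contracts a flatbuffer schema cannot encode: the referenced
// field must actually exist in the input schema.
inline bool ValidateFieldRef(const FieldRef& ref, const Schema& input) {
  return std::find(input.field_names.begin(), input.field_names.end(),
                   ref.name) != input.field_names.end();
}
```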

On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney  wrote:

> Looking at Ben's alternate PR [1], having an IR that leans heavily on
> memory references to an out-of-band data sidecar seems like an
> approach that would substantially ratchet up the implementation
> complexity as producing the IR would then have the level of complexity
> of producing the Arrow IPC format — when producing the "root" Plan
> message, you must accumulate a list of the dependent serialized
> submessages, placing the appropriate Buffer memory offset in the Plan
> message, like we do when producing the RecordBatch.buffers field. This
> seems complicated to me as you must devise a custom binary protocol to
> concatenate the serialized Plan and the messages it depends on into a
> single binary payload
>
> <serialized Plan message>
> <serialized submessage 0>
> <serialized submessage 1>
> ...
>
> (one purpose of FlatBufferBuilder is to spare you having to do this
> yourself — some reasons we do it for the Arrow IPC format is because
> appending Arrow memory buffers directly to a FlatBufferBuilder would
> be inefficient — internal realloc calls — and Flatbuffers are limited
> to 2GB. Neither of these things are problems here)
>
> In general, I believe the output created by an IR producer should be a
> single serialized object without any out-of-band data sidecar — this
> is much simpler for implementers and we can still provide an "escape
> hatch" for user-defined operators and functions where the required
> function/operator is passed opaquely as an embedded binary data.
>
> The serialization format (whether it is Flatbuffers or JSON, or
> something else) should allow for data memoization, so if there is a
> user-defined operator/function, or a relation that is used multiple
> times throughout the query (potentially with a large schema), then we
> should ensure that the data need not be duplicated in the
> serialization format unnecessarily — in Flatbuffers, you can achieve
> this by reusing offsets, but we could devise the data structures to
> make the memoization of "expensive" objects more explicit.
>
> I additionally think that it is important to provide as much built-in
> support for "canonical" operators/functions (such as the ones
> implemented commonly by SQL engines) as possible, and to liberally
> expand the catalogue of "built-in" capabilities. I would still prefer
> to have large unions/enums of built-in operators/functions and to
> expand those unions/enums to accommodate new things when it is
> demonstrated that there is a need to standardize things between
> producers/consumers.
>
> One of the beneficial properties of the Union/Enum approach for the
> operator/function catalogues, is that when there are additions to
> those enums, the generated Flatbuffers files will cause many language
> compilers to warn or error on unhandled enum cases. If all
> function/operator names are strings, then you are essentially
> reimplementing the functionality provided by enums by hand. I
> initially used strings for all function references in my original
> prototype, but I now think that using an enum for "built-ins" would be
> superior (because of the code-generated enum interface) and not a
> premature optimization.
>
> [1]: https://github.com/apache/arrow/pull/10934
>
> On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud  wrote:
> >
> > Hey all,
> >
> > Just wanted to give an update on the effort here.
> >
> > Ben Kietzman has created an alternative proposal to the initial design
> [1].
> > It largely overlaps with the original, but differs in a few important
> ways:
> >
> > * A big focus of the design is on flexibility, allowing producers,
> > consumers and ultimately end users of those systems the ability to define
> > custom operations in the graph.
> > * There are very few predefined relational operations (project, filter,
> > join and a handful of others).
> > * There are only 3 types of value expressions: literals, field
> > references

Re: [VOTE] Restart the Julia implementation with new repository and process

2021-09-26 Thread Benjamin Kietzman
+1 (binding)

On Sun, Sep 26, 2021, 23:08 Micah Kornfield  wrote:

> +1 (binding)
>
> On Sunday, September 26, 2021, Sutou Kouhei  wrote:
>
> > Hi,
> >
> > This vote is to determine if the Arrow PMC is in favor of
> > the Julia community moving the Julia implementation of
> > Apache Arrow out of apache/arrow into apache/arrow-julia.
> >
> > The Julia community uses a process like the Rust community
> > uses [1][2].
> >
> > Here is a summary of the process:
> >
> >   1. Use GitHub instead of JIRA for issue management platform
> >
> >  Note: Contributors will be required to write issues for
> >  planned features and bug fixes so that we have
> >  visibility and opportunities for collaboration before a
> >  PR shows up.
> >
> >  (This is for the Apache way.)
> >
> >  [1]
> >
> >   2. Release on demand
> >
> >  Like DataFusion.
> >
> >  Release for apache/arrow doesn't include the Julia
> >  implementation.
> >
> >  The Julia implementation uses a separate version
> >  scheme. (apache/arrow uses 6.0.0 as the next version
> >  but the next Julia implementation release doesn't use
> >  6.0.0.)
> >
> >  [2]
> >
> > We'll create apache/arrow-julia and start IP clearance
> > process to import JuliaData/Arrow.jl to apache/arrow after
> > the vote is passed. (We don't use julia/arrow/ in
> > apache/arrow.)
> >
> > See also discussions about this: [3][4]
> >
> >
> > Please vote whether to accept the proposal and allow the
> > Julia community to proceed with the work.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 : Accept the proposal
> > [ ] 0 : No opinion
> > [ ] -1 : Reject proposal because...
> >
> >
> > [1] https://docs.google.com/document/d/1TyrUP8_
> > UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit
> > [2] https://github.com/apache/arrow-datafusion/blob/master/
> > dev/release/README.md
> > [3]
> https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801
> > a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
> > [4]
> https://lists.apache.org/x/thread.html/r0df7f44f7e1ed7f6e4352d34047d5
> > 3076208aa78aad308e30b58f83a%40%3Cdev.arrow.apache.org%3E
> >
> >
> > Thanks,
> > --
> > kou
> >
>


Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Benjamin Kietzman
+1 (binding)

On Wed, Aug 24, 2022, 11:32 Antoine Pitrou  wrote:

>
> Hello,
>
> I would like to propose that the Arrow C++ implementation switch to
> C++17 as its baseline supported version (currently C++11).
>
> The rationale and subsequent discussion can be read in the archives here:
> https://lists.apache.org/thread/9g14n3odhj6kzsgjxr6k6d3q73hg2njr
>
> The exact steps and timeline for switching can be decided later on, but
> this proposal implies that it could happen soon, possibly next week :-)
> ... or, more realistically, in the next Arrow C++ release, 10.0.0.
>
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Switch to C++17 in the impending future
> [ ] +0
> [ ] -1 Do not switch to C++17 because...
>
>
> Regards
>
> Antoine.
>


Re: Substrait consumer for custom data sources

2022-09-27 Thread Benjamin Kietzman
It seems to me that your use case could be handled by defining a custom
NamedTableProvider and
assigning this to ConversionOptions::named_table_provider. This was added in
https://github.com/apache/arrow/pull/13613 to provide user configurable
dispatching for named tables;
if it doesn't address your use case then we might want to create a JIRA to
extend it.

On Tue, Sep 27, 2022 at 10:41 AM Li Jin  wrote:

> I did some more digging into this and have some ideas -
>
> Currently, the logic for deserializing named tables is:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> and it will look up named tables from a user provided dictionary from
> string -> arrow Table.
>
> My idea is to make some short term changes to allow named tables to be
> dispatched differently (This logic can be reverted/removed once we figure
> out the proper way to support custom data sources, perhaps via substrait
> Extensions.), specifically:
>
> (1) The user creates named table with uris for custom data source, i.e.,
> "my_datasource://tablename?begin=20200101&end=20210101"
> (2) In the substrait consumer, allowing user to register custom dispatch
> rules based on uri scheme (similar to how exec node registry works), i.e.,
> sth like:
>
> substrait_named_table_registry.add("my_datasource", deser_my_datasource)
> and deser_my_datasource is a function that takes the NamedTable substrait
> message and returns a declaration.
>
> I know doing this just for named tables might not be a very general
> solution but seems the easiest path forward, and we can always remove this
> later in favor of a more generic solution.
>
> Thoughts?
>
> Li
>
>
>
>
>
> On Mon, Sep 26, 2022 at 10:58 AM Li Jin  wrote:
>
> > Hello!
> >
> > I am working on adding a custom data source node in Acero. I have a few
> > previous threads related to this topic.
> >
> > Currently, I am able to register my custom factory method with Acero and
> > create a Custom source node, i.e., I can register and execute this with
> > Acero:
> >
> > MySourceNodeOptions source_options = ...
> > Declaration source{"my_source", source_option}
> >
> > The next step I want to do is to pass this through to the Acero substrait
> > consumer. From previous discussions, I am going to use "NamedTable" as
> a
> > temporary way to define my custom data source in substrait. My question
> is
> > this:
> >
> > What do I need to do in substrait in order to register my own substrait
> > consumer rule/function for deserializing my custom named table protobuf
> > message into the declaration above. If this is not supported right now,
> > what is a reasonable/minimal change to make this work?
> >
> > Thanks,
> > Li
> >
>


Re: Serializing nested pandas dataframes

2020-10-30 Thread Benjamin Kietzman
Hi Adam,

Arrow does not support nesting tables inside other tables. However, a
record batch
is interchangeable with a struct array so you could achieve something
similar
by converting from a RecordBatch with columns `...c` to a StructArray with
child
arrays `...c`. In C++ we have /RecordBatch::{To,From}StructArray/ for this
purpose.
Only from_struct_array is exposed in python but to_struct_array would be a
simple
change to make.

Grouping could then be emulated by sorting the StructArray and wrapping it
in a
ListArray so that each list item contains the rows of a group. (This is
similar to
Impala's interpretation of list and map columns as persistent
joins/groupings
https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_queries
)

Would that be sufficient for your use case?

On Thu, Oct 29, 2020 at 5:19 PM Adam Lippai  wrote:

> This is what I want to extend for multiple tables:
>
> https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
> I would need to come up with a custom binary wrapper for multiple serialized
> pyarrow tables and since Arrow supports hierarchical data to some level, I
> was looking for built-in support of nested tables.
> I understand this might not be available on API level.
>
> Best regards,
> Adam Lippai
>
> On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai  wrote:
>
> > If I have a DataFrame with columns Date, Category, Value and group by
> > Category I'll have multiple DataFrames with Date, Value columns.
> > The result of the groupby is DataFrameGroupBy, which can't be serialized.
> > This is why I tried to assemble a nested DataFrame instead (like the one
> in
> > the SO link previously), but that doesn't work either.
> >
> > As Apache Arrow JS doesn't support groupby (processing the original DF on
> > the client-side), I was thinking of pushing the groupby operation to the
> > server side (pyarrow), doing the groupby in pandas before serializing and
> > sending it to the client.
> > I was wondering whether this (nested arrow tables) is a supported feature
> > or not (by calling chained table.toArray() or similar solution)
> > Currently I process it in pure JS, it's not that ugly, but not really
> > idiomatic either. The lack of a Categorical data type and processing it row
> by
> > row certainly has its perf. price.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> >> Can you give a more specific example of what kind of hierarchical data
> >> you want to serialize? (eg the output of a groupby operation in pandas
> >> typically is still a dataframe that can be converted to pyarrow and
> >> serialized).
> >>
> >> In general, for hierarchical data we have the nested data types (eg
> >> struct type when you nest "multiple columns in a single column").
> >>
> >> Joris
> >>
> >>
> >> On Thu, 29 Oct 2020 at 15:29, Adam Lippai  wrote:
> >> >
> >> > Hi,
> >> >
> >> > is there a way to serialize (IPC) hierarchical tabular data (eg.
> output
> >> of
> >> > pandas groupby) in python?
> >> > I've tried to call pa.ipc.serialize_pandas() on this example, but it
> >> throws
> >> > error:
> >> >
> https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
> >> >
> >> > Best regards,
> >> > Adam Lippai
> >>
> >
>


Re: Using arrow/compute/kernels/*internal.h headers

2020-11-09 Thread Benjamin Kietzman
Hi Niranda,

> unfortunately, I was unable to find a header that exposes a
SumAggregateKernel

Sorry, I misspelled that! Should have written ScalarAggregateKernel, which
is in kernel.h

On Mon, Nov 9, 2020 at 11:09 AM Wes McKinney  wrote:

> On Mon, Nov 9, 2020 at 9:32 AM Niranda Perera 
> wrote:
> >
> > @Ben
> > Thank you very much for the feedback. But unfortunately, I was unable to
> > find a header that exposes a SumAggregateKernel in the v2.0.0. Maybe I am
> > checking it wrong. I remember accessing them in v0.16 IINM.
> >
> > @Wes
> > Yes, that would be great. How about adding a CMake compilation flag for
> > such dev use cases?
> >
>
> This seems like it could cause more problems -- I think it would be
> sufficient to use an "internal::" C++ namespace and always install the
> relevant header file
>
> >
> >
> > On Sun, Nov 8, 2020 at 9:14 PM Wes McKinney  wrote:
> >
> > > I'm not opposed to installing headers that provide access to some of
> > > the kernel implementation internals (with the caveat that changes
> > > won't go through a deprecation cycle, so caveat emptor). It might be
> > > more sustainable to think about what kind of stable-ish public API
> > > could be exported to support applications like Cylon.
> > >
> > > On Sun, Nov 8, 2020 at 10:37 AM Ben Kietzman 
> > > wrote:
> > > >
> > > > Hi Niranda,
> > > >
> > > > SumImpl is a subclass of KernelState. Given a SumAggregateKernel,
> one can
> > > > produce zeroed KernelState using the `init` member, then operate on
> data
> > > > using the `consume`, `merge`, and `finalize` members. You can look at
> > > > ScalarAggExecutor for an example of how to get from a compute
> function to
> > > > kernels and kernel state. Will that work for you?
> > > >
> > > > Ben Kietzman
> > > >
> > > > On Sun, Nov 8, 2020, 11:21 Niranda Perera 
> > > wrote:
> > > >
> > > > > Hi Ben,
> > > > >
> > > > > We are building a distributed table abstraction on top of Arrow
> > > dataframes
> > > > > called Cylon (https://github.com/cylondata/cylon). Currently we
> have a
> > > > > simple aggregation and group-by operation implementation. But we
> felt
> > > like
> > > > > we can give more functionality if we can import arrow kernels and
> > > states to
> > > > > corresponding cylon distributed kernels.
> > > > > Ex: For distributed mean, we would have to communicate the local
> arrow
> > > > > SumState and then do a SumImpl::MergeFrom() and the call Finalize.
> > > > > Is there any other way to access these intermediate states from
> compute
> > > > > operations?
> > > > >
> > > > > On Sun, Nov 8, 2020 at 11:11 AM Ben Kietzman <
> b...@ursacomputing.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Niranda,
> > > > > >
> > > > > > What is the context of your work? if you're working inside the
> arrow
> > > > > > repository you shouldn't need to install headers before using
> them,
> > > and
> > > > > we
> > > > > > welcome PRs for new kernels. Otherwise, could you provide some
> > > details
> > > > > > about how your work is using Arrow as a dependency?
> > > > > >
> > > > > > Ben Kietzman
> > > > > >
> > > > > > On Sun, Nov 8, 2020, 10:57 Niranda Perera <
> niranda.per...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I was wondering if I could use the
> > > arrow/compute/kernels/*internal.h
> > > > > > > headers in my work? I would like to reuse some of the kernel
> > > > > > > implementations and kernel states.
> > > > > > >
> > > > > > > With -DARROW_COMPUTE=ON, those headers are not added into the
> > > include
> > > > > > dir.
> > > > > > > I see that the *internal.h headers are skipped from
> > > > > > > the ARROW_INSTALL_ALL_HEADERS cmake function unfortunately.
> > > > > > >
> > > > > > > Best
> > > > > > > --
> > > > > > > Niranda Perera
> > > > > > > @n1r44 
> > > > > > > +1 812 558 8884 / +94 71 554 8430
> > > > > > > https://www.linkedin.com/in/niranda
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Niranda Perera
> > > > > @n1r44 
> > > > > +1 812 558 8884 / +94 71 554 8430
> > > > > https://www.linkedin.com/in/niranda
> > > > >
> > >
> >
> >
> > --
> > Niranda Perera
> > @n1r44 
> > +1 812 558 8884 / +94 71 554 8430
> > https://www.linkedin.com/in/niranda
>


[DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-11 Thread Benjamin Kietzman
In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
https://github.com/apache/arrow/pull/8023 which port the parquet-mr
design to c++: there's some question whether that design is consistent
with the style and conventions of the c++ implementation of parquet.

Here is a Gist with a sketched alternative design adapted to c++
https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43

Specific concerns in brief:
- Multiple internal classes are left public in header files, where it would
be
  preferred that public classes be kept to a minimum.
- Several levels of explicit synchronization with mutexes are used to
guarantee
  that each class is completely thread safe, but this is unnecessary
overhead
  given the pattern of use of `parquet-cpp`.


Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-15 Thread Benjamin Kietzman
@Gidon

Copied the gist to a google doc for commenting:
https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit#

@Micah

> it would be preferable to put them in an internal namespace to ensure
adequate unit testing is in place.

Agreed; my intent was only to indicate that implementation details should
be private. We can even have
*_internal.h header files for those to make them available for unit testing
as we do with some other
classes. I've added a note of this caveat to the doc.

> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
still be implemented
> with internal synchronization to guarantee thread-safety?

If these are exclusively accessed from the main thread (where they are used
to assemble
FileDecryptionProperties) then there is still no need to synchronize them.

@Tham

> As I see, the returned FileDecryptionProperties object from 
> PropertiesDrivenCryptoFactory
is not mutable,
> since it has a DecryptionKeyRetriever object.

If the DecryptionKeyRetriever accesses the caches directly, then it will
indeed need to be synchronized.
Instead, we can ensure that cache access happens on the main thread before
worker tasks are launched.
After we assemble this ahead-of-time set of keys it will not change during
the course of a read, so the
DecryptionKeyRetriever can safely access it without mutexes. I've added a
comment to the doc

On Fri, Nov 13, 2020 at 3:09 AM Gidon Gershinsky  wrote:

> Hi all,
>
> Glad to see the parquet-cpp progress on this! Can I suggest creating a
> googledoc for the technical discussion? The current md doc format seems to
> be harder for pinpointed comments. I got a few, but they are too minor for
> sending to the two mailing lists.
>
> Cheers, Gidon
>
>
> On Fri, Nov 13, 2020 at 9:17 AM Tham Ha  wrote:
>
>> Hi Ben,
>>
>> Can you reconsider this point:
>> > Properties (including FileDecryptionProperties) are immutable after
>> construction and thus can be safely shared between threads with no further
>> synchronization.
>>
>> As I see, the returned FileDecryptionProperties object from 
>> PropertiesDrivenCryptoFactory
>> is not mutable, since it has a DecryptionKeyRetriever object (i.e. 
>> FileKeyUnwrapper
>> object in this pull) that helps get the key from key metadata.
>> From the current implementation of FileKeyUnwrapper with the cache, I
>> think synchronization is necessary, unless you have another idea.
>>
>> Cheers, Tham
>>
>> On Fri, Nov 13, 2020 at 1:28 PM Micah Kornfield 
>> wrote:
>>
>>> I skimmed through and this seems like a clean design (I would have to
>>> reread the PR to do a comparison.  A few thoughts of the top of my head:
>>>
>>>
>>> > - Multiple internal classes are left public in header files, where it
>>> > would be
>>> >   preferred that public classes be kept to a minimum.
>>>
>>> I think some of these classes are non-trivial.  If that  is the case it
>>> would be preferable to put them in an internal namespace to ensure
>>> adequate unit testing is in place.
>>>
>>> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
>>> still be implemented with internal synchronization to guarantee
>>> thread-safety?
>>>
>>>
>>>
>>> On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
>>> wrote:
>>>
>>> > In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
>>> > https://github.com/apache/arrow/pull/8023 which port the parquet-mr
>>> > design to c++: there's some question whether that design is consistent
>>> > with the style and conventions of the c++ implementation of parquet.
>>> >
>>> > Here is a Gist with a sketched alternative design adapted to c++
>>> > https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
>>> >
>>> > Specific concerns in brief:
>>> > - Multiple internal classes are left public in header files, where it
>>> > would be
>>> >   preferred that public classes be kept to a minimum.
>>> > - Several levels of explicit synchronization with mutexes are used to
>>> > guarantee
>>> >   that each class is completely thread safe, but this is unnecessary
>>> > overhead
>>> >   given the pattern of use of `parquet-cpp`.
>>> >
>>>
>>
>>
>> --
>> Tham Ha Thi
>> C++ Developer
>> *EMOTIV* www.emotiv.com
>> Neurotech for the Global Community
>>

Re: Using arrow/compute/kernels/*internal.h headers

2020-11-17 Thread Benjamin Kietzman
Hi Niranda,

hastebin: That looks generally correct, though I should warn you that a
recent PR
( https://github.com/apache/arrow/pull/8574 ) changed the return type of
DispatchExact to Kernel so you'll need to insert an explicit cast to
ScalarAggregateKernel.

1: This seems like a feature which might be generally useful to consumers of
the compute module, so it's probably worth adding to the KernelState
interface
in some way rather than exposing individual implementations. I've opened
https://issues.apache.org/jira/browse/ARROW-10630 to track this feature

2: I would not expect your container to contain a large number of
KernelStates.
Specifically: why would you need more than one per local thread?
Additionally
for the specific case of aggregation I'd expect that KernelStates not
actively in
use would be `merge`d and no longer stored. With small numbers of instances
I would expect the memory overhead due to polymorphism to be negligible.

On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera 
wrote:

> Hi Ben and Wes,
> Based on our discussion, I did the following.
> https://hastebin.com/ajadonados.cpp
>
> It seems to be working fine. Would love to get your feedback on this! :-)
>
> But I have a couple of concerns.
> 1. Say I want to communicate the intermediate state data across multiple
> processes. Unfortunately, KernelState struct does not expose the data
> pointer to the outside. If say SumState is exposed, we could have accessed
> that data, isn't it? WDYT?
> 2. Polymorphism and virtual functions - Intuitively, a mean aggregation
> intermediate state would be a {T, int64} tuple. But I believe the size of
> struct "SumImpl : public ScalarAggregator (:public KernelState)" would be
> sizeof(T) + sizeof(int64) + sizeof(ScalarAggregator) + sizeof(vptr), isn't
> it? So, if I am managing a compute state container, this means that my
> memory requirement would be higher than simply using a {T, int64} tuple.
> Please correct me if I am wrong. I am not sure if there is a better
> solution to this, but just want to discuss it with you.
>
>
> On Tue, Nov 10, 2020 at 9:44 AM Wes McKinney  wrote:
>
>> Yes, open a Jira and propose a PR implementing the changes you need
>>
>> On Mon, Nov 9, 2020 at 8:31 PM Niranda Perera 
>> wrote:
>> >
>> > @wes How should I proceed with this nevertheless? should I open a JIRA?
>> >
>> > On Mon, Nov 9, 2020 at 11:09 AM Wes McKinney 
>> wrote:
>> >
>> > > On Mon, Nov 9, 2020 at 9:32 AM Niranda Perera <
>> niranda.per...@gmail.com>
>> > > wrote:
>> > > >
>> > > > @Ben
>> > > > Thank you very much for the feedback. But unfortunately, I was
>> unable to
>> > > > find a header that exposes a SumAggregateKernel in the v2.0.0.
>> Maybe I am
>> > > > checking it wrong. I remember accessing them in v0.16 IINM.
>> > > >
>> > > > @Wes
>> > > > Yes, that would be great. How about adding a CMake compilation flag
>> for
>> > > > such dev use cases?
>> > > >
>> > >
>> > > This seems like it could cause more problems -- I think it would be
>> > > sufficient to use an "internal::" C++ namespace and always install the
>> > > relevant header file
>> > >
>> > > >
>> > > >
>> > > > On Sun, Nov 8, 2020 at 9:14 PM Wes McKinney 
>> wrote:
>> > > >
>> > > > > I'm not opposed to installing headers that provide access to some
>> of
>> > > > > the kernel implementation internals (with the caveat that changes
>> > > > > won't go through a deprecation cycle, so caveat emptor). It might
>> be
>> > > > > more sustainable to think about what kind of stable-ish public API
>> > > > > could be exported to support applications like Cylon.
>> > > > >
>> > > > > On Sun, Nov 8, 2020 at 10:37 AM Ben Kietzman <
>> b...@ursacomputing.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > Hi Niranda,
>> > > > > >
>> > > > > > SumImpl is a subclass of KernelState. Given a
>> SumAggregateKernel,
>> > > one can
>> > > > > > produce zeroed KernelState using the `init` member, then
>> operate on
>> > > data
>> > > > > > using the `consume`, `merge`, and `finalize` members. You can
>> look at
>> > > > > > ScalarAggExecutor for an example of how to get from a compute
>> > > function to
>> > > > > > kernels and kernel state. Will that work for you?
>> > > > > >
>> > > > > > Ben Kietzman
>> > > > > >
>> > > > > > On Sun, Nov 8, 2020, 11:21 Niranda Perera <
>> niranda.per...@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi Ben,
>> > > > > > >
>> > > > > > > We are building a distributed table abstraction on top of
>> Arrow
>> > > > > dataframes
>> > > > > > > called Cylon (https://github.com/cylondata/cylon). Currently
>> we
>> > > have a
>> > > > > > > simple aggregation and group-by operation implementation. But
>> we
>> > > felt
>> > > > > like
>> > > > > > > we can give more functionality if we can import arrow kernels
>> and
>> > > > > states to
>> > > > > > > corresponding cylon distributed kernels.
>> > > > > > > Ex: For distributed mean, we would have to communicate the
>> local
>> > > arrow
>> > > > > > > SumState and then do a

Re: Using arrow/compute/kernels/*internal.h headers

2020-11-18 Thread Benjamin Kietzman
1: Excellent!

2: The open JIRA for grouped aggregation is
https://issues.apache.org/jira/browse/ARROW-4124
(though it's out of date since it predates the addition of
ScalarAggregateKernel).
To summarize: for *grouped* aggregation we want the kernel to do the work
of evaluating group
condition(s) and the corresponding KernelState should include any hash
tables and a columnar
(for mean, this would be an array of struct) store of
pre-finalized results.
Pushing this into the kernel yields significant perf wins since layout of
state can be controlled
more freely and we avoid talking across a virtual/type erased interface
inside a hot loop.
A sketch of how ARROW-4124 could be resolved within the
compute::{Function,Kernel} interface:
- add compute functions like "grouped_mean" which are binary.
- the first argument is the array to aggregate while the second is the
grouping condition
- compute kernels for these functions can probably still be
ScalarAggregateKernels (though after
  this we may need to rename them); the interface to implement would be
ScalarAggregator, which
  is in compute/kernels/aggregate_internal.h
- The new ScalarAggregator would be able to reuse existing arrow internals
such as our HashTable
  and could store pre-finalized results in a mutable ArrayData (which would
make finalizing pretty
  trivial).

This returns to your original question about how to get access to arrow
internals like ScalarAggregator
outside the arrow repo. However hopefully after this discussion it seems
feasible and preferable to
add the functionality you need upstream

On Tue, Nov 17, 2020 at 9:58 PM Niranda Perera 
wrote:

> 1. This is great. I will follow this JIRA. (better yet, I'll see if I can
> make that contribution)
>
> 2. If we forget about the multithreading case for a moment, this
> requirement came up while implementing a "groupby + aggregation" operation
> (single-threaded). Let's assume that a table is not sorted. So, the
> simplest approach would be to keep the intermediate state in a container
> (map/vector) and update the state while traversing the table. This approach
> becomes important when there is a large number of groups and there aren't
> enough rows with the same key to use 'consume' vector aggregation (on a
> sorted table).
>
>
>
> On Tue, Nov 17, 2020 at 10:54 AM Benjamin Kietzman 
> wrote:
>
>> Hi Niranda,
>>
>> hastebin: That looks generally correct, though I should warn you that a
>> recent PR
>> ( https://github.com/apache/arrow/pull/8574 ) changed the return type of
>> DispatchExact to Kernel so you'll need to insert an explicit cast to
>> ScalarAggregateKernel.
>>
>> 1: This seems like a feature which might be generally useful to consumers
>> of
>> the compute module, so it's probably worth adding to the KernelState
>> interface
>> in some way rather than exposing individual implementations. I've opened
>> https://issues.apache.org/jira/browse/ARROW-10630 to track this feature
>>
>> 2: I would not expect your container to contain a large number of
>> KernelStates.
>> Specifically: why would you need more than one per local thread?
>> Additionally
>> for the specific case of aggregation I'd expect that KernelStates not
>> actively in
>> use would be `merge`d and no longer stored. With small numbers of
>> instances
>> I would expect the memory overhead due to polymorphism to be negligible.
>>
>> On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera 
>> wrote:
>>
>>> Hi Ben and Wes,
>>> Based on our discussion, I did the following.
>>> https://hastebin.com/ajadonados.cpp
>>>
>>> It seems to be working fine. Would love to get your feedback on this!
>>> :-)
>>>
>>> But I have a couple of concerns.
>>> 1. Say I want to communicate the intermediate state data across multiple
>>> processes. Unfortunately, KernelState struct does not expose the data
>>> pointer to the outside. If say SumState is exposed, we could have accessed
>>> that data, isn't it? WDYT?
>>> 2. Polymorphism and virtual functions - Intuitively, a mean aggregation
>>> intermediate state would be a {T, int64} tuple. But I believe the size of
>>> struct "SumImpl : public ScalarAggregator (:public KernelState)" would be
>>> sizeof(T) + sizeof(int64) + sizeof(ScalarAggregator) + sizeof(vptr),
>>> isn't it? So, if I am managing a compute state container, this means that
>>> my memory requirement would be higher than simply using a {T, int64} tuple.
>>> Please correct me if I am wrong. I am not sure if there is a better
>>> solution to this, but jus

Re: Arrow sync call January 20 at 12:00 US/Eastern, 17:00 UTC

2021-01-20 Thread Benjamin Kietzman
Attendees:
Ben Kietzman
Jonathan Keane
Tiffany Hoi Ling Lam
Rok Mihevc
Weston Pace
Prudhvi Porandla

Discussion:
* Flight SQL proposal (
https://docs.google.com/document/d/1WQz32bDF06GgMdEYyzhakqUigBZkALFwDF2y1x3DTAI
) will be promoted to a PR soon

On Tue, Jan 19, 2021 at 6:59 PM Neal Richardson 
wrote:

> Hi all,
> Reminder that our biweekly call is tomorrow at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be shared with the mailing list afterward.
>
> Neal
>


Re: [VOTE] Release Apache Arrow 3.0.0 - RC2

2021-01-25 Thread Benjamin Kietzman
+1 (non-binding)

Verified c++ and python on Windows

On Mon, Jan 25, 2021 at 11:47 AM Krisztián Szűcs 
wrote:

> On Fri, Jan 22, 2021 at 2:46 PM Neville Dipale 
> wrote:
> >
> > Hi Krisztian,
> >
> > The full output is at
> > https://gist.github.com/nevi-me/88a6279dd90aea30aa4caaa15fb0cc53
> There are failures which seem valid, but didn't occur in the
> automatized verification build.
> >
> > I also ran dev/release/verify-release-candidate-wheels.bat 3.0.0 2
> >
> > Getting the below error, it seems to be a Python 3.7 bug, but I'm not yet
> > finding a solution for it online.
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>pip install
> > pyarrow-3.0.0-cp36-cp36m-win_amd64.whl   || EXIT /B 1
> > Failed to import the site module
> > Traceback (most recent call last):
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 579, in 
> > main()
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 566, in main
> > known_paths = addsitepackages(known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 349, in
> addsitepackages
> > addsitedir(sitedir, known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 207, in addsitedir
> > addpackage(sitedir, name, known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 159, in addpackage
> > f = open(fullname, "r")
> >   File "C:\Users\nevi\anaconda3\lib\_bootlocale.py", line 12, in
> > getpreferredencoding
> > if sys.flags.utf8_mode:
> > AttributeError: 'sys.flags' object has no attribute 'utf8_mode'
> This failure seems to be related to the python installation, the
> wheels are supposed to work in the official windows python containers
> according to the automatized tests [1].
>
> Thanks Neville for running the verification on windows! Sadly we don't
> have other verifications from windows, so I need to assume that those
> failures are environment/compiler specific.
>
> [1]: https://github.com/ursacomputing/crossbow/runs/1725090895
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>if errorlevel 1 GOTO error
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>call deactivate
> > DeprecationWarning: 'deactivate' is deprecated. Use 'conda deactivate'.
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>conda.bat deactivate
> >
> > C:\tmp\arrow-verify-release-wheels>cd C:\Users\nevi\Work\oss\arrow
> >
> > C:\Users\nevi\Work\oss\arrow>EXIT /B 1
> >
> > On Fri, 22 Jan 2021 at 15:14, Krisztián Szűcs  >
> > wrote:
> >
> > > Thanks Neville for testing it!
> > >
> > > There should be more context about the failures above the summary.
> > > Could you please post the errors?
> > >
> > >
> > > On Fri, Jan 22, 2021 at 2:05 PM Neville Dipale 
> > > wrote:
> > > >
> > > > (+0 non-binding)
> > > >
> > > > Getting test failures (see end of my mail).
> > > >
> > > > This is my first time verifying (Windows 10; Insider Preview if
> > > relevant),
> > > > so I'm
> > > > likely missing something in my config. I'll read the verification
> script
> > > > and try again.
> > > >
> > > > I ran the below using PowerShell:
> > > >
> > > > $env:ARROW_GANDIVA=0; $env:ARROW_PLASMA=0; $env:TEST_DEFAULT=0;
> > > > $env:TEST_SOURCE=1; $env:TEST_CPP=1; $env:TEST_PYTHON=1;
> > > > $env:TEST_JAVA=1; $env:TEST_INTEGRATION_CPP=1;
> > > > $env:TEST_INTEGRATION_JAVA=1;
> > > > ./dev/release/verify-release-candidate.bat 3.0.0 2
> > > >
> > > > I had to change cmake generator to use the below (in line 53):
> > > >
> > > > set GENERATOR=Visual Studio 16 2019
> > > >
> > > > VS 2017 wasn't working for me, even after installing its build tools.
> > > >
> > > > 
> > > >
> > > > I have 3 test failures per below:
> > > >
> > > > 94% tests passed, 3 tests failed out of 52
> > > >
> > > > Label Time Summary:
> > > > arrow-tests   =  66.55 sec*proc (30 tests)
> > > > arrow_compute =   3.38 sec*proc (4 tests)
> > > > arrow_dataset =   0.89 sec*proc (9 tests)
> > > > arrow_flight  =   0.92 sec*proc (1 test)
> > > > arrow_python-tests=   0.45 sec*proc (1 test)
> > > > filesystem=   6.36 sec*proc (2 tests)
> > > > parquet-tests =   7.22 sec*proc (7 tests)
> > > > unittest  =  79.41 sec*proc (52 tests)
> > > >
> > > > Total Test time (real) =  80.27 sec
> > > >
> > > > The following tests FAILED:
> > > >  45 - arrow-python-test (Failed)
> > > >  46 - parquet-internals-test (Failed)
> > > >  49 - parquet-arrow-test (Failed)
> > > > Errors while running CTest
> > > >
> > > > I don't know if this has any significance; I got the errors from 2
> runs.
> > > >
> > > > Neville
> > > >
> > > > On Fri, 22 Jan 2021 at 14:22, Neville Dipale 
> > > wrote:
> > > >
> > > > > This is my first time verifying, do I also need to set the env vars
> > > below?
> > > > >
> > > > > ARROW_GANDIV

Re: [C++] Conventions around C++ shared_ptr in the code base?

2021-02-19 Thread Benjamin Kietzman
Thanks for looking into this, Micah.

One convention I'd like to append: we mostly avoid convenience typedefs,
but an exception is the common case of `vector<shared_ptr<{class name}>>`,
for which we allow and encourage the use of `{class name}Vector` typedefs.
(Conversely, nothing should be named /^\w+Vector$/ which isn't a
vector-of-shared_ptr typedef.)
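
A minimal self-contained sketch of this convention (using a stand-in class rather than the real arrow::Array, so the names here are illustrative only):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for a class like arrow::Array.
class Array {};

// The one sanctioned convenience-typedef pattern: {class name}Vector
// names a vector of shared_ptr to that class, and nothing else may
// use the *Vector suffix.
using ArrayVector = std::vector<std::shared_ptr<Array>>;
```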

1. Agreed on Array::data(), in fact I made this change in #9490 (ARROW-9196)

2. Is it worthwhile to keep RecordBatch abstract? I am only aware of a
single concrete
   subclass (SimpleRecordBatch). I agree that RecordBatch's accessors
should avoid
   unnecessary construction of shared_ptrs, and demoting it to a concrete
class would
   make it clear that this is safe.

Ben

On Mon, Feb 15, 2021 at 11:55 PM Micah Kornfield 
wrote:

> (Apologies if this is a double send)
>
> I'll open a PR on this soon. To update the dev guide.
>
> Given this standard, there are a few accessor methods that I think we should
> either convert or create a new accessor that does the correct thing with
> respect to return type.  Given how core these methods are I think the
> latter might be a better approach (but I don't feel too strongly if others
> have a good rationale one way or another):
> 1. Array::Data() [1] - In looking at some CPU profiles it seems like most
> of the time spent in Validate is due to shared_ptr
> construction/destruction.  In auditing the code this method appears to be
> the only one returning copies.
>
> 2. RecordBatch::Column* [2] - These are more questionable since they are
> virtual methods; it is not clear whether dynamic RecordBatches were the
> intention behind this design. So it might not be worth it.  Anecdotally,
> I've known people who have written some naive iteration code use these
> methods where shared_ptr construction/destruction contributed 10% overhead.
>
> Thanks,
> Micah
>
>
>
> [1]
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L163
> [2]
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L98
>
> On Mon, Feb 8, 2021 at 10:09 AM Wes McKinney  wrote:
>
> > Agreed. We should probably document this in the C++ developer docs.
> >
> > On Mon, Feb 8, 2021 at 12:04 PM Antoine Pitrou 
> wrote:
> > >
> > >
> > > Hi Micah,
> > >
> > > That's roughly my mental model as well.
> > >
> > > However, for 4) I would say that returning a const ref to shared_ptr is
> > > preferable because the caller will often need the ownership (especially
> > > with Array, ArrayData, DataType, etc.).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 08/02/2021 à 18:02, Micah Kornfield a écrit :
> > > > I'm not sure how consistent we are with how shared_ptr is used as a
> > > > parameter to methods and as a return type.  In reviewing and writing
> > code
> > > > I've been using these guidelines for myself and I was wondering if
> they
> > > > align with others:
> > > >
> > > > 1.  If a copy of a shared_ptr is not intended to be made by the method,
> > > > use a const ref to the underlying type, i.e. void Foo(const Array& array)
> > > > is preferable to void Foo(const shared_ptr<Array>& array) [1].
> > > >
> > > > 2.  If a copy is always going to be made, pass by value, i.e. void
> > > > Foo(std::shared_ptr<Array> array), and std::move within the method. The
> > > > last time I did research on this, it allowed for eliminating shared_ptr
> > > > overhead if the caller can also std::move() the parameter.
> > > >
> > > > 3.  If a copy might be made, pass the shared_ptr by const reference,
> > > > i.e. void Foo(const shared_ptr<Array>& array). The exception to this is
> > > > if the contents of the shared_ptr can effectively be copied cheaply, as
> > > > is the case with Array via ArrayData, in which case #1 applies.
> > > >
> > > > 4. For accessor methods, prefer returning a const ref to the shared_ptr
> > > > or a const ref to the underlying type when appropriate, i.e.
> > > > const std::shared_ptr<Array>& foo() or const Array& Foo().
> > > >
> > > > 5. For factory-like methods, return a copy, i.e. std::shared_ptr<Foo>
> > > > MakeFoo();
> > > >
> > > > Is this other people's mental model?  I'd like to update our style
> > guide so
> > > > we can hopefully drive consistency over time.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] Array is somewhat of a special case because one can have
> > essentially
> > > > the same shared_ptr copy semantics by copying the underlying
> ArrayData
> > > > object.
> > > >
> >
>
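
As a self-contained illustration of the guidelines quoted above (toy types standing in for Arrow classes — none of these names are the actual Arrow API):

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Array { int length; };

// Guideline 1: no copy of the shared_ptr intended, so take a const ref
// to the underlying type.
int Length(const Array& array) { return array.length; }

class Column {
 public:
  // Guideline 2: a copy is always made, so pass the shared_ptr by value
  // and std::move it; a caller that also moves pays no refcount traffic.
  explicit Column(std::shared_ptr<Array> data) : data_(std::move(data)) {}

  // Guideline 4: accessor returns a const ref, avoiding a refcount bump
  // on every call.
  const std::shared_ptr<Array>& data() const { return data_; }

 private:
  std::shared_ptr<Array> data_;
};

// Guideline 5: factory-like method returns the shared_ptr by value.
std::shared_ptr<Array> MakeArray(int length) {
  return std::make_shared<Array>(Array{length});
}
```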


[DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Benjamin Kietzman
Original discussion at
https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR for
https://issues.apache.org/jira/browse/ARROW-11595 )

Although the format does not specify what is contained in array slots
masked by null bits (for example the first byte in the data buffer of an
int8 array whose first slot is null), there are other considerations which
might motivate establishing conventions for some arrays created by the C++
implementation:
- No spurious complaints from valgrind when running otherwise safe
element-wise compute kernels on values under null bits. In the case of
ARROW-11595, the values buffer of the result of casting from Type::NA to
Type::INT8 is left uninitialized but masked by an entirely-null validity
bitmap. When such an array is passed to a comparison kernel, a branch on
the uninitialized values triggered valgrind even though the results of that
branch were also masked by an empty validity bitmap.
- If the underlying values were allocated but not initialized they may leak
private information such as private keys, passwords, or tokens which were
placed in that memory then freed by an application without overwrite
- Improved compression of data buffers (for example in writing to the IPC
format), since a run of nulls would correspond to consistent, repeated
values in all buffers
- Deterministic output from operations which are unable to honor null
bitmaps, such as computing the checksum of an IPC file


Re: [C++][CI] Make "C++ on s390x" build mandatory?

2021-02-23 Thread Benjamin Kietzman
+1 for making it mandatory

On Tue, Feb 23, 2021, 07:07 Krisztián Szűcs 
wrote:

> Hi!
>
> On Tue, Feb 23, 2021 at 11:53 AM Antoine Pitrou 
> wrote:
> >
> >
> > Hello,
> >
> > For a while we've had a big endian (s390x-based) build on Travis-CI.
> > The build is optional, meaning errors don't actually fail the CI.
> >
> > The build has been reasonably stable for some time apart for some
> > occasional regressions, which often don't get spotted because the build
> > is reported as "green" anyway (because it's optional).
> >
> > I propose we make the build mandatory, to avoid missing further
> > regressions.  What do you think?
> Sounds good to me.
> >
> > Regards
> >
> > Antoine.
>


Re: [VOTE] Allow source-only release vote for patch releases

2021-02-27 Thread Benjamin Kietzman
+1 (non binding)

On Sat, Feb 27, 2021, 11:23 Eric Burden  wrote:

> +1
>
> On Sat, Feb 27, 2021, 10:19 AM Neal Richardson <
> neal.p.richard...@gmail.com>
> wrote:
>
> > We've had some discussion about ways to reduce the cost of releasing and
> > ways to allow maintainers of subprojects to make more frequent
> maintenance
> > releases. In [1] we proposed allowing maintenance/patch releases on which
> > we vote only to approve the source package, unlike our quarterly major
> > releases, where we vote on the source and on most binary packages as
> well.
> > Maintainers of the various implementations and subprojects may choose to
> > build and publish binary artifacts from these patch release sources after
> > the release vote, if there are relevant bug fixes in the patch release.
> > This procedure will allow us to make patch releases more easily, and we
> > maintain our shared mapping between a GitHub tag/commit and a release
> > number across all subprojects.
> >
> > Please vote whether to adopt the patch release procedure. The vote will
> be
> > open for at least 72 hours.
> >
> > [ ] +1 Allow source-only patch release votes
> > [ ] +0
> > [ ] -1 Do not allow source-only patch release votes because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Neal
> >
> > [1]:
> >
> >
> https://lists.apache.org/thread.html/r0ff484bcf8e410730ddcba447ff0610e7138f16d035c43a4015da187%40%3Cdev.arrow.apache.org%3E
> >
>


Re: Question about joining two tables

2021-03-11 Thread Benjamin Kietzman
Hi,

This is not yet implemented but it is on the roadmap for the near future:
https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ

Ben Kietzman

On Thu, Mar 11, 2021 at 12:33 PM Kirill Lykov 
wrote:

> Hi,
>
> Is it possible somehow using existing compute functionality or some other
> code to join two tables by values in a common column?
>
> --
> Best regards,
> Kirill Lykov
>


[DISCUSS][C++] Reduce usage of KernelContext in compute::

2021-03-11 Thread Benjamin Kietzman
KernelContext is a tuple consisting of pointers to an ExecContext and a
KernelState, and an error Status. The context's error Status may be set by compute
kernels (for
example when divide-by-zero would occur) rather than returning a Result as
in the
rest of the codebase. IIUC the intent is to avoid branching on always-ok
Statuses
for kernels which don't have an error condition (for example addition
without overflow
checks).

If there's a motivating performance reason for non standard error
propagation then
we should continue using KernelContext wherever we can benefit from it.
However,
several other APIs (such as Kernel::init) also use a KernelContext to
report errors.
IMHO, it would be better to avoid the added cognitive overhead of handling
errors
through KernelContext outside hot loops which benefit from it.

Am I missing anything? Is there any reason (for example) Kernel::init
shouldn't just
return a Result<std::unique_ptr<KernelState>>?

Ben Kietzman
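
To make the contrast concrete, here is a toy, self-contained sketch of the two error-propagation styles (KernelContext, KernelState, and Result below are simplified stand-ins, not the real arrow::compute types):

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>

struct KernelState { int value; };

// Style 1: the context carries the error; every caller must remember to
// check HasError() after each call.
struct KernelContext {
  void SetStatus(std::string msg) { status_ = std::move(msg); }
  bool HasError() const { return status_.has_value(); }
 private:
  std::optional<std::string> status_;
};

std::unique_ptr<KernelState> InitViaContext(KernelContext* ctx, bool fail) {
  if (fail) {
    ctx->SetStatus("init failed");
    return nullptr;
  }
  return std::make_unique<KernelState>(KernelState{42});
}

// Style 2 (what this thread proposes): the error travels in the return
// value, so the usual RETURN_NOT_OK-style propagation applies uniformly.
template <typename T>
class Result {
 public:
  Result(T value) : value_(std::move(value)) {}
  explicit Result(std::string error) : error_(std::move(error)) {}
  bool ok() const { return !error_.has_value(); }
  T& ValueOrDie() { return *value_; }
 private:
  std::optional<T> value_;
  std::optional<std::string> error_;
};

Result<std::unique_ptr<KernelState>> InitViaResult(bool fail) {
  if (fail) return Result<std::unique_ptr<KernelState>>("init failed");
  return Result<std::unique_ptr<KernelState>>(
      std::make_unique<KernelState>(KernelState{42}));
}
```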


Re: Question about joining two tables

2021-03-11 Thread Benjamin Kietzman
Hi Aldrin,

We don't have a unified repository for design docs that I'm aware of.
Governance-wise only JIRA and the mailing lists are canonical, but
IIUC it'd be legal, straightforward, and beneficial to provide a directory
like the one you describe of "design docs proposed to the ML" or so.

Ben Kietzman

On Thu, Mar 11, 2021 at 3:02 PM Wes McKinney  wrote:

> This is a new document that we just started earlier this week. I'd put
> together some docs in the past to try to bootstrap community
> organization on this, but since we're now finally putting hands to
> code after setting up some critical dependencies (like the Datasets
> interface, which is needed to implement scans, and the array
> function/kernels framework), the documents + supporting Jira issues
> will be more specifically about an implementation plan. We'd welcome
> participation from anyone who is interested in this critical project.
>
> On Thu, Mar 11, 2021 at 1:55 PM Aldrin  wrote:
> >
> > Hi Ben, thanks for the link!
> >
> > I will eventually be interested in this direction as well, but hadn't
> seen
> > this document. Is there a place where these design documents can be
> found?
> > I've seen this and a few other google doc links floating around the
> mailing
> > list but I can't figure out how to navigate to a google drive or a page
> > enumerating the various documents.
> >
> > Thank you!
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> >
> > On Thu, Mar 11, 2021 at 10:07 AM Benjamin Kietzman 
> > wrote:
> >
> > > Hi,
> > >
> > > This is not yet implemented but it is on the roadmap for the near
> future:
> > >
> > >
> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ
> > >
> > > Ben Kietzman
> > >
> > > On Thu, Mar 11, 2021 at 12:33 PM Kirill Lykov 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Is it possible somehow using existing compute functionality or some
> other
> > > > code to join two tables by values in a common column?
> > > >
> > > > --
> > > > Best regards,
> > > > Kirill Lykov
> > > >
> > >
>


Re: [DISCUSS][C++] Reduce usage of KernelContext in compute::

2021-03-12 Thread Benjamin Kietzman
My primary point is that using KernelContext to hold error statuses is
confusing
since there are more places to check for an error condition. In the rest of
the
c++ library we use RETURN_NOT_OK or ARROW_ASSIGN_OR_RAISE to
handle stack unwinding from an error, but in the presence of KernelContext
it's
also necessary to check KernelContext::HasError.

The specific case of a Kernel::init which must allocate actually provides
a good example of this: KernelContext includes helper methods AllocateBuffer
and AllocateBitmap which use the context's memory pool. These return
Result<std::shared_ptr<Buffer>>,
whose status must then be checked and errors propagated using
KernelContext::SetStatus (and not ARROW_ASSIGN_OR_RAISE as
in the rest of the c++ library) since Kernel::init doesn't support status
returns.

IMHO it'd increase readability of kernel code to handle errors uniformly
wherever possible.

On Fri, Mar 12, 2021, 01:00 Yibo Cai  wrote:

> Beside reporting errors, maybe a kernel wants to allocate memory through
> KernelContext::memory_pool [1] in Kernel::init?
> I'm not quite sure if this is a valid case. Would like to hear other
> comments.
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L95
>
> Yibo
>
> On 3/12/21 5:24 AM, Benjamin Kietzman wrote:
> > KernelContext is a tuple consisting of a pointers to an ExecContext and
> > KernelState
> > and an error Status. The context's error Status may be set by compute
> > kernels (for
> > example when divide-by-zero would occur) rather than returning a Result
> as
> > in the
> > rest of the codebase. IIUC the intent is to avoid branching on always-ok
> > Statuses
> > for kernels which don't have an error condition (for example addition
> > without overflow
> > checks).
> >
> > If there's a motivating performance reason for non standard error
> > propagation then
> > we should continue using KernelContext wherever we can benefit from it.
> > However,
> > several other APIs (such as Kernel::init) also use a KernelContext to
> > report errors.
> > IMHO, it would be better to avoid the added cognitive overhead of
> handling
> > errors
> > through KernelContext outside hot loops which benefit from it.
> >
> > Am I missing anything? Is there any reason (for example) Kernel::init
> > shouldn't just
> > return a Result<std::unique_ptr<KernelState>>?
> >
> > Ben Kietzman
> >
>


Re: Question about joining two tables

2021-03-15 Thread Benjamin Kietzman
For future design docs/proposals, I've created:
https://cwiki.apache.org/confluence/display/ARROW/from+the+mailing+list

I'll append to that list as I see new docs come up.

On Thu, Mar 11, 2021 at 10:33 PM Aldrin  wrote:

> Great, thanks for the responses! That all makes sense :)
>
> On Thu, Mar 11, 2021 at 1:29 PM Benjamin Kietzman 
> wrote:
>
> > Hi Aldrin,
> >
> > We don't have a unified repository for design docs that I'm aware of.
> > Governance-wise only JIRA and the mailing lists are canonical, but
> > IIUC it'd be legal, straightforward, and beneficial to provide a
> directory
> > like the one you describe of "design docs proposed to the ML" or so.
> >
> > Ben Kietzman
> >
> > On Thu, Mar 11, 2021 at 3:02 PM Wes McKinney 
> wrote:
> >
> > > This is a new document that we just started earlier this week. I'd put
> > > together some docs in the past to try to bootstrap community
> > > organization on this, but since we're now finally putting hands to
> > > code after setting up some critical dependencies (like the Datasets
> > > interface, which is needed to implement scans, and the array
> > > function/kernels framework), the documents + supporting Jira issues
> > > will be more specifically about an implementation plan. We'd welcome
> > > participation from anyone who is interested in this critical project.
> > >
> > > On Thu, Mar 11, 2021 at 1:55 PM Aldrin 
> > wrote:
> > > >
> > > > Hi Ben, thanks for the link!
> > > >
> > > > I will eventually be interested in this direction as well, but hadn't
> > > seen
> > > > this document. Is there a place where these design documents can be
> > > found?
> > > > I've seen this and a few other google doc links floating around the
> > > mailing
> > > > list but I can't figure out how to navigate to a google drive or a
> page
> > > > enumerating the various documents.
> > > >
> > > > Thank you!
> > > >
> > > > Aldrin Montana
> > > > Computer Science PhD Student
> > > > UC Santa Cruz
> > > >
> > > >
> > > > On Thu, Mar 11, 2021 at 10:07 AM Benjamin Kietzman <
> > bengil...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > This is not yet implemented but it is on the roadmap for the near
> > > future:
> > > > >
> > > > >
> > >
> >
> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ
> > > > >
> > > > > Ben Kietzman
> > > > >
> > > > > On Thu, Mar 11, 2021 at 12:33 PM Kirill Lykov <
> > lykov.kir...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Is it possible somehow using existing compute functionality or
> some
> > > other
> > > > > > code to join two tables by values in a common column?
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Kirill Lykov
> > > > > >
> > > > >
> > >
> >
> --
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>


Re: [DISCUSS][C++] Reduce usage of KernelContext in compute::

2021-03-16 Thread Benjamin Kietzman
I've opened https://issues.apache.org/jira/browse/ARROW-11945
to track this improvement.

On Tue, Mar 16, 2021 at 3:40 PM Wes McKinney  wrote:

> It seems fine to me to return Result from KernelInit (as long as a
> context of some kind is still passed in to the function to have the
> possibility of passing other configuration bits), and it would make
> development more convenient to use a common error handling strategy in
> more places
>
> On Fri, Mar 12, 2021 at 8:52 AM Antoine Pitrou  wrote:
> >
> >
> > I wouldn't mind changing those APIs to return a Status.
> > I'll also note that KernelContext::SetStatus() isn't thread-safe.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 12/03/2021 à 11:40, Benjamin Kietzman a écrit :
> > > My primary point is that using KernelContext to hold error statuses is
> > > confusing
> > > since there are more places to check for an error condition. In the
> rest of
> > > the
> > > c++ library we use RETURN_NOT_OK or ARROW_ASSIGN_OR_RAISE to
> > > handle stack unwinding from an error, but in the presence of
> KernelContext
> > > it's
> > > also necessary to check KernelContext::HasError.
> > >
> > > The specific case of a Kernel::init which must allocate actually
> provides
> > > a good example of this: KernelContext includes helper methods
> AllocateBuffer
> > > and AllocateBitmap which use the context's memory pool. These return
> > > Result<>,
> > > whose status must then be checked and errors propagated using
> > > KernelContext::SetStatus (and not ARROW_ASSIGN_OR_RAISE as
> > > in the rest of the c++ library) since Kernel::init doesn't support
> status
> > > returns.
> > >
> > > IMHO it'd increase readability of kernel code to handle errors
> uniformly
> > > wherever possible.
> > >
> > > On Fri, Mar 12, 2021, 01:00 Yibo Cai  wrote:
> > >
> > >> Beside reporting errors, maybe a kernel wants to allocate memory
> through
> > >> KernelContext::memory_pool [1] in Kernel::init?
> > >> I'm not quite sure if this is a valid case. Would like to hear other
> > >> comments.
> > >>
> > >> [1]
> > >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L95
> > >>
> > >> Yibo
> > >>
> > >> On 3/12/21 5:24 AM, Benjamin Kietzman wrote:
> > >>> KernelContext is a tuple consisting of a pointers to an ExecContext
> and
> > >>> KernelState
> > >>> and an error Status. The context's error Status may be set by compute
> > >>> kernels (for
> > >>> example when divide-by-zero would occur) rather than returning a
> Result
> > >> as
> > >>> in the
> > >>> rest of the codebase. IIUC the intent is to avoid branching on
> always-ok
> > >>> Statuses
> > >>> for kernels which don't have an error condition (for example addition
> > >>> without overflow
> > >>> checks).
> > >>>
> > >>> If there's a motivating performance reason for non standard error
> > >>> propagation then
> > >>> we should continue using KernelContext wherever we can benefit from
> it.
> > >>> However,
> > >>> several other APIs (such as Kernel::init) also use a KernelContext to
> > >>> report errors.
> > >>> IMHO, it would be better to avoid the added cognitive overhead of
> > >> handling
> > >>> errors
> > >>> through KernelContext outside hot loops which benefit from it.
> > >>>
> > >>> Am I missing anything? Is there any reason (for example) Kernel::init
> > >>> shouldn't just
> > >>> return a Result<std::unique_ptr<KernelState>>?
> > >>>
> > >>> Ben Kietzman
> > >>>
> > >>
> > >
>


Re: [DISCUSS][C++] Reduce usage of KernelContext in compute::

2021-03-17 Thread Benjamin Kietzman
Incorrect link last time, sorry:
https://issues.apache.org/jira/browse/ARROW-11990

On Tue, Mar 16, 2021 at 4:19 PM Benjamin Kietzman 
wrote:

> I've opened https://issues.apache.org/jira/browse/ARROW-11945
> to track this improvement.
>
> On Tue, Mar 16, 2021 at 3:40 PM Wes McKinney  wrote:
>
>> It seems fine to me to return Result from KernelInit (as long as a
>> context of some kind of still passed in to the function to have the
>> possibility of passing other configuration bits), and it would make
>> development more convenient to use a common error handling strategy in
>> more places
>>
>> On Fri, Mar 12, 2021 at 8:52 AM Antoine Pitrou 
>> wrote:
>> >
>> >
>> > I wouldn't mind changing those APIs to return a Status.
>> > I'll also note that KernelContext::SetStatus() isn't thread-safe.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > Le 12/03/2021 à 11:40, Benjamin Kietzman a écrit :
>> > > My primary point is that using KernelContext to hold error statuses is
>> > > confusing
>> > > since there are more places to check for an error condition. In the
>> rest of
>> > > the
>> > > c++ library we use RETURN_NOT_OK or ARROW_ASSIGN_OR_RAISE to
>> > > handle stack unwinding from an error, but in the presence of
>> KernelContext
>> > > it's
>> > > also necessary to check KernelContext::HasError.
>> > >
>> > > The specific case of a Kernel::init which must allocate actually
>> provides
>> > > a good example of this: KernelContext includes helper methods
>> AllocateBuffer
>> > > and AllocateBitmap which use the context's memory pool. These return
>> > > Result<>,
>> > > whose status must then be checked and errors propagated using
>> > > KernelContext::SetStatus (and not ARROW_ASSIGN_OR_RAISE as
>> > > in the rest of the c++ library) since Kernel::init doesn't support
>> status
>> > > returns.
>> > >
>> > > IMHO it'd increase readability of kernel code to handle errors
>> uniformly
>> > > wherever possible.
>> > >
>> > > On Fri, Mar 12, 2021, 01:00 Yibo Cai  wrote:
>> > >
>> > >> Beside reporting errors, maybe a kernel wants to allocate memory
>> through
>> > >> KernelContext::memory_pool [1] in Kernel::init?
>> > >> I'm not quite sure if this is a valid case. Would like to hear other
>> > >> comments.
>> > >>
>> > >> [1]
>> > >>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L95
>> > >>
>> > >> Yibo
>> > >>
>> > >> On 3/12/21 5:24 AM, Benjamin Kietzman wrote:
>> > >>> KernelContext is a tuple consisting of a pointers to an ExecContext
>> and
>> > >>> KernelState
>> > >>> and an error Status. The context's error Status may be set by
>> compute
>> > >>> kernels (for
>> > >>> example when divide-by-zero would occur) rather than returning a
>> Result
>> > >> as
>> > >>> in the
>> > >>> rest of the codebase. IIUC the intent is to avoid branching on
>> always-ok
>> > >>> Statuses
>> > >>> for kernels which don't have an error condition (for example
>> addition
>> > >>> without overflow
>> > >>> checks).
>> > >>>
>> > >>> If there's a motivating performance reason for non standard error
>> > >>> propagation then
>> > >>> we should continue using KernelContext wherever we can benefit from
>> it.
>> > >>> However,
>> > >>> several other APIs (such as Kernel::init) also use a KernelContext
>> to
>> > >>> report errors.
>> > >>> IMHO, it would be better to avoid the added cognitive overhead of
>> > >> handling
>> > >>> errors
>> > >>> through KernelContext outside hot loops which benefit from it.
>> > >>>
>> > >>> Am I missing anything? Is there any reason (for example)
>> Kernel::init
>> > >>> shouldn't just
>> > >>> return a Result<std::unique_ptr<KernelState>>?
>> > >>>
>> > >>> Ben Kietzman
>> > >>>
>> > >>
>> > >
>>
>


Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Benjamin Kietzman
Hi Michael,

We are targeting grouped aggregation for 4.0 as part of a general query
engine buildout. We also intend to bring DataFrame functionality into core
Arrow (which would probably include an analog of pandas' pivot_table), but
the query engine work is a prerequisite.

Ben Kietzman

On Fri, Mar 19, 2021, 08:19 Michael Lavina 
wrote:

> Hey Team,
>
> Sorry if this is answered already somewhere; I tried searching emails and
> issues but couldn’t find anything. I am wondering if there is a standard
> way to encode row or column pivots in Arrow?
>
> I know Pandas does it already some way
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
> and there are libraries using Arrow like Perspective that may have their
> internal solution for representation of pivots
> https://perspective.finos.org/docs/md/view.html#row-pivots
>
> I am wondering if there is already a discussion or already a best practice
> or standard for encoding this information. Or alternatively is this not
> something that should be at all associated with Arrow.
>
> -Michael
>
> P.S. If anyone on the Perspective team or anyone who might know is on this
> thread I would be interested in understanding more how Perspective,
> specifically, encodes pivot information in Arrow.
>
>


Re: Wrap Unwarp Scalars in Cython API

2021-03-22 Thread Benjamin Kietzman
I'm not sure what kind of unwrapping you are looking for; would
pyarrow.scalar and Scalar.as_py address your use case? For example,
pa.scalar(128) will wrap that integer into a Scalar

On Mon, Mar 22, 2021, 11:15 Vibhatha Abeykoon  wrote:

> Hello,
>
> Is there a way to wrap and unwrap Scalars using the Cython API?
>
> I am following the docs:
> https://arrow.apache.org/docs/python/extending.html
> But I couldn't find an option. Not sure if I am following the correct docs.
>
> With Regards,
> Vibhatha Abeykoon,
> PhD Candidate | Research Assistant,
> Digital Science Center,
> Luddy School of Informatics Computing and Engineering,
> Indiana University Bloomington,
> Cell : +1-812-955-1394
> Web: https://www.vibhatha.org
> 
>


Re: [C++] Obtain shared_ptr of an Array from a reference

2021-03-22 Thread Benjamin Kietzman
Would MakeArray(array.data()) work for you?
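
To illustrate why this works, here is a toy sketch with stand-ins for arrow::ArrayData, arrow::Array, and arrow::MakeArray (not the real Arrow API): an Array is a thin wrapper over reference-counted ArrayData, so a fresh shared_ptr<Array> over the same data can always be rebuilt from a plain Array&.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

struct ArrayData { int64_t length; };

class Array {
 public:
  explicit Array(std::shared_ptr<ArrayData> data) : data_(std::move(data)) {}
  // The underlying data is exposed with shared ownership intact.
  const std::shared_ptr<ArrayData>& data() const { return data_; }
 private:
  std::shared_ptr<ArrayData> data_;
};

// Rebuild a shared Array wrapper around existing ArrayData.
std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) {
  return std::make_shared<Array>(std::move(data));
}
```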

On Mon, Mar 22, 2021, 23:00 Ying Zhou  wrote:

> Hi,
>
> I know this is a very silly question here but I still prefer to see it
> resolved rather than working on it for a day:
>
> How shall I generate a std::shared_ptr<Array> from an Array&? Just taking
> the address and constructing a shared_ptr from the pointer doesn’t work.
>
> Ying


[DISCUSS][C++] Refactoring of Expression simplification passes

2021-05-05 Thread Benjamin Kietzman
Currently, Expressions (used to specify dataset filters and projections)
are simplified by direct rewriting: a filter such as `alpha == 2 and beta >
3`
on a partition where we are guaranteed that `beta == 5` will be rewritten
to `alpha == 2` before evaluation against scanned batches. This can
potentially occur for each scanned batch: for example, Parquet's row group
statistics are used in the same way to simplify filters.

Rewriting is not extremely expensive (a microbenchmark estimate on
my machine shows that a simple case such as the above takes 4ms).
However, it does complicate execution of a logical plan, wherein
expressions being evaluated are not identical to the expression with
which the plan was constructed.

It seems it might be preferable to avoid mutating expressions and instead
build a mapping from sub expressions to known values which can be used
by subsequent simplification passes and during execution. I'd have to
benchmark it, but it also seems like we might speed up expression
simplification this way. Any thoughts?
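
A hypothetical sketch of the mapping approach (these names are illustrative, not the actual arrow::compute API): instead of rewriting `beta > 3` away when a partition guarantees `beta == 5`, keep the filter intact and consult a side map of guaranteed values during simplification and evaluation.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Map from field name to a value guaranteed (e.g. by partitioning or row
// group statistics) to hold for every row in the current fragment.
using KnownValues = std::map<std::string, int>;

// Simplify a single `field > literal` term: returns the known truth value
// if the field is guaranteed, or nullopt if the term must still be
// evaluated against the scanned batch.
std::optional<bool> SimplifyGreaterThan(const std::string& field, int literal,
                                        const KnownValues& known) {
  auto it = known.find(field);
  if (it == known.end()) return std::nullopt;
  return it->second > literal;
}
```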


Re: [DISCUSS][C++] Refactoring of Expression simplification passes

2021-05-05 Thread Benjamin Kietzman
Sorry, yes: I meant 4 microseconds and not 4 milliseconds.

On Wed, May 5, 2021 at 1:27 PM Antoine Pitrou  wrote:

> On Wed, 5 May 2021 13:23:36 -0400
> Benjamin Kietzman  wrote:
> > Currently, Expressions (used to specify dataset filters and projections)
> > are simplified by direct rewriting: a filter such as `alpha == 2 and
> beta >
> > 3`
> > on a partition where we are guaranteed that `beta == 5` will be rewritten
> > to `alpha == 2` before evaluation against scanned batches. This can
> > potentially occur for each scanned batch: for example, Parquet's row
> group
> > statistics are used in the same way to simplify filters.
> >
> > Rewriting is not extremely expensive (a microbenchmark estimate on
> > my machine shows that a simple case such as the above takes 4ms).
>
> 4ms for a single rewriting actually sounds quite large to me.
> (or did you mean 4µs?)
>
>
>
>


Re: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman

2021-05-07 Thread Benjamin Kietzman
Thanks, all!

On Thu, May 6, 2021, 22:23 Fan Liya  wrote:

> Congratulations, Ben!
>
> Best,
> Liya Fan
>
> On Fri, May 7, 2021 at 4:23 AM Bryan Cutler  wrote:
>
> > Congrats Ben!
> >
> > On Thu, May 6, 2021 at 12:05 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Congratulations Ben :-)
> > >
> > >
> > > Le 06/05/2021 à 21:02, Rok Mihevc a écrit :
> > > > Congrats!
> > > >
> > > > On Thu, May 6, 2021 at 10:49 AM Krisztián Szűcs <
> > > szucs.kriszt...@gmail.com>
> > > > wrote:
> > > >
> > > >> Congrats Ben!
> > > >>
> > > >> On Thu, May 6, 2021 at 9:20 AM Joris Van den Bossche
> > > >>  wrote:
> > > >>>
> > > >>> Congrats!
> > > >>>
> > > >>> On Thu, 6 May 2021 at 07:03, Weston Pace 
> > > wrote:
> > > >>>
> > > >>>> Congratulations Ben!
> > > >>>>
> > > >>>> On Wed, May 5, 2021 at 6:48 PM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Congrats!
> > > >>>>>
> > > >>>>> On Wed, May 5, 2021 at 4:33 PM David Li 
> > wrote:
> > > >>>>>
> > > >>>>>> Congrats Ben! Well deserved.
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> David
> > > >>>>>>
> > > >>>>>> On Wed, May 5, 2021, at 19:22, Neal Richardson wrote:
> > > >>>>>>> Congrats Ben!
> > > >>>>>>>
> > > >>>>>>> Neal
> > > >>>>>>>
> > > >>>>>>> On Wed, May 5, 2021 at 4:16 PM Eduardo Ponce <
> > > >> edponc...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Great news! Congratulations Ben.
> > > >>>>>>>>
> > > >>>>>>>> ~Eduardo
> > > >>>>>>>>
> > > >>>>>>>> 
> > > >>>>>>>> From: Wes McKinney <wesmckinn%40gmail.com>
> > > >>>>>>>>
> > > >>>>>>>> Sent: Wednesday, May 5, 2021, 7:10 PM
> > > >>>>>>>> To: dev
> > > >>>>>>>> Subject: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman
> > > >>>>>>>>
> > > >>>>>>>> The Project Management Committee (PMC) for Apache Arrow has
> > > >> invited
> > > >>>>>>>> Benjamin Kietzman to become a PMC member and we are pleased to
> > > >>>>> announce
> > > >>>>>>>> that Benjamin has accepted.
> > > >>>>>>>>
> > > >>>>>>>> Congratulations and welcome!
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > > >
> > >
> >
>


[jira] [Created] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting

2019-07-22 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6002:


 Summary: [C++][Gandiva] TestCastFunctions does not test int64 
casting
 Key: ARROW-6002
 URL: https://issues.apache.org/jira/browse/ARROW-6002
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Benjamin Kietzman


{{outputs[2]}} (corresponding to the cast from float32) is checked twice 
https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478
 while {{outputs[1]}} (corresponding to the cast from int64) is not checked



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6095) [C++] Python subproject ignores ARROW_TEST_LINKAGE

2019-08-01 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6095:


 Summary: [C++] Python subproject ignores ARROW_TEST_LINKAGE
 Key: ARROW-6095
 URL: https://issues.apache.org/jira/browse/ARROW-6095
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


the python subproject links to arrow_python_shared and other shared libraries 
regardless of ARROW_TEST_LINKAGE 
https://github.com/apache/arrow/blob/eb5dd50/cpp/src/arrow/python/CMakeLists.txt#L131-L132



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6120) [C++][Gandiva] including some headers causes decimal_test to fail

2019-08-02 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6120:


 Summary: [C++][Gandiva] including some headers causes decimal_test 
to fail
 Key: ARROW-6120
 URL: https://issues.apache.org/jira/browse/ARROW-6120
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Benjamin Kietzman


It seems this is due to precompiled code being contaminated with undesired 
headers

For example, {{#include }} in {{arrow/compare.h}} causes:

{code}
[ RUN  ] TestDecimal.TestCastFunctions
../../src/gandiva/tests/decimal_test.cc:478: Failure
Value of: (array_dec)->Equals(outputs[2], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:481: Failure
Value of: (array_dec)->Equals(outputs[2], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:484: Failure
Value of: (array_dec)->Equals(outputs[3], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:497: Failure
Value of: (array_float64)->Equals(outputs[6], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  inf,
  inf,
  -inf,
  -inf
]
[  FAILED  ] TestDecimal.TestCastFunctions (134 ms)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6138) [C++] Add a basic (single RecordBatch) implementation of Dataset

2019-08-05 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6138:


 Summary: [C++] Add a basic (single RecordBatch) implementation of 
Dataset
 Key: ARROW-6138
 URL: https://issues.apache.org/jira/browse/ARROW-6138
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


The simplest case for a Dataset is one which views a single RecordBatch. This 
would be a usefully trivial test implementation and could yield some hints 
about API ergonomics



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6226) [C++] refactor Diff and PrettyPrint to share code

2019-08-13 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6226:


 Summary: [C++] refactor Diff and PrettyPrint to share code
 Key: ARROW-6226
 URL: https://issues.apache.org/jira/browse/ARROW-6226
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Diff reimplements much of PrettyPrint, which didn't quite fit Diff's required 
use case, and slightly changes the output format. Extract the shared code into 
a header {{pretty_print_internal.h}} that both can use, and update the pretty 
print tests to reflect the new format.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6228) [C++] Add context lines to Diff formatting

2019-08-13 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6228:


 Summary: [C++] Add context lines to Diff formatting
 Key: ARROW-6228
 URL: https://issues.apache.org/jira/browse/ARROW-6228
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Diff currently renders only inserted or deleted elements, but context lines can 
be helpful to viewers of the diff. Add an option for a configurable context line 
count to Diff and EqualOptions.
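For a sense of the desired behavior, Python's stdlib difflib already exposes a configurable context-line count. This is only an analogy, not the proposed Arrow API:

```python
import difflib

a = ["alpha", "beta", "gamma", "delta", "epsilon"]
b = ["alpha", "beta", "GAMMA", "delta", "epsilon"]

# n controls how many unchanged lines of context surround each change,
# analogous to the configurable context-line count proposed here for Diff.
diff = list(difflib.unified_diff(a, b, lineterm="", n=1))
```

With `n=1`, the diff keeps the single unchanged line on either side of the change (`beta`, `delta`) as context and omits lines further away.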



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6229) [C++] Add a DataSource implementation which scans a directory

2019-08-13 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6229:


 Summary: [C++] Add a DataSource implementation which scans a 
directory
 Key: ARROW-6229
 URL: https://issues.apache.org/jira/browse/ARROW-6229
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


DirectoryBasedDataSource should scan a directory (optionally recursively) on 
construction, yielding FileBasedDataFragments
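A hedged sketch of the discovery step in Python; the function and parameter names are invented for illustration and do not reflect the C++ API:

```python
import os

def discover(root, suffix=".parquet"):
    """Recursively yield file paths under root; a toy sketch of what a
    directory-based data source's discovery step might do."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)
```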



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6257) [C++] Add fnmatch compatible globbing function

2019-08-15 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6257:


 Summary: [C++] Add fnmatch compatible globbing function
 Key: ARROW-6257
 URL: https://issues.apache.org/jira/browse/ARROW-6257
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


This will be useful for the filesystems module and for datasource discovery, 
which uses it.
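For reference, Python's stdlib fnmatch shows the intended matching semantics; the C++ function proposed in this issue is separate:

```python
from fnmatch import fnmatch

# fnmatch-style globbing, as in Python's standard library; shown only as a
# reference point for the matching semantics the C++ function would offer.
print(fnmatch("part-0.parquet", "*.parquet"))  # True
print(fnmatch("part-0.csv", "*.parquet"))      # False
```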



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6275) [C++] Deprecate RecordBatchReader::ReadNext

2019-08-16 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6275:


 Summary: [C++] Deprecate RecordBatchReader::ReadNext
 Key: ARROW-6275
 URL: https://issues.apache.org/jira/browse/ARROW-6275
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


After ARROW-6161, RecordBatchReader is a refinement of util::Iterator and the 
ReadNext method is redundant and can be deprecated. (util::Iterator provides 
Next, which has identical semantics.)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6299) [C++] Simplify FileFormat classes to singletons

2019-08-20 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6299:


 Summary: [C++] Simplify FileFormat classes to singletons
 Key: ARROW-6299
 URL: https://issues.apache.org/jira/browse/ARROW-6299
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


ParquetFileFormat has no state, so passing it around by shared_ptr 
is not necessary; we could just keep a single static instance and pass raw 
pointers.

[~wesmckinn] is there a case where a FileFormat might have state?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6368) [C++] Add RecordBatch projection functionality

2019-08-27 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6368:


 Summary: [C++] Add RecordBatch projection functionality
 Key: ARROW-6368
 URL: https://issues.apache.org/jira/browse/ARROW-6368
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Define classes RecordBatchProjector (which projects from one schema to another, 
augmenting with null/constant columns where necessary) and a subtype of 
RecordBatchIterator that projects each batch yielded by a wrapped iterator.
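A toy sketch of the projection behavior in Python; the name `project` and the dict-of-columns encoding are invented here, and the actual classes would be C++:

```python
def project(batch, target_fields):
    """Project a batch (a dict of equal-length column lists) onto
    target_fields, filling missing columns with nulls (None) and dropping
    columns not in the target schema."""
    length = len(next(iter(batch.values())))
    return {field: batch.get(field, [None] * length) for field in target_fields}
```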



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6386) [C++][Documentation] Explicit documentation of null slot interpretation

2019-08-29 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6386:


 Summary: [C++][Documentation] Explicit documentation of null slot 
interpretation
 Key: ARROW-6386
 URL: https://issues.apache.org/jira/browse/ARROW-6386
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


To my knowledge, there isn't explicit documentation on how null slots in an 
array should be interpreted. SQL uses Kleene logic, wherein a null is 
explicitly an unknown rather than a special value. This yields for example 
`(null AND false) -> false`, since `(x AND false) -> false` for all possible 
values of x. This is also the behavior of Gandiva's boolean expressions.

By contrast, the boolean kernels implement something closer to the behavior of 
NaN: `(null AND false) -> null`. I think this is simply an error in the boolean 
kernels, but in any case I think explicit documentation should be added to 
prevent future confusion.
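The Kleene semantics described above can be sketched as follows, with None standing in for null (illustrative only, not the kernel implementation):

```python
def kleene_and(a, b):
    """Kleene three-valued AND; None models SQL-style null (unknown)."""
    if a is False or b is False:
        return False  # false dominates: x AND false == false for any x
    if a is None or b is None:
        return None   # otherwise the result depends on the unknown value
    return True
```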





--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6398) [C++] consolidate ScanOptions and ScanContext

2019-08-30 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6398:


 Summary: [C++] consolidate ScanOptions and ScanContext
 Key: ARROW-6398
 URL: https://issues.apache.org/jira/browse/ARROW-6398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Currently ScanOptions has two distinct responsibilities: it contains the data 
selector (and eventually the projection schema) for the current scan, and it 
serves as the base class for format-specific scan options.

In addition, we have ScanContext which holds the memory pool for the current 
scan.

I think these classes should be rearranged as follows: ScanOptions will be 
removed and FileScanOptions will become the abstract base class for 
format-specific scan options. ScanContext will be a concrete struct containing 
the data selector, projection schema, a vector of FileScanOptions, and any 
other shared scan state.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

