Re: [VOTE] Donation of rust arrow2 and parquet2

2021-06-28 Thread Ben Kietzman
+1 (binding)

On Mon, Jun 28, 2021 at 5:35 AM Wes McKinney  wrote:

> +1 (binding)
>
> On Mon, Jun 28, 2021 at 11:08 AM Daniël Heres 
> wrote:
> >
> > +1 (non binding)
> >
> > Great work Jorge!
> >
> > On Mon, Jun 28, 2021, 10:26 Weston Steimel 
> wrote:
> >
> > > +1
> > >
> > > On Sun, 27 Jun 2021, 07:41 Jorge Cardoso Leitão, <
> jorgecarlei...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to bring to this mailing list a proposal to donate the
> > > > source code of arrow2 [1] and parquet2 [2] as experimental
> > > > repositories [3] within Apache Arrow, conditional on IP clearance.
> > > >
> > > > The specific PRs are:
> > > >
> > > > * https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > > * https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > > >
> > > > The source code contains rewrites of the arrow and parquet crates
> > > > with safety and security in mind. In particular,
> > > >
> > > > * no buffer transmutes
> > > > * no unsafe APIs marked as safe
> > > > * parquet's implementation is unsafe-free
> > > >
> > > > There are many other important features, such as big endian support
> > > > and IPC 2.0 support. There is one regression relative to the latest
> > > > release: support for nested types in parquet read and write. I
> > > > observe no negative impact on performance.
> > > >
> > > > See a longer discussion in [4] of the reasons why the current Rust
> > > > implementation is susceptible to safety violations. In particular,
> > > > many core APIs of the crate are considered security vulnerabilities
> > > > under RustSec's [5] definitions, and are difficult to address in its
> > > > current design.
> > > >
> > > > I validated that it is possible to migrate DataFusion [6] and Polars
> > > > [7] without further code changes.
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Accept the code donation as experimental repos.
> > > > [ ] +0
> > > > [ ] -1 Do not accept the code donation as experimental repos because...
> > > >
> > > > [1] https://github.com/jorgecarleitao/arrow2
> > > > [2] https://github.com/jorgecarleitao/parquet2
> > > > [3]
> https://github.com/apache/arrow/blob/master/docs/source/developers/experimental_repos.rst
> > > > [4] https://github.com/jorgecarleitao/arrow2#faq
> > > > [5] https://rustsec.org/
> > > > [6] https://github.com/apache/arrow-datafusion/pull/68
> > > > [7] https://github.com/pola-rs/polars
> > > >
> > >
>


Re: [STRAW POLL] (How) should Arrow define storage for "Instant"s

2021-06-28 Thread Ben Kietzman
C

On Thu, Jun 24, 2021, 18:08 David Li  wrote:

> I would also be in favor of option C, or also E if having that distinction
> in the schema is important to some application.
>
> -David
>
> On Thu, Jun 24, 2021, at 17:16, Andrew Lamb wrote:
> > C
> >
> > On Thu, Jun 24, 2021 at 5:05 PM Rok Mihevc  wrote:
> >
> > > C
> > >
> > > On Thu, Jun 24, 2021 at 9:55 PM Nate Bauernfeind <
> > > natebauernfe...@deephaven.io> wrote:
> > >
> > > > Option C.
> > > >
> > > > On Thu, Jun 24, 2021 at 1:53 PM Joris Peeters <
> > > joris.mg.peet...@gmail.com>
> > > > wrote:
> > > >
> > > > > C
> > > > >
> > > > > On Thu, Jun 24, 2021 at 8:39 PM Antoine Pitrou  >
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Option C.
> > > > > >
> > > > > >
> > > > > > Le 24/06/2021 à 21:24, Weston Pace a écrit :
> > > > > > >
> > > > > > > This proposal states that Arrow should define how to encode an
> > > > > > > Instant into Arrow data.  There are several ways this could
> > > > > > > happen, some of which change schema.fbs and some of which do not.
> > > > > > > ---
> > > > > > >
> > > > > > > For sample arguments (currently grouped as "for changing
> > > > > > > schema.fbs" and "against changing schema.fbs") see [2].  For a
> > > > > > > detailed definition of the terms LocalDateTime, ZonedDateTime,
> > > > > > > and Instant and a discussion of their semantics see [3].
> > > > > > >
> > > > > > > Options:
> > > > > > >
> > > > > > > A) Do nothing, don’t introduce the nuance of “instants” into
> > > > > > > Arrow
> > > > > > > B) Do nothing, but update the comments in schema.fbs to
> > > > > > > acknowledge the existence of the concept and explain that
> > > > > > > implementations are free to decide if/how to support the type.
> > > > > > > C) Define timestamp with timezone “UTC” as “instant”.
> > > > > > > D) Add a first class instant type to schema.fbs
> > > > > > > E) Add instant as a canonical extension type
> > > > > > >
> > > > > > > Note: This is just a straw poll and the results will not be
> > > > > > > binding in any way but will help craft a future vote.  For
> > > > > > > example, if the plurality of votes goes to C but a majority of
> > > > > > > votes is spread across A & B then some flavor of A/B would
> > > > > > > likely be pursued.
> > > > > > >
> > > > > > > Vote for as many options as you would like.
> > > > > > >
> > > > > > > I will summarize and send out the results in 72 hours.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > >
> >
>


Re: [VOTE] Accept donation of Rust Ballista project

2021-03-21 Thread Ben Kietzman
+1 (non binding)

On Sun, Mar 21, 2021, 12:06 Wes McKinney  wrote:

> +1 (binding)
>
> Since Ballista has ~40 contributors but AFAIK no corporation that
> needs to make a Software Grant, it might be worth consulting with the
> Incubator folks to see what kind of due diligence needs to be done so
> we are covering our bases. I doubt it will be practical (or even
> possible) to collect CLAs from everyone.
>
> On Sun, Mar 21, 2021 at 11:56 AM Andy Grove  wrote:
> >
> > Dear all,
> >
> > On behalf of the Ballista community, I would like to propose that we
> > donate Ballista to the Apache Arrow project.
> >
> > Ballista is a distributed scheduler based on Arrow standards (memory
> > format, IPC, Flight) and supports distributed query execution with the
> > DataFusion query engine.
> >
> > The community has had an opportunity to discuss this [1] and there do not
> > seem to be any objections.
> >
> > The code donation is in the form of a pull request:
> >
> > https://github.com/apache/arrow/pull/9723
> >
> > This vote is to determine if the Arrow PMC is in favor of accepting this
> > donation. If the vote passes, the PMC and the authors of the code will
> > work together to complete the ASF IP Clearance process (
> > http://incubator.apache.org/ip-clearance/) and import this Rust codebase
> > into Apache Arrow.
> >
> > [ ] +1 : Accept contribution of Ballista
> > [ ] 0 : No opinion
> > [ ] -1 : Reject contribution because...
> >
> > Here is my vote: +1
> >
> > The vote will be open for at least 72 hours.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1]
> >
> https://lists.apache.org/x/thread.html/r09556898c9c94259c00e35c04ea051040931bbe9ce577cba60c148c8@%3Cdev.arrow.apache.org%3E
>


Re: [C++] Breakpoints and VSCode integration

2021-02-25 Thread Ben Kietzman
Hi Ying,

You could also try the --gtest_break_on_failure flag (or equivalently the
GTEST_BREAK_ON_FAILURE=1 environment variable).

Ben


On Thu, Feb 25, 2021, 05:00 Antoine Pitrou  wrote:

>
> Hi Ying,
>
> Have you tried using the given test executable as a debug target?
> (something like build/debug/arrow-orc-writer.exe)
>
> Also, it has various command-line options to change behaviour and narrow
> down the tests (I suggest trying --gtest_filter=...).
>
> Regards
>
> Antoine.
>
>
> Le 25/02/2021 à 09:22, Ying Zhou a écrit :
> > Hi,
> >
> > To facilitate faster debugging I’d like to integrate make unittest
> debugging into VSCode (on Mac), so that when I run a test that might expose
> a bug, breakpoints can stop the execution and I can dig around a bit. Does
> anyone know how that can be done? I know it is a stupid question but it
> does need to be addressed so that I can finish the ORC writer with
> visitors ASAP.
> >
> > Thanks,
> > Ying
> >
>


Re: Any standard way for min/max values per record-batch?

2021-02-18 Thread Ben Kietzman
Unfortunately FieldNode is a `struct` instead of a `table`, so fields may
not be added or deprecated.
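
For context, a plain-C++ illustration of why (the FieldNode layout below
mirrors Message.fbs; the commentary is mine):

  // A flatbuffers `struct` is serialized inline as a fixed-size POD, so
  // readers locate fields by fixed offset -- appending a field would shift
  // every byte that follows and break old readers:
  struct FieldNode {     // fixed 16-byte layout, forever
    int64_t length;
    int64_t null_count;
  };
  // A flatbuffers `table`, by contrast, is reached through a vtable of
  // per-field offsets, which is what makes append-only schema evolution
  // (per the document Antoine cites below) safe.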

On Thu, Feb 18, 2021, 04:38 Antoine Pitrou  wrote:

>
> Le 18/02/2021 à 04:37, Micah Kornfield a écrit :
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.
>
> Is that actually true?  I may be mistaken, but it seems Flatbuffers
> supports appending new fields to a table while ensuring both backwards
> and forwards compatibility:
> https://google.github.io/flatbuffers/md__schemas.html
>
> (see "schema evolution" at the end of this document)
>
> Regards
>
> Antoine.
>


Re: Using arrow/compute/kernels/*internal.h headers

2020-11-08 Thread Ben Kietzman
Hi Niranda,

SumImpl is a subclass of KernelState. Given a SumAggregateKernel, one can
produce a zeroed KernelState using the `init` member, then operate on data
using the `consume`, `merge`, and `finalize` members. You can look at
ScalarAggExecutor for an example of how to get from a compute function to
kernels and kernel state. Will that work for you?
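
To make the lifecycle concrete, here is a toy stand-in (hypothetical; not
the real arrow::compute classes, whose exact members live in
arrow/compute/kernel.h and vary by version) showing how a distributed mean
maps onto consume/merge/finalize:

  #include <cstdint>
  #include <vector>

  struct SumState {                    // stands in for SumImpl's role
    int64_t sum = 0;
    int64_t count = 0;

    // per-node, like `consume`: fold local data into the state
    void Consume(const std::vector<int32_t>& chunk) {
      for (int32_t v : chunk) { sum += v; ++count; }
    }
    // coordinator, like `merge`: combine states shipped between nodes
    void MergeFrom(const SumState& other) {
      sum += other.sum;
      count += other.count;
    }
    // once at the end, like `finalize`
    double FinalizeMean() const {
      return count == 0 ? 0.0 : static_cast<double>(sum) / count;
    }
  };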

Ben Kietzman

On Sun, Nov 8, 2020, 11:21 Niranda Perera  wrote:

> Hi Ben,
>
> We are building a distributed table abstraction on top of Arrow dataframes
> called Cylon (https://github.com/cylondata/cylon). Currently we have a
> simple aggregation and group-by operation implementation. But we felt we
> could provide more functionality if we could map arrow kernels and states
> to corresponding cylon distributed kernels.
> Ex: For distributed mean, we would have to communicate the local arrow
> SumState, then do a SumImpl::MergeFrom() and call Finalize.
> Is there any other way to access these intermediate states from compute
> operations?
>
> On Sun, Nov 8, 2020 at 11:11 AM Ben Kietzman 
> wrote:
>
> > Hi Niranda,
> >
> > What is the context of your work? If you're working inside the arrow
> > repository you shouldn't need to install headers before using them, and
> > we welcome PRs for new kernels. Otherwise, could you provide some details
> > about how your work is using Arrow as a dependency?
> >
> > Ben Kietzman
> >
> > On Sun, Nov 8, 2020, 10:57 Niranda Perera 
> > wrote:
> >
> > > Hi,
> > >
> > > I was wondering if I could use the arrow/compute/kernels/*internal.h
> > > headers in my work? I would like to reuse some of the kernel
> > > implementations and kernel states.
> > >
> > > With -DARROW_COMPUTE=ON, those headers are not added into the include
> > dir.
> > > I see that the *internal.h headers are skipped from
> > > the ARROW_INSTALL_ALL_HEADERS cmake function unfortunately.
> > >
> > > Best
> > > --
> > > Niranda Perera
> > > @n1r44 <https://twitter.com/N1R44>
> > > +1 812 558 8884 / +94 71 554 8430
> > > https://www.linkedin.com/in/niranda
> > >
> >
>
>
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
>


Re: Using arrow/compute/kernels/*internal.h headers

2020-11-08 Thread Ben Kietzman
Hi Niranda,

What is the context of your work? If you're working inside the arrow
repository you shouldn't need to install headers before using them, and we
welcome PRs for new kernels. Otherwise, could you provide some details
about how your work is using Arrow as a dependency?

Ben Kietzman

On Sun, Nov 8, 2020, 10:57 Niranda Perera  wrote:

> Hi,
>
> I was wondering if I could use the arrow/compute/kernels/*internal.h
> headers in my work? I would like to reuse some of the kernel
> implementations and kernel states.
>
> With -DARROW_COMPUTE=ON, those headers are not added into the include dir.
> I see that the *internal.h headers are skipped from
> the ARROW_INSTALL_ALL_HEADERS cmake function unfortunately.
>
> Best
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
>


Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-21 Thread Ben Kietzman
FWIW boost.coroutine and boost.asio provide composable coroutines,
non-blocking IO, and configurable scheduling for CPU work out of the box.

The boost libraries are not lightweight, but they are robust and
cross-platform, so I think asio is worth consideration.
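
For illustration, a minimal sketch of a stackful asio coroutine
(boost::asio::spawn with a yield_context; exact spawn overloads vary across
Boost versions):

  #include <boost/asio.hpp>
  #include <boost/asio/spawn.hpp>
  #include <chrono>
  #include <iostream>

  int main() {
    boost::asio::io_context io;
    boost::asio::spawn(io, [&](boost::asio::yield_context yield) {
      boost::asio::steady_timer timer(io, std::chrono::milliseconds(10));
      timer.async_wait(yield);  // suspends here without blocking a thread
      std::cout << "resumed after non-blocking wait\n";
    });
    io.run();  // drives all pending coroutines/handlers to completion
  }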

On Sat, Sep 19, 2020 at 8:22 PM Wes McKinney  wrote:

> I took a look at https://github.com/kpamnany/partr and Julia's
> production iteration of that -- kpamnany/partr depends on
> libconcurrent's coroutine implementation which does not work on
> Windows. It appears that Julia is using libuv instead. If we're
> looking for a lighter-weight C coroutine implementation, there is
> http://software.schmorp.de/pkg/libcoro.html, but either way there is
> quite a bit of systems work to create something that can work for
> Arrow.
>
> I don't have an intuition whether depth-first scheduling (what Julia
> is doing) or breadth-first scheduling (aka "work stealing" -- which is
> what Intel's TBB library does [1]) will work better for our use cases.
> But I believe that we need to figure out a programming model (probably
> based on composable futures and continuations given what we are
> already doing) that hides the details of which coroutine/threading
> runtime.
>
> A follow-on project would likely be to define a non-blocking API for
> our various IO interfaces that composes with the rest of the thread
> scheduling machinery.
>
> Either way, this problem is definitely non-trivial so we should figure
> out what "default" approach we can implement that is compatible with
> our "minimal dependency core build" approach in C++ (which may involve
> vendoring some third party code, but not sure if vendoring TBB is a
> good idea) and go and do that. If anyone would like to be funded to
> work on this problem, please get in touch with me offline.
>
> Thanks
> Wes
>
> [1]:
> https://software.intel.com/content/www/us/en/develop/blogs/the-work-isolation-functionality-in-intel-threading-building-blocks-intel-tbb.html
>
> On Sat, Sep 19, 2020 at 5:21 PM Weston Pace  wrote:
> >
> > Ok, my skill with C++ got in the way of my ability to put something
> > together.  First, I did not realize that C++ futures were a little
> > different than the definition I'm used to for futures.  By default,
> > C++ futures are not composable, you can't add continuations with
> > `then`, `when_all` or `when_any`.  There is an extension for this (not
> > sure if it will make it even in C++20) and there are continuations for
> > futures in boost's futures.  However, since arrow is currently using
> > its own future implementation I could not use either of these
> > libraries.  I spent a bit of time trying to add continuations to arrow's
> > future implementation but my lack of skill with C++ got in the way.  I
> > want to keep working on it but it may be a few days.  In the meantime
> > I will try and type up something more complete (with a few diagrams)
> > to explain what I'm intending.
> >
> > Having looked at the code for a while I do have a better sense of what
> > is involved.  I think it would be a pretty extensive set of changes.
> > Also, it looks like C++20 is planning on adopting co-routines which
> > they will be using for sequential async.  So perhaps it makes more
> > sense to go directly to coroutines instead of moving to composable
> > futures and then later to coroutines at some point in the future.
> >
> > Also, re: Julia, I looked into it a bit further and Julia is using
> > libuv under the hood for all file I/O (which is non-blocking I/O).
> > Also async/await are built into the bones of Julia.  As far as I can
> > tell from my brief examination, there is no way to have a Julia
> > task that is performing blocking I/O (in the sense that a "thread pool
> > thread" is blocked on I/O).  You can have blocking I/O in the
> > async/await sense where you are awaiting on I/O to maintain sequential
> > semantics.
> >
> > On Wed, Sep 16, 2020 at 8:10 AM Weston Pace 
> wrote:
> > >
> > > If you want to specifically look at the problem of dataset scanning,
> > > file scanning, and nested parallelism then probably the lowest effort
> > > improvement would be to eliminate the whole idea of "scan threads".
> > > You currently have...
> > >
> > > for (size_t i = 0; i < readers.size(); ++i) {
> > > ARROW_ASSIGN_OR_RAISE(futures[i], pool->Submit(ReadColumnFunc,
> i));
> > > }
> > > Status final_status;
> > > for (auto& fut : futures) {
> > > final_status &= fut.status();
> > > }
> > > // Hiding some follow-up aggregation and the next line is a bit
> abbreviated
> > > return Validate();
> > >
> > > You're already using futures so it would be pretty straightforward to
> > > change that to
> > >
> > > for (size_t i = 0; i < readers.size(); ++i) {
> > > ARROW_ASSIGN_OR_RAISE(futures[i], pool->Submit(ReadColumnFunc,
> i));
> > > }
> > > // Hiding some follow-up aggregation and the next line is a bit
> abbreviated
> > > return
> 

Re: Sort int tuples across Arrow arrays in C++

2020-09-03 Thread Ben Kietzman
Hi Rares,

The arrow API does not currently support sorting against multiple columns.
We'd welcome a JIRA/PR to add that support.

One potential workaround is storing the tuple as a single column of
fixed_size_list(int32, 2), which could then be viewed [1] as int64 (for
which sorting is supported). Would that accommodate your use case?
Ben

[1]:
https://github.com/apache/arrow/blob/e1e3188/cpp/src/arrow/array/array_base.h#L132-L138
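
For concreteness, a rough sketch of that workaround against the current C++
compute API (SortIndices/Take names and signatures may differ in your
version, so treat these calls as approximate):

  #include <arrow/api.h>
  #include <arrow/compute/api.h>

  // View fixed_size_list(int32, 2) as int64, sort, then view back.
  // Caveat: the int64 comparison is over the raw 8 bytes, so on
  // little-endian machines the second int32 lands in the high-order bytes
  // and dominates the ordering; verify this matches the intended
  // lexicographic order before relying on it.
  arrow::Result<std::shared_ptr<arrow::Array>> SortTuples(
      const std::shared_ptr<arrow::Array>& tuples) {
    ARROW_ASSIGN_OR_RAISE(auto as_int64, tuples->View(arrow::int64()));
    ARROW_ASSIGN_OR_RAISE(auto indices,
                          arrow::compute::SortIndices(*as_int64));
    ARROW_ASSIGN_OR_RAISE(auto sorted,
                          arrow::compute::Take(*as_int64, *indices));
    return sorted->View(tuples->type());
  }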

On Thu, Sep 3, 2020 at 8:26 AM Rares Vernica  wrote:

> Hello,
>
> I have a set of integer tuples that need to be collected and sorted at a
> coordinator. Here is an example with tuples of length 2:
>
> [(1, 10),
>  (1, 15),
>  (2, 10),
>  (2, 15)]
>
> I am considering storing each column in an Arrow array, e.g., [1, 1, 2, 2]
> and [10, 15, 10, 15], and have the Arrow arrays grouped in a Record Batch.
> Then I would serialize, transfer, and deserialize each record batch. The
> coordinator would collect all the record batches and concatenate them.
> Finally, the coordinator needs to sort the tuples by value in the
> sequential order of the columns, e.g., (1, 10), (1, 15), (2, 10).
>
> Could I accomplish the sort using the Arrow API? I looked at sort_indices
> but it does not work on record batches. With a set of sort indices for each
> array, sorting the tuples does not seem to be straightforward, right?
>
> Thanks!
> Rares
>


Re: Arrow Dataset API on Ceph

2020-08-31 Thread Ben Kietzman
> as far as we can tell, this filesystem layer
> is unaware of expressions, record batches, etc

You're correct that the filesystem layer doesn't directly support
Expressions. However the datasets API includes the Partitioning classes,
which embed expressions in paths. Depending on what expressions etc. you
need to embed, you could implement a RadosFileSystem class which wraps an IO
context and treats object names as paths. If the RADOS objects contain
arrow-formatted data, then a FileSystemDataset (using IpcFileFormat) can be
constructed which views the IO context and exploits the partitioning
information embedded in object names to accelerate filtering. Does that
accommodate your use case?

> Our main concern is that this new arrow::dataset::RadosFormat class will be
> deriving from the arrow::dataset::FileFormat class, which seems to raise a
> conceptual mismatch as there isn’t really a RADOS format

IIUC RADOS doesn't interact with a filesystem directly, so a RadosFileFormat
would indeed be a conceptually problematic point of extension. If a RADOS
filesystem is not viable then I think the ideal approach would be to
directly implement the Fragment [1] and Dataset [2] interfaces, forgoing a
FileFormat implementation altogether. Unfortunately the only example we have
of this approach is InMemoryFragment, which simply wraps a vector of record
batches.

[1]
https://github.com/apache/arrow/blob/975f4eb/cpp/src/arrow/dataset/dataset.h#L45-L90
[2]
https://github.com/apache/arrow/blob/975f4eb/cpp/src/arrow/dataset/dataset.h#L119-L158
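
To sketch the shape of that approach (the RadosFragment name is hypothetical
and the signatures follow the dataset.h linked above, which has evolved
since, so treat this as a non-working outline):

  // A fragment that defers the actual scan to a RADOS storage node, then
  // surfaces the returned batches the way InMemoryFragment surfaces its
  // vector of record batches.
  class RadosFragment : public arrow::dataset::Fragment {
   public:
    explicit RadosFragment(std::string object_name)
        : object_name_(std::move(object_name)) {}

    // Push filter/projection from `options` down to the storage node,
    // collect the resulting record batches, and wrap them as scan tasks.
    arrow::Result<arrow::dataset::ScanTaskIterator> Scan(
        std::shared_ptr<arrow::dataset::ScanOptions> options,
        std::shared_ptr<arrow::dataset::ScanContext> context) override;

   private:
    std::string object_name_;  // RADOS object backing this fragment
  };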


On Fri, Aug 28, 2020 at 1:27 PM Ivo Jimenez  wrote:

> Hi Antoine
>
> > > Yes, that is our plan. Since this is going to be done on the storage-,
> > > server-side, this would be transparent to the client. So our main
> > > concern is whether this would be OK from the design perspective, and
> > > could this eventually be merged upstream?
> >
> > Arrow datasets have no notion of client and server, so I'm not sure what
> > you mean here.
>
>
> Sorry for the confusion. This is where we see a mismatch between the
> current design and what we are trying to achieve.
>
> Our goal is to push down computations in a cloud storage system. By pushing
> we mean actually sending computation tasks to storage nodes (e.g. filtering
> executed on storage nodes). Ideally this would be done by implementing a
> new plugin for arrow::fs but as far as we can tell, this filesystem layer
> is unaware of expressions, record batches, etc. so this information cannot
> be communicated down to storage.
>
> So what we thought would work is to implement this at the Dataset API
> level, and implement a scanner (and writer) that would defer these
> operations to storage nodes. For example, the RadosScanTask class will ask
> a storage node to actually do a scan and fetch the result, as opposed to
> doing the scan locally.
>
> We would immensely appreciate it if you could let us know if the above is
> OK, or if you think there is a better alternative for accomplishing this,
> as we would rather implement this functionality in a way that is
> compatible with your overall vision.
>
>
> > Do you simply mean contributing RadosFormat to the Arrow
> > codebase?
>
>
> Yes, so that others wanting to achieve this on a Ceph cluster could
> leverage this as well.
>
>
> > I would say that depends on the required dependencies, and
> > ease of testing (and/or CI) for other developers.
>
>
> OK, yes we will pay attention to these aspects as part of an eventual PR.
> We will include tests and ensure that CI covers the changes we introduce.
>
> thanks!
>


Re: Arrow sync call July 22 at 12:00 US/Eastern, 16:00 UTC

2020-07-23 Thread Ben Kietzman
Notes from the biweekly call:

- Discussion of timezone handling (ARROW-9223, ARROW-9528) and related patches
  - https://github.com/apache/arrow/pull/7805 (closed in favor of:)
  - https://github.com/apache/arrow/pull/7816
  - Better documentation on timestamp support is warranted, perhaps existing
doc/docstrings can be refactored/given better visibility:
-
http://arrow.apache.org/docs/cpp/api/datatype.html?highlight=timestamptype#_CPPv4N5arrow13TimestampTypeE
-
https://github.com/apache/arrow/blob/b3c8631/format/Schema.fbs#L236-L240
- Discussion of releasing 1.0 with the same erroneous behavior as in the
  previous release, and following up in a patch release a few weeks afterward

On Wed, Jul 22, 2020 at 11:16 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> Reminder that our biweekly call is coming up at the top of the hour at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterward.
>
> Neal
>


Re: [Discuss] Format to use when casting temporal arrays to string

2020-07-14 Thread Ben Kietzman
string -> date32 is not implemented, AFAICT.

I agree that a timestamp seems ideal; that way string -> date32 should
produce the same result as string -> timestamp -> date32.

A related question: what format would be expected for time32 <->
string?

On Tue, Jul 14, 2020 at 1:19 PM Antoine Pitrou  wrote:

>
> How is the other side (cast string -> date32) implemented?
>
> I would say, ideally a timestamp is accepted.
>
>
> Le 14/07/2020 à 19:07, Ben Kietzman a écrit :
> > When casting (for example) date32 -> string, should the result be the
> > digits of the underlying integer value or a timestamp?
> >
> > For timestamp -> string the format should probably be ISO8601 since that
> > is the format used when casting string -> timestamp (if a different
> > format is used then string -> timestamp -> string will not roundtrip
> > perfectly, which seems counterintuitive).
> >
>


[Discuss] Format to use when casting temporal arrays to string

2020-07-14 Thread Ben Kietzman
When casting (for example) date32 -> string, should the result be the
digits of the underlying integer value or a timestamp?

For timestamp -> string the format should probably be ISO8601 since that is
the format used when casting string -> timestamp (if a different format is
used then string -> timestamp -> string will not roundtrip perfectly, which
seems counterintuitive).
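
For concreteness, a sketch of the roundtrip property in question against the
current C++ compute API (illustrative only; the Cast signature has shifted
across versions):

  #include <arrow/api.h>
  #include <arrow/compute/api.h>

  namespace cp = arrow::compute;

  // If date32 -> string emits an ISO8601 date (e.g. "1970-01-05" rather
  // than the underlying integer "4"), then the reverse cast should
  // round-trip: RoundTrip(dates) should equal the input.
  arrow::Result<std::shared_ptr<arrow::Array>> RoundTrip(
      const std::shared_ptr<arrow::Array>& dates) {
    ARROW_ASSIGN_OR_RAISE(auto strings, cp::Cast(*dates, arrow::utf8()));
    return cp::Cast(*strings, arrow::date32());
  }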


Re: [VOTE] Removing validity bitmap from Arrow union types

2020-06-29 Thread Ben Kietzman
+1 (non binding)

On Tue, Jun 30, 2020, 00:24 Wes McKinney  wrote:

> +1 (binding)
>
> On Mon, Jun 29, 2020 at 11:09 PM Micah Kornfield 
> wrote:
> >
> > +1 (binding) (I had a couple of nits on language, that I put in the PR)
> >
> > On Mon, Jun 29, 2020 at 2:24 PM Wes McKinney 
> wrote:
> >
> > > Hi,
> > >
> > > As discussed on the mailing list [1], it has been proposed to remove
> > > the validity bitmap buffer from Union types in the columnar format
> > > specification and instead let value validity be determined exclusively
> > > by constituent arrays of the union.
> > >
> > > One of the primary motivations for this is to simplify the creation of
> > > unions, since constructing a validity bitmap that merges the
> > > information contained in the child arrays' bitmaps is quite
> > > complicated.
> > >
> > > Note that this change breaks IPC forward compatibility for union types,
> > > however implementations with hitherto spec-compliant union
> > > implementations would be able to (at their discretion, of course)
> > > preserve backward compatibility for deserializing "old" union data in
> > > the case that the parent null count of the union is zero. The expected
> > > impact of this breakage is low, particularly given that Unions have
> > > been absent from integration testing and thus not recommended for
> > > anything but ephemeral serialization.
> > >
> > > Under the assumption that the MetadataVersion V4 -> V5 version bump is
> > > accepted, in order to protect against forward compatibility problems,
> > > Arrow implementations would be forbidden from serializing union types
> > > using the MetadataVersion::V4.
> > >
> > > A PR with the changes to Columnar.rst is at [2].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Accept changes to Columnar.rst (removing union validity bitmaps)
> > > [ ] +0
> > > [ ] -1 Do not accept changes because...
> > >
> > > [1]:
> > >
> https://lists.apache.org/thread.html/r889d7532cf1e1eff74b072b4e642762ad39f4008caccef5ecde5b26e%40%3Cdev.arrow.apache.org%3E
> > > [2]: https://github.com/apache/arrow/pull/7535
> > >
>


Re: [VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 release

2020-06-29 Thread Ben Kietzman
+1 (non binding)

On Tue, Jun 30, 2020, 00:25 Wes McKinney  wrote:

> +1 (binding)
>
> On Mon, Jun 29, 2020 at 10:49 PM Micah Kornfield 
> wrote:
> >
> > +1 (binding)
> >
> > On Mon, Jun 29, 2020 at 2:43 PM Wes McKinney 
> wrote:
> >
> > > Hi,
> > >
> > > As discussed on the mailing list [1], in order to demarcate the
> > > pre-1.0.0 and post-1.0.0 worlds, and to allow the
> > > forward-compatibility-protection changes we are making to actually
> > > work (i.e. so that libraries can recognize that they have received
> > > data with a feature that they do not support), I have proposed to
> > > increment the MetadataVersion from V4 to V5. Additionally, if the
> > > union validity bitmap changes are accepted, the MetadataVersion could
> > > be used to control whether unions are permitted to be serialized or
> > > not (with V4 -- used by v0.8.0 to v0.17.1, unions would not be
> > > permitted).
> > >
> > > Since there have been no backward incompatible changes to the Arrow
> > > format since 0.8.0, this would be no different, and (aside from the
> > > union issue) libraries supporting V5 are expected to accept BOTH V4
> > > and V5 so that backward compatibility is not broken, and any
> > > serialized data from prior versions of the Arrow libraries (0.8.0
> > > onward) will continue to be readable.
> > >
> > > Implementations are recommended, but not required, to provide an
> > > optional "V4 compatibility mode" for forward compatibility
> > > (serializing data from >= 1.0.0 that needs to be readable by older
> > > libraries, e.g. Spark deployments stuck on an older Java-Arrow
> > > version). In this compatibility mode, non-forward-compatible features
> > > added in 1.0.0 and beyond would not be permitted.
> > >
> > > A PR with the changes to Schema.fbs (possibly subject to some
> > > clarifying changes to the comments) is at [2].
> > >
> > > Once the PR is merged, it will be necessary for implementations to be
> > > updated and tested as appropriate at minimum to validate that backward
> > > compatibility is preserved (i.e. V4 IPC payloads are still readable --
> > > we have some in apache/arrow-testing and can add more as needed).
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Accept addition of MetadataVersion::V5 along with its general
> > > implications above
> > > [ ] +0
> > > [ ] -1 Do not accept because...
> > >
> > > [1]:
> > >
> https://lists.apache.org/thread.html/r856822cc366d944b3ecdf32c2ea9b1ad8fc9d12507baa2f2840a64b6%40%3Cdev.arrow.apache.org%3E
> > > [2]: https://github.com/apache/arrow/pull/7566
> > >
>


Re: [VOTE] Permitting unsigned integers for Arrow dictionary indices

2020-06-29 Thread Ben Kietzman
+1 (non binding)

On Mon, Jun 29, 2020, 18:00 Wes McKinney  wrote:

> Hi,
>
> As discussed on the mailing list [1], it has been proposed to allow
> the use of unsigned dictionary indices (which is already technically
> possible in our metadata serialization, but not allowed according to
> the language of the columnar specification), with the following
> caveats:
>
> * Unless part of an application's requirements (e.g. if it is
> necessary to store dictionaries with size 128 to 255 more compactly),
> implementations are recommended to prefer signed over unsigned
> integers, with int32 continuing to be the "default" when the indexType
> field of DictionaryEncoding is null
> * uint64 dictionary indices, while permitted, are strongly discouraged
> unless required by an application, as they are more
> difficult to work with in some programming languages (e.g. Java) and
> they do not offer the storage size benefits that uint8 and uint16 do.
>
> This change is backwards compatible, but not forward compatible for
> all implementations (for example, C++ will reject unsigned integers).
> Assuming that the V5 MetadataVersion change is accepted, to protect
> against forward compatibility issues such implementations would be
> recommended to not allow unsigned dictionary indices to be serialized
> using V4 MetadataVersion.
>
> A PR with the changes to the columnar specification (possibly subject
> to some clarifying language) is at [2].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Accept changes to allow unsigned integer dictionary indices
> [ ] +0
> [ ] -1 Do not accept because...
>
> [1]:
> https://lists.apache.org/thread.html/r746e0a76c4737a2cf48dec656103677169bebb303240e62ae1c66d35%40%3Cdev.arrow.apache.org%3E
> [2]: https://github.com/apache/arrow/pull/7567
>


[jira] [Created] (ARROW-9120) [C++] Lint and Format _internal headers

2020-06-12 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-9120:
---

 Summary: [C++] Lint and Format _internal headers
 Key: ARROW-9120
 URL: https://issues.apache.org/jira/browse/ARROW-9120
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Currently, headers named /*_internal.h/ are neither clang-formatted nor 
cpplinted. Since they're not exported, the lint rules for public headers 
(e.g. forbidding nullptr) need not be applied.





[jira] [Created] (ARROW-9081) [C++] Upgrade to LLVM 10

2020-06-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-9081:
---

 Summary: [C++] Upgrade to LLVM 10
 Key: ARROW-9081
 URL: https://issues.apache.org/jira/browse/ARROW-9081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Upgrade llvm dependencies to use version 10





[DISCUSS] Add kernel integer overflow handling

2020-06-03 Thread Ben Kietzman
https://github.com/apache/arrow/pull/7341#issuecomment-638241193

How should arithmetic kernels handle integer overflow?

The approach currently taken in the linked PR is to promote such that
overflow will not occur, for example `(int8, int8)->int16` and `(uint16,
uint16)->uint32`.

I'm not sure that's desirable. For one thing this leads to inconsistent
handling of 64 bit integer types, which are currently allowed to overflow
since we cannot promote further (NB: that means this kernel includes
undefined behavior for int64).

There are a few other approaches we could take (ordered by personal
preference):

   - define explicit overflow behavior for signed integer operands (for
   example if we declared that add(i8(a), i8(b)) will always be equivalent
   to i8(i16(a) + i16(b)) then we could instantiate only unsigned addition
   kernels; see the sketch below)
   - raise an error on signed overflow
   - provide ArithmeticOptions::overflow_behavior and allow users to choose
   between these
   - require users to pass arguments which will not overflow


[jira] [Created] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources

2020-05-29 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8981:
---

 Summary: [C++][Dataset] Add support for compressed FileSources
 Key: ARROW-8981
 URL: https://issues.apache.org/jira/browse/ARROW-8981
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
 Fix For: 1.0.0


FileSource::compression_ is currently ignored. Ideally files/buffers which are 
compressed could be decompressed on read. See ARROW-8942





Re: Arrow sync call May 13 at 12:00 US/Eastern, 16:00 UTC

2020-05-13 Thread Ben Kietzman
Attendees:
Micah Kornfield
Francois Saint-Jacques
Ben Kietzman
Uwe Korn
Mahmut Bulut
Remi Dettai
Prudhvi Porandla

Discussion:
* How to become a committer
  * to the rust subproject
* SIMD dispatch
  * Following mailing list discussion "[C++] Runtime SIMD dispatching for
    Arrow"
  * How can writing code which uses SIMD intrinsics be made less onerous?
  * Should dispatch be handled with
    * individual functions (simpler)
    * separate library (lower library load overhead at runtime)
  * Some refactoring of the build system will probably be required to allow
    some source files to be compiled with customized flags
* Nested parquet: Antoine observed a performance regression introduced by
  Micah's pre-nested refactoring, further investigation required

On Wed, May 13, 2020 at 11:35 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> Last minute reminder that our biweekly call is coming up in a half hour at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterward.
>
> Neal
>


[jira] [Created] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

2020-04-30 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8658:
---

 Summary: [C++][Dataset] Implement subtree pruning for 
FileSystemDataset::GetFragments
 Key: ARROW-8658
 URL: https://issues.apache.org/jira/browse/ARROW-8658
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This is a very handy optimization for large datasets with multiple partition 
fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a 
filter {{"a"_ == 2}}, none of its files or subdirectories need be examined.

After ARROW-8318 FileSystemDataset stores only files so subtree pruning (whose 
implementation depended on the presence of directories to represent subtrees) 
was disabled. It should be possible to reintroduce this without reference to 
directories by examining partition expressions directly and extracting a tree 
structure from their subexpressions.





[jira] [Created] (ARROW-8632) [C++] Fix conversion error warning in array_union_test.cc

2020-04-29 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8632:
---

 Summary: [C++] Fix conversion error warning in array_union_test.cc
 Key: ARROW-8632
 URL: https://issues.apache.org/jira/browse/ARROW-8632
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0



https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/3257/job/c4f2kqcsm04gjd8u#L1074






[jira] [Created] (ARROW-8631) [C++][Dataset] Add ConvertOptions and ReadOptions to CsvFileFormat

2020-04-29 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8631:
---

 Summary: [C++][Dataset] Add ConvertOptions and ReadOptions to 
CsvFileFormat
 Key: ARROW-8631
 URL: https://issues.apache.org/jira/browse/ARROW-8631
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


https://github.com/apache/arrow/pull/7033 does not add ConvertOptions 
(including alternate spellings for null/true/false, etc) or ReadOptions 
(block_size, column name customization, etc). These will be helpful but will 
require some discussion to find the optimal way to integrate them with dataset::





[jira] [Created] (ARROW-8630) [C++][Dataset] Pass schema including all materialized fields to catch CSV edge cases

2020-04-29 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8630:
---

 Summary: [C++][Dataset] Pass schema including all materialized 
fields to catch CSV edge cases
 Key: ARROW-8630
 URL: https://issues.apache.org/jira/browse/ARROW-8630
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


see discussion here 
https://github.com/apache/arrow/pull/7033#discussion_r416941674

Fields filtered but not projected will revert to their inferred type, whatever 
their dataset's schema may be. This can cause validated filters to fail due to 
type disagreements





[jira] [Created] (ARROW-8618) [C++] ASSIGN_OR_RAISE should move its argument

2020-04-28 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8618:
---

 Summary: [C++] ASSIGN_OR_RAISE should move its argument
 Key: ARROW-8618
 URL: https://issues.apache.org/jira/browse/ARROW-8618
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Since {{ASSIGN_OR_RAISE}} consumes its {{Result}} argument anyway, there's no 
reason not to cast that argument to an rvalue reference whenever possible. This 
will decrease boilerplate when handling non-temporary {{Result}}s, for example 
when yielding from an iterator:

{code:diff}
 for (auto maybe_batch : scan_task->Execute()) {
-  ASSIGN_OR_RAISE(auto batch, std::move(maybe_batch));
+  ASSIGN_OR_RAISE(auto batch, maybe_batch);
 }
{code}
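
A sketch of the forwarding idea (not Arrow's actual macro, which uses
internal helpers and unique temporary names):

{code}
// Bind the Result argument as a forwarding reference so named Results are
// moved from automatically, with no std::move at call sites.
#define ASSIGN_OR_RAISE_SKETCH(lhs, rexpr)    \
  auto&& _result = (rexpr);                   \
  if (!_result.ok()) return _result.status(); \
  lhs = std::move(_result).ValueOrDie();
{code}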





[jira] [Created] (ARROW-8472) [Go][Integration] Represent 64 bit integers as JSON::string

2020-04-15 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8472:
---

 Summary: [Go][Integration] Represent 64 bit integers as 
JSON::string
 Key: ARROW-8472
 URL: https://issues.apache.org/jira/browse/ARROW-8472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


see ARROW-6407





[jira] [Created] (ARROW-8471) [C++][Integration] Regression to /u?int64/ as JSON::number

2020-04-15 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8471:
---

 Summary: [C++][Integration] Regression to /u?int64/ as JSON::number
 Key: ARROW-8471
 URL: https://issues.apache.org/jira/browse/ARROW-8471
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


In moving datagen.py under archery, the fix for ARROW-6310 was clobbered, 
resulting in 64 bit integers being represented as numbers in integration JSON.





[jira] [Created] (ARROW-8434) [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times

2020-04-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8434:
---

 Summary: [C++] Ipc RecordBatchFileReader deserializes the Schema 
multiple times
 Key: ARROW-8434
 URL: https://issues.apache.org/jira/browse/ARROW-8434
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


This extra work is redundant and should be skipped

https://github.com/apache/arrow/blob/0c5296d3353e494bbf65c8cbc6b56525db6ae084/cpp/src/arrow/ipc/reader.cc#L800





[jira] [Created] (ARROW-8432) [Python][CI] Failure to download Hadoop

2020-04-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8432:
---

 Summary: [Python][CI] Failure to download Hadoop
 Key: ARROW-8432
 URL: https://issues.apache.org/jira/browse/ARROW-8432
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


https://circleci.com/gh/ursa-labs/crossbow/11128?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link

This is caused by an HTTP request failure 
https://github.com/apache/arrow/blob/master/ci/docker/conda-python-hdfs.dockerfile#L36

We should probably not rely on https://www.apache.org/dyn/mirrors/mirrors.cgi 
to get tarballs. Currently there are three:

{code}
ci/docker/conda-python-hdfs.dockerfile
36:RUN wget -q -O - 
"https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=hadoop/common/hadoop-${hdfs}/hadoop-${hdfs}.tar.gz;
 | tar -xzf - -C /opt

ci/docker/linux-apt-docs.dockerfile
57:RUN wget -q -O - 
"https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=maven/maven-3/${maven}/binaries/apache-maven-${maven}-bin.tar.gz;
 | tar -xzf - -C /opt

python/manylinux1/scripts/build_thrift.sh
22:  
"https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=${THRIFT_DOWNLOAD_PATH};
 \
{code}

Factor these out into a reusable script for downloading apache tarballs. It 
should contain hard-coded apache mirrors and retry when connections fail.





[jira] [Created] (ARROW-8428) [CI][NIGHTLY:gandiva-jar-trusty] GCC 4.8 failures in C++ unit tests

2020-04-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8428:
---

 Summary: [CI][NIGHTLY:gandiva-jar-trusty] GCC 4.8 failures in C++ 
unit tests
 Key: ARROW-8428
 URL: https://issues.apache.org/jira/browse/ARROW-8428
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


See https://issues.apache.org/jira/browse/ARROW-8388

Not reported by the CI job added in that issue since manylinux1 doesn't 
currently build the c++ unit tests.





[jira] [Created] (ARROW-8388) [C++] GCC 4.8 fails to move on return

2020-04-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8388:
---

 Summary: [C++] GCC 4.8 fails to move on return
 Key: ARROW-8388
 URL: https://issues.apache.org/jira/browse/ARROW-8388
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


See https://github.com/apache/arrow/pull/6883#issuecomment-611661733

This is a recurring problem which usually shows up as a broken nightly (the 
gandiva nightly jobs, specifically) along with similar issues due to gcc 4.8's 
incomplete handling of c++11. As long as someone depends on these we should 
probably have an every-commit CI job which checks we haven't introduced such a 
breakage





[jira] [Created] (ARROW-8367) [C++] Is FromString(..., pool) worthwhile

2020-04-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8367:
---

 Summary: [C++] Is FromString(..., pool) worthwhile
 Key: ARROW-8367
 URL: https://issues.apache.org/jira/browse/ARROW-8367
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


From [https://github.com/apache/arrow/pull/6863#discussion_r404913683]

There are currently two overloads of {{Buffer::FromString}}, one which takes an 
rvalue reference to string and another which takes a const reference and a 
MemoryPool. In the former case the string is simply moved into a Buffer 
subclass while in the latter the MemoryPool is used to allocate space into 
which the string's contents are copied, which necessitates bubbling the 
potential allocation failure. This seems gratuitous given we don't use 
{{std::string}} to store large quantities so it should be fine to provide only
{code:java}
  static std::unique_ptr<Buffer> FromString(std::string data); 
{code}
and rely on {{std::string}}'s copy constructor when the argument is not an 
rvalue.

In the case of a {{std::string}} which may/does contain large data and must be 
copied, tracking the copied memory with a MemoryPool does not require a great 
deal of boilerplate:
{code:java}
ARROW_ASSIGN_OR_RAISE(auto buffer,
  Buffer(large).CopySlice(0, large.size(), pool));
{code}





[jira] [Created] (ARROW-8328) [C++] MSVC is not respecting warning-disable flags

2020-04-03 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8328:
---

 Summary: [C++] MSVC is not respecting warning-disable flags
 Key: ARROW-8328
 URL: https://issues.apache.org/jira/browse/ARROW-8328
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


We provide [warning-disabling flags to 
MSVC|https://github.com/apache/arrow/blob/72433c6/cpp/cmake_modules/SetupCxxFlags.cmake#L151-L153]
 including one which should disable all conversion warnings. However this is 
not completely effective, and Appveyor will still emit conversion warnings 
(which are then treated as errors), requiring insertion of otherwise 
unnecessary explicit casts or {{#pragma}}s (for example 
https://github.com/apache/arrow/pull/6820 ).

Perhaps flag ordering is significant? In any case, as we have conversion 
warnings disabled for other compilers we should ensure they are completely 
disabled for MSVC as well.





[jira] [Created] (ARROW-8323) [C++] Pin gRPC at v1.27 to avoid compilation error in its headers

2020-04-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8323:
---

 Summary: [C++] Pin gRPC at v1.27 to avoid compilation error in its 
headers
 Key: ARROW-8323
 URL: https://issues.apache.org/jira/browse/ARROW-8323
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


[gRPC 1.28|https://github.com/grpc/grpc/releases/tag/v1.28.0] includes a change 
which introduces an implicit size_t->int conversion in proto_utils.h: 
https://github.com/grpc/grpc/commit/2748755a4ff9ed940356e78c105f55f839fdf38b

Conversion warnings are treated as errors for example here: 
https://ci.appveyor.com/project/BenjaminKietzman/arrow/build/job/9cl0vqa8e495knn3#L1126
So IIUC we need to pin gRPC to 1.27 for now.

Upstream PR: https://github.com/grpc/grpc/pull/22557





[jira] [Created] (ARROW-8315) [Python]

2020-04-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8315:
---

 Summary: [Python]
 Key: ARROW-8315
 URL: https://issues.apache.org/jira/browse/ARROW-8315
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ben Kietzman








[jira] [Created] (ARROW-8310) [C++] Minio's exceptions not recognized by IsConnectError()

2020-04-01 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8310:
---

 Summary: [C++] Minio's exceptions not recognized by 
IsConnectError()
 Key: ARROW-8310
 URL: https://issues.apache.org/jira/browse/ARROW-8310
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Minio emits an {{XMinioServerNotInitialized}} exception on failure to connect, 
which is recognized by {{ConnectRetryStrategy}} and used to trigger a retry 
instead of an error. This exception carries HTTP status 503.

However this code does not round trip through the AWS SDK, which maintains an 
explicit [mapping from known exception names to error 
codes|https://github.com/aws/aws-sdk-cpp/blob/d36c2b16c9c3caf81524ebfff1e70782b8e1a006/aws-cpp-sdk-core/source/client/CoreErrors.cpp#L37]
 and will demote an unrecognized exception name [to 
{{CoreErrors::UNKNOWN}}|https://github.com/aws/aws-sdk-cpp/blob/master/aws-cpp-sdk-core/source/client/AWSErrorMarshaller.cpp#L150]

The end result is flakiness in the test (and therefore CI) since 
{{ConnectRetryStrategy}} never gets a chance to operate, see for example 
https://github.com/apache/arrow/pull/6789/checks?check_run_id=552871444#step:6:1778

Probably {{IsConnectError}} will need to examine the error string in the event 
of {{CoreErrors::UNKNOWN}}.





[jira] [Created] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers

2020-03-31 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8296:
---

 Summary: [C++][Dataset] IpcFileFormat should support writing files 
with compressed buffers
 Key: ARROW-8296
 URL: https://issues.apache.org/jira/browse/ARROW-8296
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0








[jira] [Created] (ARROW-8295) [C++][Dataset] IpcFileFormat should explicitly push down column projection

2020-03-31 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8295:
---

 Summary: [C++][Dataset] IpcFileFormat should explicitly push down 
column projection
 Key: ARROW-8295
 URL: https://issues.apache.org/jira/browse/ARROW-8295
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0








[jira] [Created] (ARROW-8235) [C++][Compute] Filter out nulls by default

2020-03-26 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8235:
---

 Summary: [C++][Compute] Filter out nulls by default
 Key: ARROW-8235
 URL: https://issues.apache.org/jira/browse/ARROW-8235
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


The filter kernel currently emits null when a slot in the selection mask is 
null. For compatibility with Kleene logic systems like SQL, this behavior 
should be configurable. Provide an option enumeration:

{code}
struct FilterOptions {
  enum NullSelectionBehavior {
    /// null slots in the selection mask will drop the filtered value
    DROP,
    /// null slots in the selection mask will keep the filtered value
    KEEP,
    /// null slots in the selection mask will replace the filtered value
    /// with null
    EMIT_NULL,
  } null_selection_behavior;
};
{code}
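
A worked example of the proposed behaviors (illustrative values only):

{code}
// values = [1, 2, 3], selection mask = [true, null, false]
//   DROP      -> [1]        (null selection drops the value)
//   KEEP      -> [1, 2]     (null selection keeps the value)
//   EMIT_NULL -> [1, null]  (null selection emits null)
{code}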





[jira] [Created] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment

2020-03-24 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8201:
---

 Summary: [Python][Dataset] Improve ergonomics of FileFragment
 Key: ARROW-8201
 URL: https://issues.apache.org/jira/browse/ARROW-8201
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


FileFragment can be made more directly useful by adding convenience methods.

For example, a FileFragment could allow underlying file/buffer to be opened 
directly:
{code}
def open(self):
    """
    Open a NativeFile of the buffer or file viewed by this fragment.
    """
    cdef:
        CFileSystem* c_filesystem
        shared_ptr[CRandomAccessFile] opened
        NativeFile out = NativeFile()

    buf = self.buffer
    if buf is not None:
        return pa.io.BufferReader(buf)

    with nogil:
        c_filesystem = self.file_fragment.source().filesystem()
        opened = GetResultValue(c_filesystem.OpenInputFile(
            self.file_fragment.source().path()))

    out.set_random_access_file(opened)
    out.is_readable = True
    return out
{code}

Additionally, a ParquetFileFragment's metadata could be introspectable:
{code}
@property
def metadata(self):
    from pyarrow._parquet import ParquetReader
    reader = ParquetReader()
    reader.open(self.open())
    return reader.metadata
{code}





[jira] [Created] (ARROW-8164) [C++][Dataset] Let datasets be viewable with non-identical schema

2020-03-19 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8164:
---

 Summary: [C++][Dataset] Let datasets be viewable with 
non-identical schema
 Key: ARROW-8164
 URL: https://issues.apache.org/jira/browse/ARROW-8164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


It would be useful to allow some schema unification capability after discovery 
has completed. For example, if a FileSystemDataset is being wrapped into a 
UnionDataset with another and their schemas are unifiable then there is no 
reason we can't create the UnionDataset (rather than emitting an error because 
the schemas are not identical).

I think this behavior will be most naturally expressed in C++ like so:

{code}
virtual Result<std::shared_ptr<Dataset>> ReplaceSchema(
    std::shared_ptr<Schema> schema) const = 0;
{code}

which will raise an error if the provided schema is not unifiable with the 
current dataset schema.

If this needs to be extended to non-trivial projections then this will probably 
warrant a separate class, {{ProjectedDataset}} or so. Definitely follow-up 
material (if desired).





[jira] [Created] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy

2020-03-19 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8163:
---

 Summary: [C++][Dataset] Allow FileSystemDataset's file list to be 
lazy
 Key: ARROW-8163
 URL: https://issues.apache.org/jira/browse/ARROW-8163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


A FileSystemDataset currently requires a full listing of files it contains on 
construction, so a scan cannot start until all files in the dataset are 
discovered. Instead it would be ideal if a large dataset could be constructed 
with a lazy file listing so that scans can start immediately.





[jira] [Created] (ARROW-8137) [C++][Dataset] Investigate multithreaded discovery

2020-03-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8137:
---

 Summary: [C++][Dataset] Investigate multithreaded discovery
 Key: ARROW-8137
 URL: https://issues.apache.org/jira/browse/ARROW-8137
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Currently FileSystemDatasetFactory inspects all files serially. For slow file 
systems, or systems which support batched reads, this could be accelerated by 
inspecting files in parallel.
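
A minimal sketch of what parallel inspection might look like, assuming a 
thread-safe collector ({{AddInspected}} here is hypothetical):

{code}
auto task_group = arrow::internal::TaskGroup::MakeThreaded(
    arrow::internal::GetCpuThreadPool());
for (const auto& path : paths) {
  task_group->Append([&, path] {
    // Inspect reads only enough of each file to reconstruct its schema
    ARROW_ASSIGN_OR_RAISE(auto schema, format->Inspect(FileSource(path, fs)));
    AddInspected(path, std::move(schema));  // hypothetical, must be thread safe
    return Status::OK();
  });
}
RETURN_NOT_OK(task_group->Finish());
{code}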



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8114) [Java][Integration] Enable custom_metadata integration test

2020-03-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8114:
---

 Summary: [Java][Integration] Enable custom_metadata integration 
test
 Key: ARROW-8114
 URL: https://issues.apache.org/jira/browse/ARROW-8114
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration, Java
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 0.17.0


This will require refactoring the way metadata is serialized to JSON following 
https://github.com/apache/arrow/pull/6556 (it needs to be {{[{key: "$key", 
value: "$value"}]}} rather than {{ {"$key": "$value"} }}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8113) [C++] Implement a lighter-weight variant

2020-03-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8113:
---

 Summary: [C++] Implement a lighter-weight variant
 Key: ARROW-8113
 URL: https://issues.apache.org/jira/browse/ARROW-8113
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


{{util::variant}} is an extremely useful structure but its header slows 
compilation significantly, so using it in public headers is questionable 
https://github.com/apache/arrow/pull/6545#discussion_r388406246

I'll try writing a lighter weight version.
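
To make the goal concrete, a toy sketch of the direction (a hand-rolled tagged 
union; illustrative only, not the design to be committed):

{code}
#include <new>
#include <type_traits>
#include <utility>

// two-alternative variant with negligible header cost; copy/move are
// deleted rather than implemented to keep the sketch short and correct
template <typename A, typename B>
class EitherOf {
 public:
  explicit EitherOf(A a) : tag_(0) { new (&storage_) A(std::move(a)); }
  explicit EitherOf(B b) : tag_(1) { new (&storage_) B(std::move(b)); }
  EitherOf(const EitherOf&) = delete;
  EitherOf& operator=(const EitherOf&) = delete;
  ~EitherOf() {
    if (tag_ == 0) {
      reinterpret_cast<A*>(&storage_)->~A();
    } else {
      reinterpret_cast<B*>(&storage_)->~B();
    }
  }
  // accessors return nullptr when the other alternative is held
  A* a() { return tag_ == 0 ? reinterpret_cast<A*>(&storage_) : nullptr; }
  B* b() { return tag_ == 1 ? reinterpret_cast<B*>(&storage_) : nullptr; }
 private:
  typename std::aligned_union<0, A, B>::type storage_;
  int tag_;
};
{code}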



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS] Semantics of custom_metadata

2020-03-11 Thread Ben Kietzman
While working on https://issues.apache.org/jira/browse/ARROW-2255
(serialize custom_metadata in the integration tests), we had the following
discussion on GitHub:
https://github.com/apache/arrow/pull/6556#pullrequestreview-372405940

In short, although in Schema.fbs custom_metadata is declared as an array of
KeyValue pairs (so duplicate keys would be possible), all reference
implementations assume it to represent an associative map with unique keys.

Is there a use case for duplicate metadata keys? It seems that an
acceptable resolution might be to note in Schema.fbs that implementations
are allowed to assume that keys are unique

Ben


[jira] [Created] (ARROW-8058) [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions

2020-03-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8058:
---

 Summary: [C++][Python][Dataset] Provide an option to skip 
validation in FileSystemDatasetFactoryOptions
 Key: ARROW-8058
 URL: https://issues.apache.org/jira/browse/ARROW-8058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This can be costly and is not always necessary.

At the same time we could move file validation into the scan tasks; currently 
all files are inspected as the dataset is constructed, which can be expensive 
if the filesystem is slow. We'll be performing the validation multiple times 
but the check will be cheap since at scan time we'll be reading the file into 
memory anyway.
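
From the caller's perspective the opt-out might read as follows (the flag name 
is hypothetical):

{code}
FileSystemDatasetFactoryOptions options;
options.validate_fragments = false;  // hypothetical flag, skip up-front checks
ARROW_ASSIGN_OR_RAISE(
    auto factory,
    FileSystemDatasetFactory::Make(filesystem, selector, format, options));
{code}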



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8047) [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8047:
---

 Summary: [Python][Documentation] Document migration from 
ParquetDataset to pyarrow.datasets
 Key: ARROW-8047
 URL: https://issues.apache.org/jira/browse/ARROW-8047
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Joris Van den Bossche
 Fix For: 0.17.0


We need documentation describing a migration path from ParquetDataset, at least 
for the basic user-facing API of ParquetDataset (as I read it, that's: 
construction, projection, filtering, and threading, for a first pass). 
Following this we could mark ParquetDataset as deprecated, building features 
needed by power users like dask and adding those to the migration document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8046) [Developer][Integration] Makefile.docker's target names are broken

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8046:
---

 Summary: [Developer][Integration] Makefile.docker's target names 
are broken
 Key: ARROW-8046
 URL: https://issues.apache.org/jira/browse/ARROW-8046
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


docker-compose.yml now prefixes targets with platform identifiers: {{cpp -> 
conda-cpp}}

Makefile.docker does not include these prefixes, and should be updated or 
removed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8044) [NIGHTLY:gandiva-jar-osx] pygit2 needs libgit2 v1.0.x

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8044:
---

 Summary: [NIGHTLY:gandiva-jar-osx] pygit2 needs libgit2 v1.0.x
 Key: ARROW-8044
 URL: https://issues.apache.org/jira/browse/ARROW-8044
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


pygit2 (required by crossbow itself) is failing to build because it doesn't 
support the latest version of libgit2: 
https://travis-ci.org/ursa-labs/crossbow/builds/659892661#L4542

For now, we'll have to pin it at 1.0.x: 
https://github.com/libgit2/pygit2/blob/42df4cb3eb95272cb7bdb76fdd4f127d370e3096/src/types.h#L36



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8043:
---

 Summary: [Developer] Provide better visibility for failed nightly 
builds
 Key: ARROW-8043
 URL: https://issues.apache.org/jira/browse/ARROW-8043
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Emails reporting nightly failures are unsatisfactory in two ways: there is a 
large click/scroll distance between the links presented in that email and the 
actual error message. Worse, once one is there it's not clear what JIRAs have 
been made or which of them are in progress.

One solution would be to replace or augment the [NIGHTLY] email with a page 
(https://ursa-labs.github.org/crossbow would be my favorite) which shows how 
many nights it has failed, a shortcut to the actual error line in CI's logs, 
and useful views of JIRA. We could accomplish this with:
- dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
associated with specific jobs
- client side JavaScript to scrape JIRA and update the page dynamically as soon 
as JIRAs are opened
- provide automatic and expedited creation of correctly labelled JIRAs, so that 
viewers can quickly organize/take ownership of a failed nightly job. JIRA 
supports reading form fields from URL parameters, so this would be fairly 
straightforward: 
https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8039) [C++][Python][Dataset] Assemble a minimal ParquetDataset shim

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8039:
---

 Summary: [C++][Python][Dataset] Assemble a minimal ParquetDataset 
shim
 Key: ARROW-8039
 URL: https://issues.apache.org/jira/browse/ARROW-8039
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Assemble a minimal ParquetDataset shim backed by {{pyarrow.dataset.*}}. Replace 
the existing ParquetDataset with the shim by default, allow opt-out for users 
who need the current ParquetDataset

This is mostly exploratory, to see which of the Python tests fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8037) [Developer][Integration] Consolidate example JSON and test/validate uniformly

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8037:
---

 Summary: [Developer][Integration] Consolidate example JSON and 
test/validate uniformly
 Key: ARROW-8037
 URL: https://issues.apache.org/jira/browse/ARROW-8037
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Currently the schema for the integration JSON representation is expressed in 
prose only. It should be rewritten using a JSON schema representation and used 
to validate generated as well as checked-in example JSON. {{jsonschema}} would 
work: https://python-jsonschema.readthedocs.io/en/stable/validate/

Additionally, languages unit test their JSON parsers/converters against 
differing example JSON:

C++ includes some inline 
https://github.com/apache/arrow/blob/b0bb6841af91ed2fddc201ca1ff6bedac7629e2e/cpp/src/arrow/ipc/json_integration_test.cc#L317-L346

While Java uses an explicit file 
https://github.com/apache/arrow/blob/5b783fe35284c97f6bcf107650ba7cfd66a76750/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java#L189

This is brittle and difficult to discover. A single directory of checked-in 
example JSON should be assembled, against which all languages should unit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8035) [Developer][Integration] Add integration tests for extension types

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8035:
---

 Summary: [Developer][Integration] Add integration tests for 
extension types
 Key: ARROW-8035
 URL: https://issues.apache.org/jira/browse/ARROW-8035
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This will extend the test runners for each implementation: each must register 
an extension type (I arbitrarily nominate {{uuid}} from the C++ unit tests for 
extension types) which will be serialized to its storage type and the reserved 
metadata fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8034) [JavaScript][Integration] Enable custom_metadata integration test

2020-03-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8034:
---

 Summary: [JavaScript][Integration] Enable custom_metadata 
integration test
 Key: ARROW-8034
 URL: https://issues.apache.org/jira/browse/ARROW-8034
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration, JavaScript
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


https://github.com/apache/arrow/pull/6556 adds an integration test including 
custom metadata but JavaScript is skipped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Duplicate field names

2020-03-07 Thread Ben Kietzman
Thanks.

I've added
https://issues.apache.org/jira/browse/ARROW-8027 (add integration test
cases including duplicated field names)
and
https://issues.apache.org/jira/browse/ARROW-8028 (remove the restriction
from the Go implementation)

On Sat, Mar 7, 2020 at 11:07 AM Wes McKinney  wrote:

> Since duplicate field names are permitted (which is to say, not
> prohibited) by the Arrow IPC metadata, it seems appropriate to probe
> this behavior in the integration tests.
>
> On Sat, Mar 7, 2020 at 10:02 AM Ben Kietzman 
> wrote:
> >
> > Go asserts unique field names,
> > https://github.com/apache/arrow/blob/084549a/go/arrow/schema.go#L117
> >
> > The C++ (and Java, IIUC) implementation does not, and field name
> uniqueness
> > is not discussed in Schema.fbs
> >
> > I discovered this when adding a schema with duplicate field names to
> > datagen.py in the integration tests as part of a patch for ARROW-2255
> > (custom metadata integration tests)
> >
> > - Go failure:
> >
> https://github.com/apache/arrow/pull/6556/checks?check_run_id=491383663#step:5:5030
> > - Java failure (maybe unrelated?):
> >
> https://github.com/apache/arrow/pull/6556/checks?check_run_id=491383663#step:5:4827
> >
> > I'll remove the duplicated field name from my patch, but this is
> > unsatisfactory because I'm not sure what follow up JIRA(s) to open.
> Should
> > we have an integration test which ensures sibling fields may have
> identical
> > names? Or is field uniqueness a choice an implementation may make?
>


[jira] [Created] (ARROW-8028) [Go] Allow duplicate field names in schemas and nested types

2020-03-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8028:
---

 Summary: [Go] Allow duplicate field names in schemas and nested 
types
 Key: ARROW-8028
 URL: https://issues.apache.org/jira/browse/ARROW-8028
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Go's implementation of Schema panics if field names are duplicated within a 
schema. This is not guaranteed by the standard, so Go will not be able to 
handle valid record batches produced by other implementations which contain 
these.

https://github.com/apache/arrow/blob/084549a/go/arrow/schema.go#L117



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8027) [Developer][Integration] Add integration tests for duplicate field names

2020-03-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8027:
---

 Summary: [Developer][Integration] Add integration tests for 
duplicate field names
 Key: ARROW-8027
 URL: https://issues.apache.org/jira/browse/ARROW-8027
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Schemas and nested types whose fields' names are not unique are permitted, so 
the integration tests should include a case which exercises these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8022) [C++] Provide or Vendor a small_vector implementation

2020-03-06 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8022:
---

 Summary: [C++] Provide or Vendor a small_vector implementation
 Key: ARROW-8022
 URL: https://issues.apache.org/jira/browse/ARROW-8022
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


{{small_vector<>}} is a C++ utility class which does not use heap allocation 
for small numbers of elements. 
[Folly|https://github.com/facebook/folly/blob/master/folly/docs/small_vector.md],
 
[Boost|https://github.com/boostorg/container/blob/develop/include/boost/container/small_vector.hpp],
 
[Abseil|https://github.com/abseil/abseil-cpp/blob/master/absl/container/inlined_vector.h],
 and [LLVM|https://llvm.org/doxygen/classllvm_1_1SmallVector.html] each provide 
one.

In many cases a vector has few elements but might have many. If we use 
std::vector we have to bother the allocator unless the vector is actually 
empty. My specific use case for this is field lookup by name: I expect that 
most schemas will have unique field names, but strictly speaking we support 
duplicate field names. It would be ideal not to incur a performance penalty for 
the 99.9% of field lookups which yield 0 or 1 fields just to accommodate the 
case where there may be multiple.
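
Sketched against the shared interface of the libraries above, the lookup use 
case would read (names illustrative):

{code}
// 0 or 1 matching fields: no heap allocation; more than 1: spills to the heap
small_vector<std::shared_ptr<Field>, 1> GetAllFieldsByName(
    const std::string& name) const;
{code}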







--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8014) [C++] Provide CMake targets to test only within a given label

2020-03-05 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8014:
---

 Summary: [C++] Provide CMake targets to test only within a given 
label
 Key: ARROW-8014
 URL: https://issues.apache.org/jira/browse/ARROW-8014
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Tests are labelled but this feature is not easily accessible from Ninja or 
Make. Provide targets like {{test-label-arrow_dataset}} which exercises only 
the tests labelled {{arrow_dataset}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8012) [C++][CI] Set CTEST_PARALLEL_LEVEL to $concurrency

2020-03-05 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8012:
---

 Summary: [C++][CI] Set CTEST_PARALLEL_LEVEL to $concurrency
 Key: ARROW-8012
 URL: https://issues.apache.org/jira/browse/ARROW-8012
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, CI
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Currently the default {{test}} target runs serially while {{unittest}} 
arbitrarily uses 4 threads. On many systems that's suboptimal. The environment 
variable {{CTEST_PARALLEL_LEVEL}} can be set to run tests in parallel, and we 
should probably have a good default for it (like the hardware concurrency)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7946) [C++] Deduplicate schema equivalence checks

2020-02-26 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7946:
---

 Summary: [C++] Deduplicate schema equivalence checks
 Key: ARROW-7946
 URL: https://issues.apache.org/jira/browse/ARROW-7946
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


There are several locations where a group of schemas is checked for 
equivalence, including {{UnionDataset::Make}}, {{Table::FromRecordBatches}}, 
{{ConcatenateTables}}, and {{WriteRecordBatchStream}}. These should be 
extracted to a helper function.
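
A sketch of such a helper (the name is assumed):

{code}
Status EnsureSchemasEqual(const std::vector<std::shared_ptr<Schema>>& schemas) {
  for (size_t i = 1; i < schemas.size(); ++i) {
    if (!schemas[0]->Equals(*schemas[i])) {
      return Status::Invalid("Schema at index ", i, ":\n",
                             schemas[i]->ToString(), "\nis not equal to:\n",
                             schemas[0]->ToString());
    }
  }
  return Status::OK();
}
{code}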



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7945) [C++][Dataset] Implement InMemoryDatasetFactory

2020-02-26 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7945:
---

 Summary: [C++][Dataset] Implement InMemoryDatasetFactory
 Key: ARROW-7945
 URL: https://issues.apache.org/jira/browse/ARROW-7945
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This will allow in memory datasets (such as tables) to participate in discovery 
through {{UnionDatasetFactory}}. This class will be trivial since Inspect will 
do nothing but return the table's schema, but is necessary to ensure that the 
resulting {{UnionDataset}}'s unified schema accommodates the table's schema 
(for example, including fields present only in the table's schema, or emitting 
an error when unification is not possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7938) [C++] Add tests for DayTimeIntervalBuilder

2020-02-25 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7938:
---

 Summary: [C++] Add tests for DayTimeIntervalBuilder
 Key: ARROW-7938
 URL: https://issues.apache.org/jira/browse/ARROW-7938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Micah Kornfield
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7916) [C++][Dataset] Project IPC record batches to materialized fields

2020-02-21 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7916:
---

 Summary: [C++][Dataset] Project IPC record batches to materialized 
fields
 Key: ARROW-7916
 URL: https://issues.apache.org/jira/browse/ARROW-7916
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


If batches mmaped from disk are projected before post filtering, unreferenced 
columns will never be accessed (so the memory map shouldn't do I/O on them).

At the same time, it'd probably be wise to explicitly document that batches 
yielded directly from fragments rather than from a Scanner will not be filtered 
or projected (so they will not match the fragment's schema and will include 
columns referenced by the filter even if they were not projected).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7910) [C++] Provide function to query page size portably

2020-02-21 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7910:
---

 Summary: [C++] Provide function to query page size portably
 Key: ARROW-7910
 URL: https://issues.apache.org/jira/browse/ARROW-7910
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


Page size is a useful default buffer size for buffered readers. Where should 
this property be attached? MemoryManager/Device?
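
The portable query itself is straightforward; a minimal sketch (where to 
attach it is the open question above):

{code}
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

#include <cstdint>

// returns the operating system's virtual memory page size in bytes
int64_t GetPageSize() {
#ifdef _WIN32
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  return static_cast<int64_t>(info.dwPageSize);
#else
  return static_cast<int64_t>(sysconf(_SC_PAGESIZE));
#endif
}
{code}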



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7896) [C++] Refactor from #include guards to #pragma once

2020-02-20 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7896:
---

 Summary: [C++] Refactor from #include guards to #pragma once
 Key: ARROW-7896
 URL: https://issues.apache.org/jira/browse/ARROW-7896
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


All compilers we support handle {{#pragma once}} correctly, and it reduces our 
header boilerplate.
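
The refactor in miniature (header name hypothetical):

{code}
// before:
#ifndef ARROW_UTIL_FOO_H
#define ARROW_UTIL_FOO_H
// ... declarations ...
#endif  // ARROW_UTIL_FOO_H

// after:
#pragma once
// ... declarations ...
{code}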



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7894) [C++] DefineOptions should invoke add_definitions

2020-02-20 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7894:
---

 Summary: [C++] DefineOptions should invoke add_definitions
 Key: ARROW-7894
 URL: https://issues.apache.org/jira/browse/ARROW-7894
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Several build options are mirrored as preprocessor definitions, for example 
{{ARROW_JEMALLOC}}. This could be made more consistent by requiring that every 
option in DefineOptions should also define a preprocessor macro with 
{{add_definitions}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7886) [C++][Dataset] Consolidate Source and Dataset

2020-02-19 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7886:
---

 Summary: [C++][Dataset] Consolidate Source and Dataset
 Key: ARROW-7886
 URL: https://issues.apache.org/jira/browse/ARROW-7886
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Source and Dataset are very similar concepts (collections of multiple data 
fragments). Consolidating them would decrease doc burden without reducing our 
flexibility.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7824) [C++][Dataset] Provide Dataset writing to IPC format

2020-02-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7824:
---

 Summary: [C++][Dataset] Provide Dataset writing to IPC format
 Key: ARROW-7824
 URL: https://issues.apache.org/jira/browse/ARROW-7824
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Begin with writing to the IPC format, since it is simpler than Parquet, and to 
efficiently support the "locally cached extract" workflow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7822) [C++] Allocation free error Status constants

2020-02-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7822:
---

 Summary: [C++] Allocation free error Status constants
 Key: ARROW-7822
 URL: https://issues.apache.org/jira/browse/ARROW-7822
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


{{Status::state_}} could be made a tagged pointer without affecting the fast 
path (passing around a non-error status). The extra bit could be used to mark 
a Status' state as heap allocated or not, allowing error statuses to be 
extremely cheap when their error state is known to be immutable. For example, 
this would allow a cheap default of {{Result<>::status_}}.
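
A sketch of the tagged-pointer idea (illustrative only; {{State}} stands in 
for Status' real state struct, whose alignment guarantees the low bit is free):

{code}
#include <cstdint>
#include <string>

struct State {
  int code;
  std::string msg;
};

// low bit set => statically allocated, immutable state which is never freed
class TaggedStatePtr {
 public:
  static TaggedStatePtr Immutable(const State* s) {
    return TaggedStatePtr(reinterpret_cast<std::uintptr_t>(s) | 1);
  }
  static TaggedStatePtr Owned(State* s) {
    return TaggedStatePtr(reinterpret_cast<std::uintptr_t>(s));
  }
  bool owned() const { return (bits_ & 1) == 0; }
  const State* get() const {
    return reinterpret_cast<const State*>(bits_ & ~std::uintptr_t(1));
  }
 private:
  explicit TaggedStatePtr(std::uintptr_t bits) : bits_(bits) {}
  std::uintptr_t bits_;
};
{code}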



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7664) [C++] Extract localfs default from FileSystemFromUri

2020-01-23 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7664:
---

 Summary: [C++] Extract localfs default from FileSystemFromUri
 Key: ARROW-7664
 URL: https://issues.apache.org/jira/browse/ARROW-7664
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Antoine Pitrou
 Fix For: 1.0.0


[https://github.com/apache/arrow/pull/6257#pullrequestreview-347506792]

The argument to FileSystemFromUri should always be rfc3986 formatted. The 
current fallback to localfs can be recovered by adding {{static string 
Uri::FromPath(string)}} which wraps 
[uriWindowsFilenameToUriStringA|https://uriparser.github.io/doc/api/latest/Uri_8h.html#a422dc4a2b979ad380a4dfe007e3de845]
 and the corresponding unix path function.
{code:java}
FileSystemFromUri(Uri::FromPath(R"(E:\dir\file.txt)"), &fs)
{code}
This is a little more boilerplate but I think it's worthwhile to be explicit 
here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7594) [C++] Implement HTTP and FTP file systems

2020-01-16 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7594:
---

 Summary: [C++] Implement HTTP and FTP file systems
 Key: ARROW-7594
 URL: https://issues.apache.org/jira/browse/ARROW-7594
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
 Fix For: 1.0.0


It'd be handy to have a (probably read-only) generic filesystem implementation 
which wrapped {{any cURLable base url}}:

{code}
ARROW_ASSIGN_OR_RAISE(auto fs,
                      HttpFileSystem::Make("https://some.site/json-api/v3"));
ASSERT_OK_AND_ASSIGN(auto json_stream, fs->OpenInputStream("slug"));
// ...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7415) [C++][Dataset] Implement IpcFormat for sources composed of ipc files

2019-12-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7415:
---

 Summary: [C++][Dataset] Implement IpcFormat for sources composed 
of ipc files
 Key: ARROW-7415
 URL: https://issues.apache.org/jira/browse/ARROW-7415
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Dataset
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Currently only parquet is supported. IPC files make a nice test case for 
multiple file formats since they also have a completely unambiguous physical 
schema (unlike CSV) and support for reading/writing is already present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7414) [R][Dataset] Add tests for PartitionSchemeDiscovery

2019-12-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7414:
---

 Summary: [R][Dataset] Add tests for PartitionSchemeDiscovery
 Key: ARROW-7414
 URL: https://issues.apache.org/jira/browse/ARROW-7414
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7413) [c

2019-12-17 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7413:
---

 Summary: [c
 Key: ARROW-7413
 URL: https://issues.apache.org/jira/browse/ARROW-7413
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ben Kietzman






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7391) [Python] Remove unnecessary classes from the binding layer

2019-12-13 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7391:
---

 Summary: [Python] Remove unnecessary classes from the binding layer
 Key: ARROW-7391
 URL: https://issues.apache.org/jira/browse/ARROW-7391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Ben Kietzman


Several Python classes introduced by https://github.com/apache/arrow/pull/5237 
are unnecessary and can be removed in favor of simple functions which produce 
opaque pointers, including the PartitionScheme and Expression classes. These 
should be removed to reduce cognitive overhead of the Python datasets API and 
to loosen coupling between Python and C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7382) [C++][Dataset] Refactor FsDsDiscovery constructors

2019-12-12 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7382:
---

 Summary: [C++][Dataset] Refactor FsDsDiscovery constructors
 Key: ARROW-7382
 URL: https://issues.apache.org/jira/browse/ARROW-7382
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This constructor should not take a vector of filestats. Instead, provide a 
convenience constructor taking paths and a filesystem pointer. Also, ensure 
that missing parent directories are injected (otherwise partition scheme 
application will fail).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7373) [C++][Dataset] Remove FileSource

2019-12-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7373:
---

 Summary: [C++][Dataset] Remove FileSource
 Key: ARROW-7373
 URL: https://issues.apache.org/jira/browse/ARROW-7373
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


FileSource doesn't do enough, and should be removed. Methods in {{FileFormat}} 
etc. which reference the class should be refactored to take a 
{{RandomAccessFile}}, with convenience overloads provided to take a buffer or 
a (path, filesystem) pair.
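
A sketch of the refactored surface (signatures assumed):

{code}
// primary API:
virtual Result<std::shared_ptr<Schema>> Inspect(
    std::shared_ptr<io::RandomAccessFile> file) const = 0;

// convenience overloads forwarding to the above:
Result<std::shared_ptr<Schema>> Inspect(std::shared_ptr<Buffer> buffer) const;
Result<std::shared_ptr<Schema>> Inspect(
    const std::string& path, std::shared_ptr<fs::FileSystem> filesystem) const;
{code}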



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7366) [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery

2019-12-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7366:
---

 Summary: [C++][Dataset] Use PartitionSchemeDiscovery in 
DataSourceDiscovery
 Key: ARROW-7366
 URL: https://issues.apache.org/jira/browse/ARROW-7366
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman


https://github.com/apache/arrow/pull/5950 introduces 
{{PartitionSchemeDiscovery}}, but ideally it would be supplied as an option to 
data source discovery, with the partition scheme automatically discovered 
based on the file paths accumulated there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7332) [C++][Parquet]

2019-12-05 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7332:
---

 Summary: [C++][Parquet] 
 Key: ARROW-7332
 URL: https://issues.apache.org/jira/browse/ARROW-7332
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


PARQUET_THROW_NOT_OK throws a ParquetStatusException, which contains a full 
Status rather than just an error string. These could be caught explicitly in 
PARQUET_CATCH_NOT_OK and the original status returned rather than creating a 
new status:

{code}
  } catch (const ::parquet::ParquetStatusException& e) { \
    return e.status();                                   \
  } catch (const ::parquet::ParquetException& e) {       \
    return Status::IOError(e.what());                    \
  }
{code}

This will retain the original StatusCode rather than overwriting it with 
IOError.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7319) [C++] Refactor Iterator to yield Result

2019-12-04 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7319:
---

 Summary: [C++] Refactor Iterator to yield Result
 Key: ARROW-7319
 URL: https://issues.apache.org/jira/browse/ARROW-7319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7313) [C++] Add function for retrieving a scalar from an array slot

2019-12-04 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7313:
---

 Summary: [C++] Add function for retrieving a scalar from an array 
slot
 Key: ARROW-7313
 URL: https://issues.apache.org/jira/browse/ARROW-7313
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


It'd be useful to construct scalar values given an array and an index.

{code}
/* static */ std::shared_ptr<Scalar> Scalar::FromArray(const Array&, int64_t);
{code}

Since this is much less efficient than unboxing the entire array and accessing 
its buffers directly, it should not be used in hot loops.

[~kszucs] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7303) [C++] Refactor benchmarks to use new Result APIs

2019-12-03 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7303:
---

 Summary: [C++] Refactor benchmarks to use new Result APIs
 Key: ARROW-7303
 URL: https://issues.apache.org/jira/browse/ARROW-7303
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


When building benchmarks, I get the following error:
{code}
../src/arrow/csv/converter_benchmark.cc:83:64: error: too many arguments to 
function call, expected 2, have 3
ABORT_NOT_OK(converter->Convert(parser, 0 /* col_index */, ));
{code}

This was introduced by ARROW-7236. I guess the CI didn't catch it because we 
don't currently build benchmarks? [~apitrou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7179) [C++][Compute] Coalesce kernel

2019-11-15 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7179:
---

 Summary: [C++][Compute] Coalesce kernel
 Key: ARROW-7179
 URL: https://issues.apache.org/jira/browse/ARROW-7179
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Add a kernel which replaces null values in an array with a scalar value or with 
values taken from another array:

{code}
coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
{code}

The code in {{take_internal.h}} should be of some use with a bit of refactoring.

A filter Expression should be added at the same time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString

2019-11-14 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7172:
---

 Summary: [C++][Dataset] Improve format of Expression::ToString
 Key: ARROW-7172
 URL: https://issues.apache.org/jira/browse/ARROW-7172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read 
{{"b"_ > int32(3)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7151) [C++] Refactor ExpressionEvaluator to yield Arrays

2019-11-12 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7151:
---

 Summary: [C++] Refactor ExpressionEvaluator to yield Arrays
 Key: ARROW-7151
 URL: https://issues.apache.org/jira/browse/ARROW-7151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Currently expressions can be evaluated to scalars or arrays, mostly to 
accommodate ScalarExpression. Instead let all expressions be evaluable to 
Array only. ScalarExpression will evaluate to an array of repeated values, but 
expressions whose corresponding kernels can accept a scalar directly 
(comparison, for example) can avoid materializing this array.
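
Where materialization is unavoidable, ScalarExpression could be backed by 
something like the following (assuming a helper along the lines of 
{{MakeArrayFromScalar}}):

{code}
// repeat the scalar so kernels that require array input can consume it
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Array> repeated,
                      MakeArrayFromScalar(*scalar, batch->num_rows()));
{code}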



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7086:
---

 Summary: [C++] Provide a wrapper for invoking factories to produce 
a Result
 Key: ARROW-7086
 URL: https://issues.apache.org/jira/browse/ARROW-7086
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result-returning function, then using {{Result::Value}} in the 
Status-returning function. In cases where this is inconvenient, it'd be 
helpful to have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return ResultInvoke(DoSafeAdd, a, b);
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7081) [R] Add methods for introspecting parquet files

2019-11-06 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7081:
---

 Summary: [R] Add methods for introspecting parquet files
 Key: ARROW-7081
 URL: https://issues.apache.org/jira/browse/ARROW-7081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Neal Richardson
 Fix For: 1.0.0


Parquet files are very opaque, and it'd be handy to have an easy way to 
introspect them. Functions exist for loading them as a table, but information 
about row group level metadata and data page compression is hidden. Ideally, 
every structure from https://github.com/apache/parquet-format/#file-format 
could be examined in this fashion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7069) [

2019-11-05 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7069:
---

 Summary: [
 Key: ARROW-7069
 URL: https://issues.apache.org/jira/browse/ARROW-7069
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ben Kietzman






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7058) [C++] FileSystemDataSourceDiscovery should apply partition schemes relative to the base_dir of its selector

2019-11-04 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7058:
---

 Summary: [C++] FileSystemDataSourceDiscovery should apply 
partition schemes relative to the base_dir of its selector
 Key: ARROW-7058
 URL: https://issues.apache.org/jira/browse/ARROW-7058
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Currently, the absolute path of each fragment is used, which leads to 
erroneous parse errors unless the Discovery's base directory also happens to 
be the root of a FileSystem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7051) [C++] Improve MakeArrayOfNull to support creation of multiple arrays

2019-11-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7051:
---

 Summary: [C++] Improve MakeArrayOfNull to support creation of 
multiple arrays
 Key: ARROW-7051
 URL: https://issues.apache.org/jira/browse/ARROW-7051
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.14.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


MakeArrayOfNull reuses a single buffer of {{0}} for all buffers in the array 
it creates. It could be extended to reuse that same buffer for all buffers in 
multiple arrays. This optimization will make RecordBatchProjector and 
ConcatenateTablesWithPromotion more memory efficient.
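
A sketch of the extended helper (name and signature assumed):

{code}
// all returned arrays share one zeroed buffer, sized for the widest type
Result<std::vector<std::shared_ptr<Array>>> MakeArraysOfNull(
    const std::vector<std::shared_ptr<DataType>>& types, int64_t length);
{code}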



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6967) [C++] Add filter expressions for IN, COALESCE, and DROP_NULL

2019-10-22 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6967:
---

 Summary: [C++] Add filter expressions for IN, COALESCE, and 
DROP_NULL
 Key: ARROW-6967
 URL: https://issues.apache.org/jira/browse/ARROW-6967
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Implement filter expressions for {{IN, COALESCE, DROP_NULL}}

{{IN}} should be backed in TreeEvaluator by the IsIn kernel. The others should 
be initially implemented in terms of logical and filter kernels until a 
specialized kernel is added for those.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6959) [C++] Clarify what signatures are preferred for compute kernels

2019-10-21 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6959:
---

 Summary: [C++] Clarify what signatures are preferred for compute 
kernels
 Key: ARROW-6959
 URL: https://issues.apache.org/jira/browse/ARROW-6959
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
 Fix For: 1.0.0


Many of the compute kernels feature functions which accept only array inputs in 
addition to functions which accept Datums. The former seems implicitly like a 
convenience wrapper around the latter but I don't think this is explicit 
anywhere. Is there a preferred overload for bindings to use? Is it preferred 
that C++ implementers provide convenience wrappers for different permutations 
of argument type? (for example, Filter now provides an overload for record 
batch input as well as array input)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6847) [C++] Add a range_expression interface to Iterator<>

2019-10-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6847:
---

 Summary: [C++] Add a range_expression interface to Iterator<>
 Key: ARROW-6847
 URL: https://issues.apache.org/jira/browse/ARROW-6847
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


Iterator provides the Visit() method for visiting each element, but idiomatic 
C++ uses a range-based for loop.
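
The desired usage, sketched under the assumption that {{Iterator<T>}} gains 
{{begin()}}/{{end()}} and yields {{Result}}-wrapped elements:

{code}
for (auto maybe_batch : batch_iterator) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> batch, maybe_batch);
  // ... use batch ...
}
{code}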



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6781) [C++] Improve and consolidate ARROW_CHECK, DCHECK macros

2019-10-03 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6781:
---

 Summary: [C++] Improve and consolidate ARROW_CHECK, DCHECK macros
 Key: ARROW-6781
 URL: https://issues.apache.org/jira/browse/ARROW-6781
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


Currently we have multiple macros like {{DCHECK_EQ}} and {{DCHECK_LT}} which 
check various comparisons but don't report anything about their operands. 
Furthermore, the "stream to assertion" pattern for appending extra info has 
proven fragile. I propose a new unified macro which can capture operands of 
comparisons and report them:

{code:cpp}
  int three = 3;
  int five = 5;
  DCHECK(three == five, "extra: ", 1, 2, five);
{code}

Results in check failure messages like:
{code}
F1003 11:12:46.174767  4166 logging_test.cc:141]  Check failed: three == five
  LHS: 3
  RHS: 5
extra: 125
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2019-10-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-6772:
---

 Summary: [C++] Add operator== for interfaces with an Equals() 
method
 Key: ARROW-6772
 URL: https://issues.apache.org/jira/browse/ARROW-6772
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
addition of overloaded equality operators will allow this to be written 
{{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GTEST usage and will 
allow more informative assertion failure messages.
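
A sketch of the overloads, which simply forward to {{Equals()}}:

{code}
bool operator==(const Schema& left, const Schema& right) {
  return left.Equals(right);
}
bool operator!=(const Schema& left, const Schema& right) {
  return !left.Equals(right);
}
{code}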



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS] Result vs Status

2019-10-02 Thread Ben Kietzman
The C++ library has two classes which serve mostly the same purpose. Both
Status and Result<> are used to express a recoverable error in lieu of
exceptions. Result<> is slightly more ergonomic in C++, but our binding
infrastructures assume Status based APIs.

From the discussion in the sync call, it seems reasonable to require that
public APIs which are likely to be directly wrapped in a binding should not
use Result<> to the exclusion of Status. An equivalent Status API should
always be provided for ease of binding.
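
For example (a sketch, not a prescription), a reader would expose both forms,
one implemented in terms of the other:

  Result<std::shared_ptr<Buffer>> ReadAt(int64_t position, int64_t nbytes);
  Status ReadAt(int64_t position, int64_t nbytes,
                std::shared_ptr<Buffer>* out);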


Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Ben Kietzman
FlatCC seems germane: https://github.com/dvidelabs/flatcc

It compiles flatbuffer schemas down to (idiomatic?) C

Perhaps the schema and batch serialization problems should be solved by
storing everything in the flatbuffer format.
Then the results of running flatcc plus a few simple helpers can be checked
in to provide an accessible C API.
With respect to lifetime, Antoine has already done good work on specifying
a move only contract which could probably be adapted.


On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou  wrote:

>
> One basic design point is to allow exchanging Arrow data with no
> mandatory dependency (the exception is JSON and base64 if you want to
> act on metadata - but that's highly optional, and those are extremely
> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> not only it introduces a library, but it requires the use of a compiler
> to produce generated code.  It also requires familiarizing with, well,
> Flatbuffers :-)
>
> We can of course discuss this and feel it's not a problem.
>
> Regards
>
> Antoine.
>
>
> Le 29/09/2019 à 19:47, Wes McKinney a écrit :
> > There are two pieces of serialized data needed to communicate a record
> > batch from one library to another
> >
> > * Serialized schema (i.e. what's in Schema.fbs)
> > * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> >
> > You _do_ need to use a Flatbuffers library to fully create these
> > message types to interact with any existing record batch disassembly /
> > reassembly.
> >
> > I think I'm most concerned about having a new way to serialize
> > schemas. We already have JSON-based schema serialization for
> > integration test purposes, so one possibility is to standardize that
> > and make it a more formalized part of the project specification.
> >
> > As far as a C protocol, I don't see an especial downside to using the
> > Flatbuffers schema to communicate types.
> >
> > Another thought is to not deviate from the flattened
> > Flatbuffers-styled representation but to translate the Flatbuffers
> > types into C types: namely a C struct-based version of the
> > "RecordBatch" message.
> >
> > Independent of the means to communicate the two pieces of serialized
> > information above (respectively: schemas and record batch field memory
> > addresses and field lengths), having a C-based FFI where project can
> > drop in a header file containing the ABI they are supposed to
> > implement, that seems pretty reasonable to me.
> >
> > If we don't define a standardized in-memory FFI (whether it uses the
> > Flatbuffers objects as inputs/outputs or not) then downstream project
> > will devise their own, and that will cause issues long term.
> >
> > On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> >>> * No dependency on Flatbuffers.
> >>> * No buffer reassembly (data is already exposed in logical Arrow
> format).
> >>> * Zero-copy by design.
> >>> * Easy to reimplement from scratch.
> >>>
> >>> I don't see how the flatbuffer pattern for data headers doesn't
> accomplish
> >>> all of these things. At its definition, is a very simple
> representation of
> >>> data that could be worked with independently of the flatbuffers
> codebase.
> >>> It was designed so systems could map directly into that memory without
> >>> interacting with a flatbuffers library.
> >>>
> >>> Specifically the following three structures were designed to already
> allow
> >>> what I think this proposal is trying to recreate. All three are very
> simple
> >>> to construct in a direct, non-flatbuffer dependent read/write pattern.
> >>
> >> Are they?  Personally, I wouldn't know how to do that.  I don't know
> >> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
> >> it be? if it's portable across different platforms, then it's probably
> >> not compatible with any particular platform's C ABI, or only as a
> >> coincidence), how I'm supposed to make use of the "offset" field, or
> >> what the lifetime / ownership of all this data is.
> >>
> >> I may be missing something, but if the answer is that it's easy to
> >> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> >> project's source code, I'm a bit skeptical.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> struct FieldNode {
> >>>   length: long;
> >>>   null_count: long;
> >>> }
> >>>
> >>> struct Buffer {
> >>>   offset: long;
> >>>   length: long;
> >>> }
> >>>
> >>> table RecordBatch {
> >>>   length: long;
> >>>   nodes: [FieldNode];
> >>>   buffers: [Buffer];
> >>> }
> >>>
> >>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau 
> wrote:
> >>>
>  I'm not clear on why we need to introduce something beyond what
>  flatbuffers already provides. Can someone explain that to me? I'm not
>  really a fan of introducing a second representation of the same data
> (as I
>  understand it).
> 
>  On Thu, Sep 19, 2019 at 1:15 
