Re: MIME type

2019-12-11 Thread Micah Kornfield
Yeah, I was thinking to wait for the actual formats to become "official", but I agree the names won't change. On Thu, Dec 5, 2019 at 9:55 PM Sutou Kouhei wrote: > I think that we don't need to wait for 1.0.0. Because we > don't change existing format names. ("streaming" and "file" > aren't chang

Re: Java - Spark dataframe to Arrow format

2019-12-11 Thread Micah Kornfield
There was a discussion/proposal a while ago on the spark mailing list to use the Arrow memory format natively within spark [1], but the proposal was scaled back to exposing vectorized APIs only IIUC. Looking quickly at the links Wes provided, one option for potential speed-up could be a dynamicall

Re: Human-readable version of Arrow Schema?

2019-12-11 Thread Micah Kornfield
> > With these two together, it would seem not too difficult to create a text > representation for Arrow schemas that (at some point) has some > compatibility guarantees, but maybe I'm missing something? I think the main risk is if somehow flatbuffers JSON parsing doesn't handle backward compatib

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Ravindra Pindikura
I've opened a Jira to track this issue. Thanks for catching this! https://issues.apache.org/jira/browse/ARROW-7378 On Wed, Dec 11, 2019 at 11:48 PM Ravindra Pindikura wrote: > > I found that there is something in this PR (the last change to > llvm_generator.cc) that broke the auto vectorization

[jira] [Created] (ARROW-7378) loop vectorization broken in gandiva

2019-12-11 Thread Pindikura Ravindra (Jira)
Pindikura Ravindra created ARROW-7378: - Summary: loop vectorization broken in gandiva Key: ARROW-7378 URL: https://issues.apache.org/jira/browse/ARROW-7378 Project: Apache Arrow Issue Typ

Re: Planned Support for ORC Dataset?

2019-12-11 Thread Neal Richardson
Hi William, ORC is part of the C++ Datasets grand vision: see https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit#heading=h.22aikbvt54fv. That said, I don't think anyone in the Arrow community is currently prioritizing work on ORC, and we'd welcome contributions in

[jira] [Created] (ARROW-7377) [C++][Dataset] Simplify parquet column projection

2019-12-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7377: - Summary: [C++][Dataset] Simplify parquet column projection Key: ARROW-7377 URL: https://issues.apache.org/jira/browse/ARROW-7377 Project: Apache Arro

Planned Support for ORC Dataset?

2019-12-11 Thread William Callaghan
Hi there, Not sure if this is the appropriate place, but I did some searching and could not find anything with regard to supporting ORC datasets. I see that Parquet datasets are supported (where a dataset could contain multiple Parquet files), but I do not see this for ORC (only the ability to

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
This works very well and is much simpler. Thank you for the workaround. On Wed, Dec 11, 2019 at 10:29 AM Antoine Pitrou wrote: > > As a workaround, you can use the following hack: > > >>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")]) > > > >>> arr > > > > 123 nulls > >>> arr

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Antoine Pitrou
As a workaround, you can use the following hack: >>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")]) >>> arr 123 nulls >>> arr.cast(pa.int32()) [ null, null, null, null, null, null, null, null, null, null, ... null, null, null, null, null,

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
Thanks. Ted, I tried using numpy similar to your approach and had the same performance. For the time being I am using a dictionary that maps each data type to a large pre-allocated empty array, which should work for me. On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou wrote: > > There's a C++ fac

[jira] [Created] (ARROW-7376) parquet NaN/null double statistics can result in endless loop

2019-12-11 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-7376: - Summary: parquet NaN/null double statistics can result in endless loop Key: ARROW-7376 URL: https://issues.apache.org/jira/browse/ARROW-7376 Project: Apache Arrow

[jira] [Created] (ARROW-7375) [Python] Expose C++ MakeArrayOfNull

2019-12-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7375: - Summary: [Python] Expose C++ MakeArrayOfNull Key: ARROW-7375 URL: https://issues.apache.org/jira/browse/ARROW-7375 Project: Apache Arrow Issue Type: Improv

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Antoine Pitrou
There's a C++ facility to do this, but it's not exposed in Python yet. I opened ARROW-7375 for it. Regards Antoine. Le 11/12/2019 à 19:36, Weston Pace a écrit : > I'm trying to combine multiple parquet files. They were produced at > different points in time and have different columns. For e

Re: [ANNOUNCE] New Arrow committer: Joris van den Bossche

2019-12-11 Thread Micah Kornfield
Congrats Joris. On Tue, Dec 10, 2019 at 7:43 AM Rok Mihevc wrote: > Congrats Joris! > > > On Tue, Dec 10, 2019 at 3:38 PM Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > > > Thanks all! > > > > On Mon, 9 Dec 2019 at 20:38, Krisztián Szűcs > > wrote: > > > > > Congrats Joris! >

[jira] [Created] (ARROW-7374) [Dev] [C++] cuda-cpp docker image fails compiling Arrow

2019-12-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7374: - Summary: [Dev] [C++] cuda-cpp docker image fails compiling Arrow Key: ARROW-7374 URL: https://issues.apache.org/jira/browse/ARROW-7374 Project: Apache Arrow

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Ted Gooch
Not sure if this is any better, but I have an open PR right now in Iceberg, where we are doing something similar: https://github.com/apache/incubator-iceberg/pull/544/commits/28166fd3f0e3a24863048a2721f1ae69f243e2af#diff-51d6edf951c105e1e62a3f1e8b4640aaR319-R341 @staticmethod def create_null_colum

Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
I'm trying to combine multiple parquet files. They were produced at different points in time and have different columns. For example, one has columns A, B, C. Two has columns B, C, D. Three has columns C, D, E. I want to concatenate all three into one table with columns A, B, C, D, E. To do t
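
The column layout described in this message (files with columns A, B, C; then B, C, D; then C, D, E) can be illustrated with a minimal pure-Python sketch. It does not use pyarrow: tables are modelled as plain dicts of lists, and the helper name `concat_with_missing_columns` is hypothetical, introduced only for this illustration. It shows the union-of-columns idea, with missing columns filled by nulls (`None`), not how Arrow itself implements it.

```python
from typing import Dict, List, Optional

# A "table" here is just a mapping of column name -> list of values.
# This stands in for a real Arrow table; it is an assumption of the sketch.
Table = Dict[str, List[Optional[int]]]


def concat_with_missing_columns(tables: List[Table]) -> Table:
    """Concatenate tables row-wise, filling absent columns with None."""
    # Collect the union of column names, preserving first-seen order.
    columns: List[str] = []
    for table in tables:
        for name in table:
            if name not in columns:
                columns.append(name)

    result: Table = {name: [] for name in columns}
    for table in tables:
        # Row count of this table (all of its columns have equal length).
        num_rows = len(next(iter(table.values()))) if table else 0
        for name in columns:
            if name in table:
                result[name].extend(table[name])
            else:
                # Column absent from this table: pad with nulls.
                result[name].extend([None] * num_rows)
    return result


one = {"A": [1], "B": [2], "C": [3]}
two = {"B": [4], "C": [5], "D": [6]}
three = {"C": [7], "D": [8], "E": [9]}

combined = concat_with_missing_columns([one, two, three])
```

The padding step is where the question in this thread arises: with real Arrow data, each padded column needs an all-null array of the right length and type, which is the allocation being asked about.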

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Ravindra Pindikura
I found that there is something in this PR (the last change to llvm_generator.cc) that broke the auto vectorization. I'll debug some more tomorrow. https://github.com/apache/arrow/commit/165b02d2358e5c8c2039cf626ac7326d82e3ca90 If I undo this one patch, I can see the vectorization happen with Yi

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
Missing [1] link. [1] https://godbolt.org/z/S8tixP On Wed, Dec 11, 2019 at 12:58 PM Francois Saint-Jacques wrote: > > So, llvm _can_ auto-vectorize, I was just missing the `-mtriple` > option [1]. That still requires hoisting the buffer juggling. > > François > > On Wed, Dec 11, 2019 at 12:56 P

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
So, LLVM _can_ auto-vectorize; I was just missing the `-mtriple` option [1]. That still requires hoisting the buffer juggling. François On Wed, Dec 11, 2019 at 12:56 PM Ravindra Pindikura wrote: > > I'll debug this and get back to you - I suspect some recent change broke > this functionality. >

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Ravindra Pindikura
I'll debug this and get back to you - I suspect some recent change broke this functionality. On Wed, Dec 11, 2019 at 10:06 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > It seems that LLVM can't auto vectorize. I don't have a debug build, > so I can't get the `-debug-only` informat

Re: Arrow sync call December 11 at 12:00 US/Eastern, 17:00 UTC

2019-12-11 Thread Francois Saint-Jacques
Attendees: - Antoine Pitrou, Ursa Labs/RStudio - Francois Saint-Jacques, Ursa Labs/RStudio - Ravindra Pindikura, Dremio - Neville Dipale - Rok Mihevc Subjects: - Arrow 1.0 release: - Neville has been working on the Rust IPC bindings (https://github.com/apache/arrow/pull/6013) - Antoine is worki

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
It seems that LLVM can't auto vectorize. I don't have a debug build, so I can't get the `-debug-only` information from llvm-opt/opt about why it can't vectorize. The buffer address mangling should be hoisted out of the loop (still doesn't enable auto vectorization) [1]. The buffer juggling should b

[jira] [Created] (ARROW-7373) [C++][Dataset] Remove FileSource

2019-12-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7373: --- Summary: [C++][Dataset] Remove FileSource Key: ARROW-7373 URL: https://issues.apache.org/jira/browse/ARROW-7373 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-7372) [C++] Allow creating dictionary array from simple JSON

2019-12-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7372: - Summary: [C++] Allow creating dictionary array from simple JSON Key: ARROW-7372 URL: https://issues.apache.org/jira/browse/ARROW-7372 Project: Apache Arrow

[NIGHTLY] Arrow Build Report for Job nightly-2019-12-11-0

2019-12-11 Thread Crossbow
Arrow Build Report for Job nightly-2019-12-11-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-11-0 Failed Tasks: - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-11-0-azure-conda-osx-clang-py27 - cond

[jira] [Created] (ARROW-7371) [GLib] Add Datasets binding

2019-12-11 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7371: --- Summary: [GLib] Add Datasets binding Key: ARROW-7371 URL: https://issues.apache.org/jira/browse/ARROW-7371 Project: Apache Arrow Issue Type: New Feature

[Gandiva] question about IR optimization

2019-12-11 Thread Yibo Cai
Hi, I'm trying to figure out how Gandiva works by tracing the unit test TestSimpleArichmetic[1]. I ran into a problem with the Gandiva IR generator and optimizer and would like to ask the community for help. I'm focusing on the case "b+1", which adds 1 to each element of an int32 vector. I see there's a