Yeah, I was thinking to wait for the actual formats to become "official",
but I agree the names won't change.
On Thu, Dec 5, 2019 at 9:55 PM Sutou Kouhei wrote:
> I think we don't need to wait for 1.0.0, because we
> don't change existing format names ("streaming" and "file"
> aren't changed).
There was a discussion/proposal a while ago on the spark mailing list to
use the Arrow memory format natively within spark [1], but the proposal was
scaled back to exposing vectorized APIs only IIUC.
Looking quickly at the links Wes provided, one option for potential
speed-up could be a dynamicall
>
> With these two together, it would seem not too difficult to create a text
> representation for Arrow schemas that (at some point) has some
> compatibility guarantees, but maybe I'm missing something?
I think the main risk is if somehow flatbuffers JSON parsing doesn't handle
backward compatibility.
I've opened a JIRA to track this issue. Thanks for catching this!
https://issues.apache.org/jira/browse/ARROW-7378
On Wed, Dec 11, 2019 at 11:48 PM Ravindra Pindikura
wrote:
>
> I found that there is something in this PR (the last change to
> llvm_generator.cc) that broke the auto vectorization
Pindikura Ravindra created ARROW-7378:
-
Summary: loop vectorization broken in gandiva
Key: ARROW-7378
URL: https://issues.apache.org/jira/browse/ARROW-7378
Project: Apache Arrow
Issue Type
Hi William,
ORC is part of the C++ Datasets grand vision: see
https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit#heading=h.22aikbvt54fv.
That said, I don't think anyone in the Arrow community is currently
prioritizing work on ORC, and we'd welcome contributions in this area.
Francois Saint-Jacques created ARROW-7377:
-
Summary: [C++][Dataset] Simplify parquet column projection
Key: ARROW-7377
URL: https://issues.apache.org/jira/browse/ARROW-7377
Project: Apache Arrow
Hi there,
Not sure if this is the appropriate place, but I did some searching
and could not find anything regarding support for ORC datasets. I see
that Parquet datasets are supported (where a dataset can contain multiple
Parquet files), but I do not see this for ORC (only the ability to
This works very well and is much simpler. Thank you for the workaround.
On Wed, Dec 11, 2019 at 10:29 AM Antoine Pitrou wrote:
>
> As a workaround, you can use the following hack:
>
> >>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
>
> >>> arr
> 123 nulls
> >>> arr.cast(pa.int32())
As a workaround, you can use the following hack:
>>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
>>> arr
123 nulls
>>> arr.cast(pa.int32())
[
  null,
  null,
  null,
  ...
  null,
  null
]
Thanks Ted, I tried using numpy similarly to your approach and saw the same
performance. For the time being I am using a dictionary mapping each data type
to a pre-allocated big empty array, which should work for me.
On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou wrote:
>
> There's a C++ facility to do this, but it's not exposed in Python yet.
Pierre Belzile created ARROW-7376:
-
Summary: parquet NaN/null double statistics can result in endless
loop
Key: ARROW-7376
URL: https://issues.apache.org/jira/browse/ARROW-7376
Project: Apache Arrow
Antoine Pitrou created ARROW-7375:
-
Summary: [Python] Expose C++ MakeArrayOfNull
Key: ARROW-7375
URL: https://issues.apache.org/jira/browse/ARROW-7375
Project: Apache Arrow
Issue Type: Improvement
There's a C++ facility to do this, but it's not exposed in Python yet.
I opened ARROW-7375 for it.
Regards
Antoine.
Le 11/12/2019 à 19:36, Weston Pace a écrit :
> I'm trying to combine multiple parquet files. They were produced at
> different points in time and have different columns. For example, one has
> columns A, B, C.
Congrats Joris.
On Tue, Dec 10, 2019 at 7:43 AM Rok Mihevc wrote:
> Congrats Joris!
>
>
> On Tue, Dec 10, 2019 at 3:38 PM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Thanks all!
> >
> > On Mon, 9 Dec 2019 at 20:38, Krisztián Szűcs
> > wrote:
> >
> > > Congrats Joris!
>
Antoine Pitrou created ARROW-7374:
-
Summary: [Dev] [C++] cuda-cpp docker image fails compiling Arrow
Key: ARROW-7374
URL: https://issues.apache.org/jira/browse/ARROW-7374
Project: Apache Arrow
Not sure if this is any better, but I have an open PR right now in Iceberg,
where we are doing something similar:
https://github.com/apache/incubator-iceberg/pull/544/commits/28166fd3f0e3a24863048a2721f1ae69f243e2af#diff-51d6edf951c105e1e62a3f1e8b4640aaR319-R341
@staticmethod
def create_null_column
I'm trying to combine multiple parquet files. They were produced at
different points in time and have different columns. For example, one has
columns A, B, C. Two has columns B, C, D. Three has columns C, D, E. I
want to concatenate all three into one table with columns A, B, C, D, E.
To do t
I found that there is something in this PR (the last change to
llvm_generator.cc) that broke the auto vectorization. I'll debug some more
tomorrow.
https://github.com/apache/arrow/commit/165b02d2358e5c8c2039cf626ac7326d82e3ca90
If I undo this one patch, I can see the vectorization happen with Yi
Missing [1] link.
[1] https://godbolt.org/z/S8tixP
On Wed, Dec 11, 2019 at 12:58 PM Francois Saint-Jacques
wrote:
>
> So, llvm _can_ auto-vectorize, I was just missing the `-mtriple`
> option [1]. That still requires hoisting the buffer juggling.
>
> François
>
> On Wed, Dec 11, 2019 at 12:56 P
So, llvm _can_ auto-vectorize, I was just missing the `-mtriple`
option [1]. That still requires hoisting the buffer juggling.
François
On Wed, Dec 11, 2019 at 12:56 PM Ravindra Pindikura wrote:
>
> I'll debug this and get back to you - I suspect some recent change broke
> this functionality.
>
I'll debug this and get back to you - I suspect some recent change broke
this functionality.
On Wed, Dec 11, 2019 at 10:06 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:
> It seems that LLVM can't auto vectorize. I don't have a debug build,
> so I can't get the `-debug-only` information from llvm-opt/opt about
> why it can't vectorize.
Attendees:
- Antoine Pitrou, Ursa Labs/RStudio
- Francois Saint-Jacques, Ursa Labs/RStudio
- Ravindra Pindikura, Dremio
- Neville Dipale
- Rok Mihevc
Subjects:
- Arrow 1.0 release:
- Neville has been working on the Rust IPC bindings
(https://github.com/apache/arrow/pull/6013)
- Antoine is worki
It seems that LLVM can't auto vectorize. I don't have a debug build,
so I can't get the `-debug-only` information from llvm-opt/opt about
why it can't vectorize. The buffer address mangling should be hoisted
out of the loop (still doesn't enable auto vectorization) [1]. The
buffer juggling should b
Ben Kietzman created ARROW-7373:
---
Summary: [C++][Dataset] Remove FileSource
Key: ARROW-7373
URL: https://issues.apache.org/jira/browse/ARROW-7373
Project: Apache Arrow
Issue Type: Improvement
Antoine Pitrou created ARROW-7372:
-
Summary: [C++] Allow creating dictionary array from simple JSON
Key: ARROW-7372
URL: https://issues.apache.org/jira/browse/ARROW-7372
Project: Apache Arrow
Arrow Build Report for Job nightly-2019-12-11-0
All tasks:
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-11-0
Failed Tasks:
- conda-osx-clang-py27:
URL:
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-12-11-0-azure-conda-osx-clang-py27
- cond
Kenta Murata created ARROW-7371:
---
Summary: [GLib] Add Datasets binding
Key: ARROW-7371
URL: https://issues.apache.org/jira/browse/ARROW-7371
Project: Apache Arrow
Issue Type: New Feature
Hi,
I'm trying to figure out how Gandiva works by tracing the unit test
TestSimpleArithmetic [1].
I ran into a problem with the Gandiva IR generator and optimizer, and would
like to seek help from the community.
I'm focusing on the case "b+1", which adds 1 to each element of an int32
vector. I see there's a