Compute kernels and Gandiva operators

2019-02-12 Thread Ravindra Pindikura
Hi,

I was looking at the recent checkin for arrow kernels, and started to think of 
how they would work alongside Gandiva.

Here are my thoughts :

1. Gandiva already has two high-level operators, namely project and filter, with 
runtime code generation

* It already supports hundreds of functions (e.g. a+b, a > b), which can be 
combined into expressions (e.g. a+b > c && a+b < d) for each of the operators, 
and we'll likely continue to add more of them.
* It works on one record batch at a time - it consumes a record batch and 
produces a record batch.
* The operators can be inter-linked (e.g. project -> filter -> project) to build 
a pipeline.
* We may build additional operators in the future which could benefit from code 
generation (e.g. Impala uses code generation when parsing Avro files).

2. Arrow Kernels 

a. Support project/filter operators

Useful for functions where there is no benefit from code generation, or where 
code generation can be skipped (eager evaluation).

b. Support additional operators like aggregates


How do we combine and link the Gandiva operators and the kernels? For example, 
it would be nice to have a pipeline with scan (read from source), project 
(expression on columns), filter (extract rows), and aggregate (sum over the 
extracted column).

To do this, I think we would need to be able to build a pipeline of high-level 
operators that move data along one record batch at a time (a rough sketch 
follows the list):
- a source operator which only produces record batches (maybe a csv reader)
- intermediate operators that can consume/produce record batches (maybe the 
Gandiva project operator)
- terminal operators that emit the final output (from the end of the pipeline) 
when there is nothing left to consume (maybe SumKernel)
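
To make that concrete, here's a rough Python sketch of the kind of pipeline I 
have in mind (purely illustrative - the operator names and the use of pyarrow 
record batches are my own stand-ins, not an existing Gandiva/kernel API):

import pyarrow as pa

class Project:
    """Intermediate operator: consumes record batches, produces record batches."""
    def __init__(self, source, fn):
        self.source, self.fn = source, fn
    def __iter__(self):
        for batch in self.source:
            yield self.fn(batch)

class Sum:
    """Terminal operator: emits one result when there is nothing left to consume."""
    def __init__(self, source, column):
        self.source, self.column = source, column
    def run(self):
        total = 0
        for batch in self.source:
            idx = batch.schema.get_field_index(self.column)
            total += sum(batch.column(idx).to_pylist())
        return total

def source():
    # Stand-in for a csv-reader source operator: it only produces batches.
    yield pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array([10, 20, 30])], ['a', 'b'])

def a_plus_b(batch):
    # Stand-in for a codegen'd project expression: a + b.
    a, b = batch.column(0).to_pylist(), batch.column(1).to_pylist()
    return pa.RecordBatch.from_arrays(
        [pa.array([x + y for x, y in zip(a, b)])], ['a_plus_b'])

print(Sum(Project(source(), a_plus_b), 'a_plus_b').run())  # 66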

Are we thinking along these lines ?

Thanks & regards,
Ravindra.

[jira] [Created] (ARROW-4559) pyarrow can't read/write filenames with special characters

2019-02-12 Thread Jean-Christophe Petkovich (JIRA)
Jean-Christophe Petkovich created ARROW-4559:


 Summary: pyarrow can't read/write filenames with special characters
 Key: ARROW-4559
 URL: https://issues.apache.org/jira/browse/ARROW-4559
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
 Environment: $ python3 --version
Python 3.6.6
$ pip3 freeze | grep -Ei 'pyarrow|pandas'
pandas==0.24.1
pyarrow==0.12.0

Reporter: Jean-Christophe Petkovich


When writing or reading files to or from paths that have special characters in 
them (e.g., "#"), pyarrow returns an error:

{code:python}
OSError: Passed non-file path...
{code}

This is a consequence of the following line:
https://github.com/apache/arrow/blob/master/python/pyarrow/filesystem.py#L416

File paths will be parsed as URIs, which gives strange results for file paths 
like "bad # actor.parquet":

ParseResult(scheme='', netloc='', path='/tmp/bad ', params='', query='', 
fragment='actor.parquet')
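
The split is easy to see with the standard library alone (a minimal 
illustration of the URI parsing, independent of pyarrow):

{code:python}
from urllib.parse import urlparse

# '#' begins a URI fragment, so the path is truncated at the first '#'.
result = urlparse("/tmp/bad # actor.parquet")
print(result.path)      # '/tmp/bad '
print(result.fragment)  # everything after the '#'
{code}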

This is trivial to reproduce with the following code, which uses the 
`DataFrame.to_parquet` and `pd.read_parquet` interfaces:

{code:python}
import pandas as pd
x = pd.DataFrame({"a": [1,2,3]})
x.to_parquet("bad # actor.parquet")
pd.read_parquet("bad # actor.parquet")
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4558) [C++][Flight] Avoid undefined behavior with gRPC memory optimizations

2019-02-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4558:
---

 Summary: [C++][Flight] Avoid undefined behavior with gRPC memory 
optimizations
 Key: ARROW-4558
 URL: https://issues.apache.org/jira/browse/ARROW-4558
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.13.0


Because the {{Write}} function and others on {{ServerWriter}} and 
{{ClientReader}} are declared virtual, some compilers may not behave in the way 
we want. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4557) [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` method

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4557:
--

 Summary: [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` 
method
 Key: ARROW-4557
 URL: https://issues.apache.org/jira/browse/ARROW-4557
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.5.0


Presently Table, Schema, and RecordBatch have basic {{select(...colNames)}} 
implementations. Having an easy {{selectAt(...colIndices)}} impl would be a 
nice complement, especially when there are duplicate column names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4556) [Rust] Preserve order of JSON inferred schema

2019-02-12 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4556:
-

 Summary: [Rust] Preserve order of JSON inferred schema
 Key: ARROW-4556
 URL: https://issues.apache.org/jira/browse/ARROW-4556
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


serde_json has the ability to preserve order of JSON records read. This feature 
might be necessary to ensure that schema inference returns a consistent order 
of fields each time.

I'd like to add it separately as I'd also need to update JSON tests in 
datatypes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4555) [JS] Add high-level Table and Column creation methods

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4555:
--

 Summary: [JS] Add high-level Table and Column creation methods
 Key: ARROW-4555
 URL: https://issues.apache.org/jira/browse/ARROW-4555
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


It'd be great to have a few high-level functions that implicitly create the 
Schema, RecordBatches, etc. from a Table and a list of Columns. For example:
{code:actionscript}
const table = Table.new(
  Column.new('foo', ...),
  Column.new('bar', ...)
);
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4554) [JS] Implement logic for combining Vectors with different lengths/chunksizes

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4554:
--

 Summary: [JS] Implement logic for combining Vectors with different 
lengths/chunksizes
 Key: ARROW-4554
 URL: https://issues.apache.org/jira/browse/ARROW-4554
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


We should add logic to combine and possibly slice/re-chunk and uniformly 
partition chunks into separate RecordBatches. This will make it easier to 
create Tables or RecordBatches from Vectors of different lengths. This is also 
necessary for {{Table#assign()}}. PR incoming.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4553) [JS] Implement Schema/Field/DataType comparators

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4553:
--

 Summary: [JS] Implement Schema/Field/DataType comparators
 Key: ARROW-4553
 URL: https://issues.apache.org/jira/browse/ARROW-4553
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


Some basic type comparison logic is necessary for {{Table#assign()}}. PR 
incoming.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4552) [JS] Table and Schema assign implementations

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4552:
--

 Summary: [JS] Table and Schema assign implementations
 Key: ARROW-4552
 URL: https://issues.apache.org/jira/browse/ARROW-4552
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor


It'd be really handy to have basic {{assign}} methods on Table and 
Schema. I've extracted and cleaned up some internal helper methods I have that 
do this. PR incoming.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4551) [JS] Investigate using Symbols to access Row columns by index

2019-02-12 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4551:


 Summary: [JS] Investigate using Symbols to access Row columns by 
index
 Key: ARROW-4551
 URL: https://issues.apache.org/jira/browse/ARROW-4551
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


Can we use row[Symbol.for(0)] instead of row[0] in order to avoid collisions? 
What would the performance impact be?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4550) [JS] Fix AMD pattern

2019-02-12 Thread Dominik Moritz (JIRA)
Dominik Moritz created ARROW-4550:
-

 Summary: [JS] Fix AMD pattern
 Key: ARROW-4550
 URL: https://issues.apache.org/jira/browse/ARROW-4550
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Dominik Moritz






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Rust] Rust 0.13.0 release

2019-02-12 Thread Chao Sun
I’m also interested in the Parquet/Arrow integration and may help there.
This is however a relatively large feature and I’m not sure if it can be done
in 0.13.

Another area I’d like to work on is high-level Parquet writer support. This
issue has been discussed several times in the past. People should not need
to specify definition & repetition levels in order to write data in Parquet
format.

Chao



On Wed, Feb 13, 2019 at 10:24 AM paddy horan  wrote:

> Hi All,
>
> The focus for me for 0.13.0 is SIMD.  I would like to port all the "ops"
> in "array_ops" to the new "compute" module and leverage SIMD for them all.
> I have most of this done in various forks.
>
> Past 0.13.0 I would really like to work toward getting Rust running in the
> integration tests.  The thing I am most excited about regarding Arrow is
> the concept of defining computational libraries in say Rust and being able
> to use them from any implementation, pyarrow probably for me.  This all
> starts and ends with the integration tests.
>
> Also, Gandiva is fascinating I would love to have robust support for this
> in Rust (via bindings)...
>
> Regards,
> P
>
>
> 
> From: Neville Dipale 
> Sent: Tuesday, February 12, 2019 11:33 AM
> To: dev@arrow.apache.org
> Subject: Re: [Rust] Rust 0.13.0 release
>
> Thanks for bringing this up Andy.
>
> I'm unemployed/on recovery leave, so I've had some surplus time to work on
> Rust.
>
> There's a lot of features that I've wanted to work on, some which I've
> spent some time attempting, but struggled with. A few block additional work
> that I could contribute.
>
> In 0.13.0 and the release thereafter: I'd like to see:
>
> Date/time support. I've spent a lot of time trying to implement this, but I
> get the feeling that my Rust isn't good enough yet to pull this together.
>
> More IO support.
> I'm working on JSON reader, and want to work on JSON and CSV (continuing
> where you left off) writers after this.
> With date/time support, I can also work on date/time parsing so we can have
> these in CSV and JSON.
> Parquet support isn't on my radar at the moment. JSON and CSV are more
> commonly used, so I'm hoping that with concrete support for these, more
> people using Rust can choose to integrate Arrow. That could bring us more
> hands to help.
>
> Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried
> working on it but failed. Related to this would be array chunking.
> I need these in order to be able to operate on "Tables" like CPP, Python
> and others. I've got ChunkedArray, Column and Table roughly implemented in
> my fork, but without zero-copy slicing, I can't upstream them.
>
> I've made good progress on scalar and array operations. I have trig
> functions, some string operators and other functions that one can run on a
> Spark-esque dataframe.
> These will fit in well with DataFusion's SQL operations, but from a
> decision-perspective, I think it would help if we join heads and think
> about the direction we want to take on compute.
>
> SIMD is great, and when Paddy's hashed out how it works, more of us will be
> able to contribute SIMD compatible compute operators.
>
> Thanks,
> Neville
>
> On Tue, 12 Feb 2019 at 18:12, Andy Grove  wrote:
>
> > I was curious what our Rust committers and contributors are excited about
> > for 0.13.0.
> >
> > The feature I would most like to see is that ability for DataFusion to
> run
> > SQL against Parquet files again, as that would give me an excuse for a
> PoC
> > in my day job using Arrow.
> >
> > I know there were some efforts underway to build arrow array readers for
> > Parquet and it would make sense for me to help there.
> >
> > I would also like to start building out some benchmarks.
> >
> > I think the SIMD work is exciting too.
> >
> > I'd like to hear thoughts from everyone else though since we're all
> coming
> > at this from different perspectives.
> >
> > Thanks,
> >
> > Andy.
> >
>


Re: [Rust] Rust 0.13.0 release

2019-02-12 Thread paddy horan
Hi All,

The focus for me for 0.13.0 is SIMD.  I would like to port all the "ops" in 
"array_ops" to the new "compute" module and leverage SIMD for them all.  I have 
most of this done in various forks.

Past 0.13.0 I would really like to work toward getting Rust running in the 
integration tests.  The thing I am most excited about regarding Arrow is the 
concept of defining computational libraries in say Rust and being able to use 
them from any implementation, pyarrow probably for me.  This all starts and 
ends with the integration tests.

Also, Gandiva is fascinating I would love to have robust support for this in 
Rust (via bindings)...

Regards,
P



From: Neville Dipale 
Sent: Tuesday, February 12, 2019 11:33 AM
To: dev@arrow.apache.org
Subject: Re: [Rust] Rust 0.13.0 release

Thanks for bringing this up Andy.

I'm unemployed/on recovery leave, so I've had some surplus time to work on
Rust.

There's a lot of features that I've wanted to work on, some which I've
spent some time attempting, but struggled with. A few block additional work
that I could contribute.

In 0.13.0 and the release thereafter: I'd like to see:

Date/time support. I've spent a lot of time trying to implement this, but I
get the feeling that my Rust isn't good enough yet to pull this together.

More IO support.
I'm working on JSON reader, and want to work on JSON and CSV (continuing
where you left off) writers after this.
With date/time support, I can also work on date/time parsing so we can have
these in CSV and JSON.
Parquet support isn't on my radar at the moment. JSON and CSV are more
commonly used, so I'm hoping that with concrete support for these, more
people using Rust can choose to integrate Arrow. That could bring us more
hands to help.

Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried
working on it but failed. Related to this would be array chunking.
I need these in order to be able to operate on "Tables" like CPP, Python
and others. I've got ChunkedArray, Column and Table roughly implemented in
my fork, but without zero-copy slicing, I can't upstream them.

I've made good progress on scalar and array operations. I have trig
functions, some string operators and other functions that one can run on a
Spark-esque dataframe.
These will fit in well with DataFusion's SQL operations, but from a
decision-perspective, I think it would help if we join heads and think
about the direction we want to take on compute.

SIMD is great, and when Paddy's hashed out how it works, more of us will be
able to contribute SIMD compatible compute operators.

Thanks,
Neville

On Tue, 12 Feb 2019 at 18:12, Andy Grove  wrote:

> I was curious what our Rust committers and contributors are excited about
> for 0.13.0.
>
> The feature I would most like to see is that ability for DataFusion to run
> SQL against Parquet files again, as that would give me an excuse for a PoC
> in my day job using Arrow.
>
> I know there were some efforts underway to build arrow array readers for
> Parquet and it would make sense for me to help there.
>
> I would also like to start building out some benchmarks.
>
> I think the SIMD work is exciting too.
>
> I'd like to hear thoughts from everyone else though since we're all coming
> at this from different perspectives.
>
> Thanks,
>
> Andy.
>


[jira] [Created] (ARROW-4549) [C++] Can't build benchmark code on CUDA enabled build

2019-02-12 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4549:
---

 Summary: [C++] Can't build benchmark code on CUDA enabled build
 Key: ARROW-4549
 URL: https://issues.apache.org/jira/browse/ARROW-4549
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, GPU
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4548) [C++] run-clang-format.py is not supported on Windows

2019-02-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4548:
---

 Summary: [C++] run-clang-format.py is not supported on Windows
 Key: ARROW-4548
 URL: https://issues.apache.org/jira/browse/ARROW-4548
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Wes McKinney


I tried to fix it but no matter what option I pass for {{--line-ending}} to 
{{cmake-format}} it converts LF line endings to CRLF. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4547) [Python][Documentation] Update python/development.rst with instructions for CUDA-enabled builds

2019-02-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4547:
---

 Summary: [Python][Documentation] Update python/development.rst 
with instructions for CUDA-enabled builds
 Key: ARROW-4547
 URL: https://issues.apache.org/jira/browse/ARROW-4547
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Wes McKinney
 Fix For: 0.13.0


Building a CUDA-enabled install is not documented



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Wes McKinney
On Tue, Feb 12, 2019 at 3:46 PM Antoine Pitrou  wrote:
>
>
> Le 12/02/2019 à 22:34, Wes McKinney a écrit :
> > On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou  wrote:
> >>
> >>
> >> Hi David,
> >>
> >> I think allowing to send application-specific ancillary data in addition
> >> to Arrow data makes sense.
> >>
> >> (I'm also wondering whether the choice of gRPC is appropriate at all -
> >> the current C++ hacks around "zero-copy" are not pretty and they may not
> >> translate to other languages either)
> >>
> >
> > This is unrelated to the discussion of extending the Flight protocol,
> > but I'm not sure I would describe the serialization optimizations that
> > have been implemented as "hacks". gRPC exposes its message
> > serialization layer among other things to permit extensibility and to
> > not require the use of Protocol Buffers necessarily.
>
> One thing that surfaced is that the current implementation relies on C++
> undefined behaviour (the reinterpret_cast from pb::FlightData to the
> unrelated struct FlightData).  I don't know if there's a way to
> reimplement the optimization without that cast, but otherwise it's cause
> for worry, IMHO.

Is there a JIRA about this? I spent some time looking around gRPC's
C++ library (which is header-only) and AFAICT the only exposure of
the template parameter to any relevant part of the code is at the
SerializationTraits interface, so the two template types should be
internally isomorphic (but I am not a C++ language lawyer). There may
be a safer way to get the library to generate the code we are looking
for. Note that the initial C++ implementation was written over a short
period of a few days; my goal was to get something working and do more
research later.

>
> > The reason that we chose to use the Protobuf wire format for all
> > message types, including data, is that there is excellent
> > cross-language support for protobufs, and among production-ready RPC
> > frameworks, gRPC has the most robust language support, covering pretty
> > much all the languages we care about:
> > https://github.com/grpc/grpc#to-start-using-grpc. The only one missing
> > is Rust, and I reckon that will get rectified at some point (there is
> > already https://github.com/stepancheg/grpc-rust, maybe it will be
> > adopted into gRPC formally at some point). But to have C++, C#, Go,
> > Java, and Node officially supported out of the box is not nothing. I
> > think it would be unwise to go a different way unless you have some
> > compelling reason that gRPC / HTTP/2 is fundamentally flawed for this
> > intended use.
>
> Since our use case pretty much requires high-performance transmission
> with as few copies as possible (ideally, data should be directly sent
> from/received to Arrow buffers without any intermediate userspace
> copies), I think we should evaluate whether gRPC can allow us to achieve
> that (there are still copies currently, AFAICT), and at which cost.
>
> As a side note, the Flight C++ benchmark currently achieves a bit more
> than 2 GB/s here.  There may be ways to improve this number (does gRPC
> enable TLS by default? does it compress by default?)...
>

One design question as we work on this project is how one could open a
"side channel" of sorts for moving the dataset itself outside of gRPC
while still using the flexible command layer.

> Regards
>
> Antoine.


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Antoine Pitrou


Le 12/02/2019 à 22:34, Wes McKinney a écrit :
> On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou  wrote:
>>
>>
>> Hi David,
>>
>> I think allowing to send application-specific ancillary data in addition
>> to Arrow data makes sense.
>>
>> (I'm also wondering whether the choice of gRPC is appropriate at all -
>> the current C++ hacks around "zero-copy" are not pretty and they may not
>> translate to other languages either)
>>
> 
> This is unrelated to the discussion of extending the Flight protocol,
> but I'm not sure I would describe the serialization optimizations that
> have been implemented as "hacks". gRPC exposes its message
> serialization layer among other things to permit extensibility and to
> not require the use of Protocol Buffers necessarily.

One thing that surfaced is that the current implementation relies on C++
undefined behaviour (the reinterpret_cast from pb::FlightData to the
unrelated struct FlightData).  I don't know if there's a way to
reimplement the optimization without that cast, but otherwise it's cause
for worry, IMHO.

> The reason that we chose to use the Protobuf wire format for all
> message types, including data, is that there is excellent
> cross-language support for protobufs, and among production-ready RPC
> frameworks, gRPC has the most robust language support, covering pretty
> much all the languages we care about:
> https://github.com/grpc/grpc#to-start-using-grpc. The only one missing
> is Rust, and I reckon that will get rectified at some point (there is
> already https://github.com/stepancheg/grpc-rust, maybe it will be
> adopted into gRPC formally at some point). But to have C++, C#, Go,
> Java, and Node officially supported out of the box is not nothing. I
> think it would be unwise to go a different way unless you have some
> compelling reason that gRPC / HTTP/2 is fundamentally flawed for this
> intended use.

Since our use case pretty much requires high-performance transmission
with as few copies as possible (ideally, data should be directly sent
from/received to Arrow buffers without any intermediate userspace
copies), I think we should evaluate whether gRPC can allow us to achieve
that (there are still copies currently, AFAICT), and at which cost.

As a side note, the Flight C++ benchmark currently achieves a bit more
than 2 GB/s here.  There may be ways to improve this number (does gRPC
enable TLS by default? does it compress by default?)...

Regards

Antoine.


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Wes McKinney
Even if zeromq did make more sense, we couldn't take it on as a
dependency because of non-ASF-compatible licenses

Java zeromq: MPL 2.0
libzmq: GPL

On Tue, Feb 12, 2019 at 3:33 PM Jonathan Chiang  wrote:
>
> Would zeromq make more sense than gRPC?
>
> Thanks,
> Jonathan
>
> > On Feb 12, 2019, at 12:48 PM, Antoine Pitrou  wrote:
> >
> >
> > Hi David,
> >
> > I think allowing to send application-specific ancillary data in addition
> > to Arrow data makes sense.
> >
> > (I'm also wondering whether the choice of gRPC is appropriate at all -
> > the current C++ hacks around "zero-copy" are not pretty and they may not
> > translate to other languages either)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >> Le 12/02/2019 à 21:44, David Ming Li a écrit :
> >> Hi all,
> >>
> >>
> >>
> >> We've been evaluating Flight for our use, and we're wondering if the 
> >> protocol is still open to extensions, as having a few application-defined 
> >> metadata fields would help our use cases a lot.
> >>
> >>
> >>
> >> (Apologies if this is a repost - was having issue with the spam filter.)
> >>
> >>
> >>
> >> Specifically, in DoGet, having a metadata binary blob in the 
> >> server->client messages would help implement resumable requests, 
> >> especially as we have non-monotonically-indexed data streams. This would 
> >> also help us reuse server-side state if we do have to resume a stream.
> >>
> >>
> >>
> >> In DoPut, we think making this call bidirectional would be useful to 
> >> support application-level ACKs, again to implement resumable uploads. The 
> >> server would thus have the option to send back an application-defined 
> >> binary blob at any point during an upload. This is less important, as you 
> >> could imagine starting a plain gRPC server-streaming call alongside the 
> >> Flight DoPut call to do the same. But as you can't bind a gRPC and Flight 
> >> service on the same port/channel, this is somewhat inconvenient.
> >>
> >>
> >>
> >> That leads me to the API-level niggles we have; it would be nice to be 
> >> able to bind gRPC services alongside a Flight service, and conversely be 
> >> able to reuse a gRPC channel across gRPC and Flight clients, though 
> >> breaking the hiding of gRPC isn't desirable.
> >>
> >>
> >>
> >> Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' 
> >> methods, so that we don't have to busy-wait ourselves (as in Java) or have 
> >> the option to not busy-wait taken away from us (as in C++). In particular, 
> >> when investigating Python bindings to C++ [0], the fact that 
> >> FlightServerBase::Run also calls grpc::Server::Wait for you means that 
> >> Ctrl-C no longer works in Python.
> >>
> >>
> >>
> >> Does what we're trying to accomplish make sense? Are there better ways to 
> >> achieve resumable uploads/downloads in the current protocol?
> >>
> >>
> >>
> >> [0]: https://github.com/apache/arrow/pull/3566
> >>
> >>
> >>
> >> Thanks,
> >>
> >> David
> >>
> >>


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Wes McKinney
On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou  wrote:
>
>
> Hi David,
>
> I think allowing to send application-specific ancillary data in addition
> to Arrow data makes sense.
>
> (I'm also wondering whether the choice of gRPC is appropriate at all -
> the current C++ hacks around "zero-copy" are not pretty and they may not
> translate to other languages either)
>

This is unrelated to the discussion of extending the Flight protocol,
but I'm not sure I would describe the serialization optimizations that
have been implemented as "hacks". gRPC exposes its message
serialization layer among other things to permit extensibility and to
not require the use of Protocol Buffers necessarily.

The reason that we chose to use the Protobuf wire format for all
message types, including data, is that there is excellent
cross-language support for protobufs, and among production-ready RPC
frameworks, gRPC has the most robust language support, covering pretty
much all the languages we care about:
https://github.com/grpc/grpc#to-start-using-grpc. The only one missing
is Rust, and I reckon that will get rectified at some point (there is
already https://github.com/stepancheg/grpc-rust, maybe it will be
adopted into gRPC formally at some point). But to have C++, C#, Go,
Java, and Node officially supported out of the box is not nothing. I
think it would be unwise to go a different way unless you have some
compelling reason that gRPC / HTTP/2 is fundamentally flawed for this
intended use.

For the FlightData message in particular, if a Flight client is
unconcerned with memory optimizations, it can simply leave the
serialization to its Protocol Buffers implementation. This also means
that Arrow-agnostic gRPC clients can interact with Flight services
using only the Flight.proto and some knowledge about what commands the
server provides.

In speaking with other parties about Flight, there is some interest
in supporting different underlying data movement schemes while
preserving the gRPC command layer, e.g. optimizing for high-bandwidth
networking like InfiniBand.

- Wes

> Regards
>
> Antoine.
>
>
> Le 12/02/2019 à 21:44, David Ming Li a écrit :
> > Hi all,
> >
> >
> >
> > We've been evaluating Flight for our use, and we're wondering if the 
> > protocol is still open to extensions, as having a few application-defined 
> > metadata fields would help our use cases a lot.
> >
> >
> >
> > (Apologies if this is a repost - was having issue with the spam filter.)
> >
> >
> >
> > Specifically, in DoGet, having a metadata binary blob in the server->client 
> > messages would help implement resumable requests, especially as we have 
> > non-monotonically-indexed data streams. This would also help us reuse 
> > server-side state if we do have to resume a stream.
> >
> >
> >
> > In DoPut, we think making this call bidirectional would be useful to 
> > support application-level ACKs, again to implement resumable uploads. The 
> > server would thus have the option to send back an application-defined 
> > binary blob at any point during an upload. This is less important, as you 
> > could imagine starting a plain gRPC server-streaming call alongside the 
> > Flight DoPut call to do the same. But as you can't bind a gRPC and Flight 
> > service on the same port/channel, this is somewhat inconvenient.
> >
> >
> >
> > That leads me to the API-level niggles we have; it would be nice to be able 
> > to bind gRPC services alongside a Flight service, and conversely be able to 
> > reuse a gRPC channel across gRPC and Flight clients, though breaking the 
> > hiding of gRPC isn't desirable.
> >
> >
> >
> > Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' 
> > methods, so that we don't have to busy-wait ourselves (as in Java) or have 
> > the option to not busy-wait taken away from us (as in C++). In particular, 
> > when investigating Python bindings to C++ [0], the fact that 
> > FlightServerBase::Run also calls grpc::Server::Wait for you means that 
> > Ctrl-C no longer works in Python.
> >
> >
> >
> > Does what we're trying to accomplish make sense? Are there better ways to 
> > achieve resumable uploads/downloads in the current protocol?
> >
> >
> >
> > [0]: https://github.com/apache/arrow/pull/3566
> >
> >
> >
> > Thanks,
> >
> > David
> >
> >


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Jonathan Chiang
Would zeromq make more sense than gRPC? 

Thanks,
Jonathan 

> On Feb 12, 2019, at 12:48 PM, Antoine Pitrou  wrote:
> 
> 
> Hi David,
> 
> I think allowing to send application-specific ancillary data in addition
> to Arrow data makes sense.
> 
> (I'm also wondering whether the choice of gRPC is appropriate at all -
> the current C++ hacks around "zero-copy" are not pretty and they may not
> translate to other languages either)
> 
> Regards
> 
> Antoine.
> 
> 
>> Le 12/02/2019 à 21:44, David Ming Li a écrit :
>> Hi all,
>> 
>> 
>> 
>> We've been evaluating Flight for our use, and we're wondering if the 
>> protocol is still open to extensions, as having a few application-defined 
>> metadata fields would help our use cases a lot.
>> 
>> 
>> 
>> (Apologies if this is a repost - was having issue with the spam filter.)
>> 
>> 
>> 
>> Specifically, in DoGet, having a metadata binary blob in the server->client 
>> messages would help implement resumable requests, especially as we have 
>> non-monotonically-indexed data streams. This would also help us reuse 
>> server-side state if we do have to resume a stream.
>> 
>> 
>> 
>> In DoPut, we think making this call bidirectional would be useful to support 
>> application-level ACKs, again to implement resumable uploads. The server 
>> would thus have the option to send back an application-defined binary blob 
>> at any point during an upload. This is less important, as you could imagine 
>> starting a plain gRPC server-streaming call alongside the Flight DoPut call 
>> to do the same. But as you can't bind a gRPC and Flight service on the same 
>> port/channel, this is somewhat inconvenient.
>> 
>> 
>> 
>> That leads me to the API-level niggles we have; it would be nice to be able 
>> to bind gRPC services alongside a Flight service, and conversely be able to 
>> reuse a gRPC channel across gRPC and Flight clients, though breaking the 
>> hiding of gRPC isn't desirable.
>> 
>> 
>> 
>> Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' 
>> methods, so that we don't have to busy-wait ourselves (as in Java) or have 
>> the option to not busy-wait taken away from us (as in C++). In particular, 
>> when investigating Python bindings to C++ [0], the fact that 
>> FlightServerBase::Run also calls grpc::Server::Wait for you means that 
>> Ctrl-C no longer works in Python.
>> 
>> 
>> 
>> Does what we're trying to accomplish make sense? Are there better ways to 
>> achieve resumable uploads/downloads in the current protocol?
>> 
>> 
>> 
>> [0]: https://github.com/apache/arrow/pull/3566
>> 
>> 
>> 
>> Thanks,
>> 
>> David
>> 
>> 


Re: Arrow Flight protocol/API questions

2019-02-12 Thread Antoine Pitrou


Hi David,

I think allowing to send application-specific ancillary data in addition
to Arrow data makes sense.

(I'm also wondering whether the choice of gRPC is appropriate at all -
the current C++ hacks around "zero-copy" are not pretty and they may not
translate to other languages either)

Regards

Antoine.


Le 12/02/2019 à 21:44, David Ming Li a écrit :
> Hi all,
> 
> 
> 
> We've been evaluating Flight for our use, and we're wondering if the protocol 
> is still open to extensions, as having a few application-defined metadata 
> fields would help our use cases a lot.
> 
> 
> 
> (Apologies if this is a repost - was having issue with the spam filter.)
> 
> 
> 
> Specifically, in DoGet, having a metadata binary blob in the server->client 
> messages would help implement resumable requests, especially as we have 
> non-monotonically-indexed data streams. This would also help us reuse 
> server-side state if we do have to resume a stream.
> 
> 
> 
> In DoPut, we think making this call bidirectional would be useful to support 
> application-level ACKs, again to implement resumable uploads. The server 
> would thus have the option to send back an application-defined binary blob at 
> any point during an upload. This is less important, as you could imagine 
> starting a plain gRPC server-streaming call alongside the Flight DoPut call 
> to do the same. But as you can't bind a gRPC and Flight service on the same 
> port/channel, this is somewhat inconvenient.
> 
> 
> 
> That leads me to the API-level niggles we have; it would be nice to be able 
> to bind gRPC services alongside a Flight service, and conversely be able to 
> reuse a gRPC channel across gRPC and Flight clients, though breaking the 
> hiding of gRPC isn't desirable.
> 
> 
> 
> Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' 
> methods, so that we don't have to busy-wait ourselves (as in Java) or have 
> the option to not busy-wait taken away from us (as in C++). In particular, 
> when investigating Python bindings to C++ [0], the fact that 
> FlightServerBase::Run also calls grpc::Server::Wait for you means that Ctrl-C 
> no longer works in Python.
> 
> 
> 
> Does what we're trying to accomplish make sense? Are there better ways to 
> achieve resumable uploads/downloads in the current protocol?
> 
> 
> 
> [0]: https://github.com/apache/arrow/pull/3566
> 
> 
> 
> Thanks,
> 
> David
> 
> 


Arrow Flight protocol/API questions

2019-02-12 Thread David Ming Li
Hi all,



We've been evaluating Flight for our use, and we're wondering if the protocol 
is still open to extensions, as having a few application-defined metadata 
fields would help our use cases a lot.



(Apologies if this is a repost - was having issue with the spam filter.)



Specifically, in DoGet, having a metadata binary blob in the server->client 
messages would help implement resumable requests, especially as we have 
non-monotonically-indexed data streams. This would also help us reuse 
server-side state if we do have to resume a stream.



In DoPut, we think making this call bidirectional would be useful to support 
application-level ACKs, again to implement resumable uploads. The server would 
thus have the option to send back an application-defined binary blob at any 
point during an upload. This is less important, as you could imagine starting a 
plain gRPC server-streaming call alongside the Flight DoPut call to do the 
same. But as you can't bind a gRPC and Flight service on the same port/channel, 
this is somewhat inconvenient.



That leads me to the API-level niggles we have; it would be nice to be able to 
bind gRPC services alongside a Flight service, and conversely be able to reuse 
a gRPC channel across gRPC and Flight clients, though breaking the hiding of 
gRPC isn't desirable.



Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' methods, 
so that we don't have to busy-wait ourselves (as in Java) or have the option to 
not busy-wait taken away from us (as in C++). In particular, when investigating 
Python bindings to C++ [0], the fact that FlightServerBase::Run also calls 
grpc::Server::Wait for you means that Ctrl-C no longer works in Python.



Does what we're trying to accomplish make sense? Are there better ways to 
achieve resumable uploads/downloads in the current protocol?



[0]: https://github.com/apache/arrow/pull/3566



Thanks,

David



[jira] [Created] (ARROW-4546) LICENSE.txt should be updated.

2019-02-12 Thread Renat Valiullin (JIRA)
Renat Valiullin created ARROW-4546:
--

 Summary: LICENSE.txt should be updated.
 Key: ARROW-4546
 URL: https://issues.apache.org/jira/browse/ARROW-4546
 Project: Apache Arrow
  Issue Type: Task
Reporter: Renat Valiullin


parquet-cpp/blob/master/LICENSE.txt is not mentioned there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4545) [C#] Extend Append/AppendRange in BinaryArray to support building rows

2019-02-12 Thread Chris Hutchinson (JIRA)
Chris Hutchinson created ARROW-4545:
---

 Summary: [C#] Extend Append/AppendRange in BinaryArray to support 
building rows
 Key: ARROW-4545
 URL: https://issues.apache.org/jira/browse/ARROW-4545
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Chris Hutchinson
 Fix For: 0.13.0


This is a proposal to extend BinaryArray to provide the ability to call 
Append/AppendRange to grow individual rows during array building, and to expose 
values in ArrowBuffer.Builder through a property to facilitate algorithms 
that require introspecting the buffer data (sorting, filtering) when building 
an array.

*Example:*
{code:java}
var builder = new BinaryArray.Builder()
    .Append(10, false)
    .Append(20, false)
    .Mark();
builder.Append(builder.Values[0], true);
var array = builder.Build();

// General idea:
//
// 1. Append byte (10) to current element (0)
// 2. Append byte (20) to current element (0)
// 3. Mark end of the row
// 4. Append byte (10) to current element (1)

// Constructs a binary array with 2 elements:
//
// [0] 10, 20
// [1] 10
{code}

This proposed change would add the concept of a "current element" to the builder; 
in the specification, elements are separated by recorded value offsets. 
Append(true) appends one or more bytes to the current element and then marks 
the element as completed. Append(false) appends one or more bytes to the 
current element; Mark is required to signal to the builder that the current 
element is complete.
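
To illustrate the proposed semantics outside of C#, here is a toy Python sketch 
(my own illustration, not the actual Arrow C# API; the method names only mirror 
the proposal):

{code:python}
class ToyBinaryBuilder:
    """Toy model: a values buffer plus value offsets; mark() closes the element."""
    def __init__(self):
        self.values = bytearray()
        self.offsets = [0]

    def append(self, byte, close=False):
        # Append one byte to the current element; optionally close the element.
        self.values.append(byte)
        if close:
            self.mark()
        return self

    def mark(self):
        # Record the value offset, ending the current element.
        self.offsets.append(len(self.values))
        return self

b = (ToyBinaryBuilder()
     .append(10)                # element 0: 10
     .append(20, close=True)    # element 0: 10, 20 (closed)
     .append(10, close=True))   # element 1: 10 (closed)
print(list(b.values), b.offsets)  # [10, 20, 10] [0, 2, 3]
{code}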



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Rust] Rust 0.13.0 release

2019-02-12 Thread Neville Dipale
Thanks for bringing this up Andy.

I'm unemployed/on recovery leave, so I've had some surplus time to work on
Rust.

There's a lot of features that I've wanted to work on, some which I've
spent some time attempting, but struggled with. A few block additional work
that I could contribute.

In 0.13.0 and the release thereafter: I'd like to see:

Date/time support. I've spent a lot of time trying to implement this, but I
get the feeling that my Rust isn't good enough yet to pull this together.

More IO support.
I'm working on JSON reader, and want to work on JSON and CSV (continuing
where you left off) writers after this.
With date/time support, I can also work on date/time parsing so we can have
these in CSV and JSON.
Parquet support isn't on my radar at the moment. JSON and CSV are more
commonly used, so I'm hoping that with concrete support for these, more
people using Rust can choose to integrate Arrow. That could bring us more
hands to help.

Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried
working on it but failed. Related to this would be array chunking.
I need these in order to be able to operate on "Tables" like CPP, Python
and others. I've got ChunkedArray, Column and Table roughly implemented in
my fork, but without zero-copy slicing, I can't upstream them.

I've made good progress on scalar and array operations. I have trig
functions, some string operators and other functions that one can run on a
Spark-esque dataframe.
These will fit in well with DataFusion's SQL operations, but from a
decision-perspective, I think it would help if we join heads and think
about the direction we want to take on compute.

SIMD is great, and when Paddy's hashed out how it works, more of us will be
able to contribute SIMD compatible compute operators.

Thanks,
Neville

On Tue, 12 Feb 2019 at 18:12, Andy Grove  wrote:

> I was curious what our Rust committers and contributors are excited about
> for 0.13.0.
>
> The feature I would most like to see is that ability for DataFusion to run
> SQL against Parquet files again, as that would give me an excuse for a PoC
> in my day job using Arrow.
>
> I know there were some efforts underway to build arrow array readers for
> Parquet and it would make sense for me to help there.
>
> I would also like to start building out some benchmarks.
>
> I think the SIMD work is exciting too.
>
> I'd like to hear thoughts from everyone else though since we're all coming
> at this from different perspectives.
>
> Thanks,
>
> Andy.
>


[Rust] Rust 0.13.0 release

2019-02-12 Thread Andy Grove
I was curious what our Rust committers and contributors are excited about
for 0.13.0.

The feature I would most like to see is that ability for DataFusion to run
SQL against Parquet files again, as that would give me an excuse for a PoC
in my day job using Arrow.

I know there were some efforts underway to build arrow array readers for
Parquet and it would make sense for me to help there.

I would also like to start building out some benchmarks.

I think the SIMD work is exciting too.

I'd like to hear thoughts from everyone else though since we're all coming
at this from different perspectives.

Thanks,

Andy.


[jira] [Created] (ARROW-4544) [Rust] Read nested JSON structs into StructArrays

2019-02-12 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4544:
-

 Summary: [Rust] Read nested JSON structs into StructArrays
 Key: ARROW-4544
 URL: https://issues.apache.org/jira/browse/ARROW-4544
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


_Adding this as a separate task as it's a bit involved._

Add the ability to read in JSON structs that are children of the JSON record 
being read.
The main concern here is deeply nested structures, which will require a 
performant and reusable basic JSON reader before dealing with recursion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4543) [C#] Update Flat Buffers code to latest version

2019-02-12 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-4543:
---

 Summary: [C#] Update Flat Buffers code to latest version
 Key: ARROW-4543
 URL: https://issues.apache.org/jira/browse/ARROW-4543
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt


In order to support zero-copy reads, we should update to the latest Google Flat 
Buffers code. A recent change now allows [C# support for directly reading and 
writing to memory other than 
byte[]|https://github.com/google/flatbuffers/pull/4886] which will make reading 
native memory using `Memory` possible.

Along with this update, we should mark the flat buffers types as `internal`, 
since they are an implementation detail of the library. From an API 
perspective, it is confusing to see multiple public types named "Schema", 
"Field", "RecordBatch" etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4542) Denominate row group size in bytes (not in no of rows)

2019-02-12 Thread Remek Zajac (JIRA)
Remek Zajac created ARROW-4542:
--

 Summary: Denominate row group size in bytes (not in no of rows)
 Key: ARROW-4542
 URL: https://issues.apache.org/jira/browse/ARROW-4542
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Remek Zajac


Both the C++ [implementation of the parquet writer for 
arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1174]
 and the [Python code bound to 
it|https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L911]
 appear to be denominated in the *number of rows* (without making this very 
explicit). Whereas:

(1) [The Apache parquet 
documentation|https://parquet.apache.org/documentation/latest/] states: 

"_Row group size: Larger row groups allow for larger column chunks which makes 
it possible to do larger sequential IO. Larger groups also require more 
buffering in the write path (or a two pass write). *We recommend large row 
groups (512MB - 1GB)*. Since an entire row group might need to be read, we want 
it to completely fit on one HDFS block. Therefore, HDFS block sizes should also 
be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS 
block size, 1 HDFS block per HDFS file._"

(2) Reference Apache [parquet-mr 
implementation|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L146]
 for Java accepts the row size expressed in bytes.

(3) The [low-level parquet read-write 
example|https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer2.cc#L88]
 also considers row group be denominated in bytes.

These insights make me conclude that:
 * Per the parquet design, and to take advantage of HDFS block-level operations, it 
only makes sense to work with row group sizes expressed in bytes - as that is 
the only consequential preference the caller can express and want to influence.
 * The Arrow implementation of ParquetWriter would benefit from re-denominating 
its `row_group_size` in bytes.

Now, my conclusions may be wrong and I may be blind to some alley of reasoning, 
so this ticket is more of a question than a bug - a question of whether the 
audience here agrees with my reasoning and, if not, an invitation to explain 
what detail I have missed.
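
For illustration, a rough sketch of the conversion a caller effectively has to 
do today (assuming {{pq.write_table}}'s {{row_group_size}} counts rows, and 
using {{Table.nbytes}} as a rough in-memory size - both assumptions on my part):

{code:python}
import pyarrow.parquet as pq

def write_with_byte_target(table, where, target_bytes=512 * 1024 * 1024):
    # Approximate rows-per-group from a byte target and the average row width.
    avg_row_bytes = max(1, table.nbytes // max(1, table.num_rows))
    rows_per_group = max(1, target_bytes // avg_row_bytes)
    pq.write_table(table, where, row_group_size=rows_per_group)
{code}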

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4541) [Gandiva] Enable timestamp tests on windows platform

2019-02-12 Thread shyam narayan singh (JIRA)
shyam narayan singh created ARROW-4541:
--

 Summary: [Gandiva] Enable timestamp tests on windows platform
 Key: ARROW-4541
 URL: https://issues.apache.org/jira/browse/ARROW-4541
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: shyam narayan singh


As the timezone database is not available on the Windows operating system, the 
cast timestamp test cases that use timezone APIs are failing.

Tests are currently disabled on the Windows platform. We need to find a way to 
test the timezone APIs on Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4540) [Rust] Add basic JSON reader

2019-02-12 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4540:
-

 Summary: [Rust] Add basic JSON reader
 Key: ARROW-4540
 URL: https://issues.apache.org/jira/browse/ARROW-4540
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


This is the first step in getting a JSON reader working in Rust



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4539) [Java]List vector child value count not set correctly

2019-02-12 Thread Praveen Kumar Desabandu (JIRA)
Praveen Kumar Desabandu created ARROW-4539:
--

 Summary: [Java]List vector child value count not set correctly
 Key: ARROW-4539
 URL: https://issues.apache.org/jira/browse/ARROW-4539
 Project: Apache Arrow
  Issue Type: Task
Reporter: Praveen Kumar Desabandu
Assignee: Praveen Kumar Desabandu
 Fix For: 0.14.0


We are not correctly processing list vectors that could have null values. The 
child value count would be off, thereby losing data in variable-width vectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4538) pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()

2019-02-12 Thread Christian Thiel (JIRA)
Christian Thiel created ARROW-4538:
--

 Summary: pa.Table.from_pandas() with df.index.name != None breaks 
write_to_dataset()
 Key: ARROW-4538
 URL: https://issues.apache.org/jira/browse/ARROW-4538
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
Reporter: Christian Thiel


When using {{pa.Table.from_pandas()}} with preserve_index=True and 
dataframe.index.name != None, the prefix {{__index_level_}} is not added to the 
respective schema name. This breaks {{write_to_dataset}} with active partition 
columns.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)

df.index.name='ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'],
                    preserve_index=True)
{code}
Removing {{df.index.name='ID'}} works. Also disabling {{partition_cols}} in 
{{write_to_dataset}} works.
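
A possible workaround sketch (untested, an assumption rather than a verified 
fix): fold the named index into an ordinary column before conversion so that no 
{{__index_level_}} name is involved:

{code:python}
# Continuing from the reproduction above: 'ID' becomes an ordinary column.
df_reset = df.reset_index()
table = pa.Table.from_pandas(df_reset, preserve_index=False)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])
{code}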



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4537) [CI] Suppress shell warning on travis-ci

2019-02-12 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4537:
---

 Summary: [CI] Suppress shell warning on travis-ci
 Key: ARROW-4537
 URL: https://issues.apache.org/jira/browse/ARROW-4537
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Kenta Murata


Suppress shell warnings like:

{{+'[' == 1 ']'}}
{{/home/travis/build/apache/arrow/ci/travis_before_script_cpp.sh: line 81: [: 
==: unary operator expected}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)