Compute kernels and Gandiva operators
Hi,

I was looking at the recent check-in for Arrow kernels and started to think about how they would work alongside Gandiva. Here are my thoughts:

1. Gandiva already has two high-level operators, namely project and filter, with runtime code generation.
 * It already supports hundreds of functions (e.g. a+b, a > b), which can be combined into expressions (e.g. a+b > c && a+b < d) for each of the operators, and we'll likely continue to add more of them.
 * It works on one record batch at a time - it consumes a record batch and produces a record batch.
 * The operators can be inter-linked (e.g. project -> filter -> project) to build a pipeline.
 * We may build additional operators in the future that could benefit from code generation (e.g. Impala uses code generation when parsing Avro files).

2. Arrow kernels
 a. Support project/filter operators. Useful for functions where there is no benefit from code generation, or where code generation can be skipped (eager evaluation).
 b. Support for additional operators like aggregates.

How do we combine and link the Gandiva operators and the kernels? For example, it would be nice to have a pipeline with scan (read from source), project (expression on a column), filter (extract rows), and aggregate (sum on the extracted column).

To do this, I think we would need to be able to build a pipeline of high-level operators that move data along one record batch at a time (see the sketch after this message):
 - a source operator that only produces record batches (maybe a CSV reader)
 - intermediate operators that can produce/consume record batches (maybe the Gandiva project operator)
 - terminal operators that emit the final output (from the end of the pipeline) when there is nothing left to consume (maybe a SumKernel)

Are we thinking along these lines?

Thanks & regards,
Ravindra.
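A minimal Python sketch of the kind of record-batch pipeline described above. The operator classes here are hypothetical illustrations only - they are not Gandiva or Arrow kernel APIs - and only show how source, intermediate, and terminal operators could be chained, one record batch at a time:

{code:python}
import pyarrow as pa

class Source:
    """Source operator: only produces record batches (stand-in for a CSV reader)."""
    def __init__(self, batches):
        self._batches = batches
    def __iter__(self):
        return iter(self._batches)

class Project:
    """Intermediate operator: consumes record batches and produces record batches."""
    def __init__(self, upstream, fn, name):
        self._upstream, self._fn, self._name = upstream, fn, name
    def __iter__(self):
        for batch in self._upstream:
            new_col = pa.array(self._fn(batch))
            yield pa.RecordBatch.from_arrays(
                list(batch.columns) + [new_col],
                batch.schema.names + [self._name])

class Sum:
    """Terminal operator: emits its result once there is nothing left to consume."""
    def __init__(self, upstream, name):
        self._upstream, self._name = upstream, name
    def run(self):
        total = 0
        for batch in self._upstream:
            idx = batch.schema.get_field_index(self._name)
            total += sum(batch.column(idx).to_pylist())
        return total

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([10, 20, 30])], ['a', 'b'])

# scan -> project (a + b) -> aggregate (sum of the projected column)
pipeline = Sum(
    Project(Source([batch]),
            lambda b: [x + y for x, y in zip(b.column(0).to_pylist(),
                                             b.column(1).to_pylist())],
            'a_plus_b'),
    'a_plus_b')
print(pipeline.run())  # 66
{code}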
[jira] [Created] (ARROW-4559) pyarrow can't read/write filenames with special characters
Jean-Christophe Petkovich created ARROW-4559: Summary: pyarrow can't read/write filenames with special characters Key: ARROW-4559 URL: https://issues.apache.org/jira/browse/ARROW-4559 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.12.0 Environment: $ python3 --version Python 3.6.6 $ pip3 freeze | grep -Ei 'pyarrow|pandas' pandas==0.24.1 pyarrow==0.12.0 Reporter: Jean-Christophe Petkovich

When writing or reading files to or from paths that have special characters in them (e.g., "#"), pyarrow returns an error:

{code:python}
OSError: Passed non-file path...
{code}

This is a consequence of the following line: https://github.com/apache/arrow/blob/master/python/pyarrow/filesystem.py#L416

File paths are parsed as URIs, which gives strange results for paths like "bad # actor.parquet": ParseResult(scheme='', netloc='', path='/tmp/bad ', params='', query='', fragment='actor.parquet')

This is trivial to reproduce with the following code, which uses the `DataFrame.to_parquet` and `pd.read_parquet` interfaces:

{code:python}
import pandas as pd

x = pd.DataFrame({"a": [1, 2, 3]})
x.to_parquet("bad # actor.parquet")
pd.read_parquet("bad # actor.parquet")
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
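The truncation at "#" can be seen directly with the standard library's URI parser, which is the parsing behaviour the report describes; a minimal illustration, independent of pyarrow:

{code:python}
from urllib.parse import urlparse

# '#' begins a URI fragment, so everything after it is split off the path
# and the remaining path no longer points at an existing file.
print(urlparse("/tmp/bad # actor.parquet"))
# ParseResult(scheme='', netloc='', path='/tmp/bad ', params='', query='',
#             fragment=' actor.parquet')
{code}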
[jira] [Created] (ARROW-4558) [C++][Flight] Avoid undefined behavior with gRPC memory optimizations
Wes McKinney created ARROW-4558: --- Summary: [C++][Flight] Avoid undefined behavior with gRPC memory optimizations Key: ARROW-4558 URL: https://issues.apache.org/jira/browse/ARROW-4558 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 0.13.0 Because the {{Write}} function and others on {{ServerWriter}} and {{ClientReader}} are declared virtual, some compilers may not behave in the way we want. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4557) [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` method
Paul Taylor created ARROW-4557: -- Summary: [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` method Key: ARROW-4557 URL: https://issues.apache.org/jira/browse/ARROW-4557 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: JS-0.4.0 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: JS-0.5.0 Presently Table, Schema, and RecordBatch have basic {{select(...colNames)}} implementations. Having an easy {{selectAt(...colIndices)}} impl would be a nice complement, especially when there are duplicate column names. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4556) [Rust] Preserve order of JSON inferred schema
Neville Dipale created ARROW-4556: - Summary: [Rust] Preserve order of JSON inferred schema Key: ARROW-4556 URL: https://issues.apache.org/jira/browse/ARROW-4556 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale serde_json has the ability to preserve order of JSON records read. This feature might be necessary to ensure that schema inference returns a consistent order of fields each time. I'd like to add it separately as I'd also need to update JSON tests in datatypes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4555) [JS] Add high-level Table and Column creation methods
Paul Taylor created ARROW-4555: -- Summary: [JS] Add high-level Table and Column creation methods Key: ARROW-4555 URL: https://issues.apache.org/jira/browse/ARROW-4555 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: 0.4.0 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: 0.4.1 It'd be great to have a few high-level functions that implicitly create the Schema, RecordBatches, etc. from a Table and a list of Columns. For example:

{code:actionscript}
const table = Table.new(
  Column.new('foo', ...),
  Column.new('bar', ...)
);
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4554) [JS] Implement logic for combining Vectors with different lengths/chunksizes
Paul Taylor created ARROW-4554: -- Summary: [JS] Implement logic for combining Vectors with different lengths/chunksizes Key: ARROW-4554 URL: https://issues.apache.org/jira/browse/ARROW-4554 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: 0.4.0 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: 0.4.1 We should add logic to combine and possibly slice/re-chunk and uniformly partition chunks into separate RecordBatches. This will make it easier to create Tables or RecordBatches from Vectors of different lengths. This is also necessary for {{Table#assign()}}. PR incoming. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4553) [JS] Implement Schema/Field/DataType comparators
Paul Taylor created ARROW-4553: -- Summary: [JS] Implement Schema/Field/DataType comparators Key: ARROW-4553 URL: https://issues.apache.org/jira/browse/ARROW-4553 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: 0.4.0 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: 0.4.1 Some basic type comparison logic is necessary for {{Table#assign()}}. PR incoming. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4552) [JS] Table and Schema assign implementations
Paul Taylor created ARROW-4552: -- Summary: [JS] Table and Schema assign implementations Key: ARROW-4552 URL: https://issues.apache.org/jira/browse/ARROW-4552 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: 0.4.0 Reporter: Paul Taylor Assignee: Paul Taylor It'd be really handy to have basic {{assign}} methods on the Table and Schema. I've extracted and cleaned up some internal helper methods I have that do this. PR incoming. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4551) [JS] Investigate using Symbols to access Row columns by index
Brian Hulette created ARROW-4551: Summary: [JS] Investigate using Symbols to access Row columns by index Key: ARROW-4551 URL: https://issues.apache.org/jira/browse/ARROW-4551 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Can we use row[Symbol.for(0)] instead of row[0] in order to avoid collisions? What would the performance impact be? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4550) [JS] Fix AMD pattern
Dominik Moritz created ARROW-4550: - Summary: [JS] Fix AMD pattern Key: ARROW-4550 URL: https://issues.apache.org/jira/browse/ARROW-4550 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Dominik Moritz -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [Rust] Rust 0.13.0 release
I’m also interested in the Parquet/Arrow integration and may help there. This is however a relative large feature and I’m not sure if it can be done in 0.13. Another area I’d like to work in is high level Parquet writer support. This issue has been discussed several times in the past. People should not need to specify definition & repetition levels in order to write data in Parquet format. Chao On Wed, Feb 13, 2019 at 10:24 AM paddy horan wrote: > Hi All, > > The focus for me for 0.13.0 is SIMD. I would like to port all the "ops" > in "array_ops" to the new "compute" module and leverage SIMD for them all. > I have most of this done in various forks. > > Past 0.13.0 I would really like to work toward getting Rust running in the > integration tests. The thing I am most excited about regarding Arrow is > the concept of defining computational libraries in say Rust and being able > to use them from any implementation, pyarrow probably for me. This all > starts and ends with the integration tests. > > Also, Gandiva is fascinating I would love to have robust support for this > in Rust (via bindings)... > > Regards, > P > > > > From: Neville Dipale > Sent: Tuesday, February 12, 2019 11:33 AM > To: dev@arrow.apache.org > Subject: Re: [Rust] Rust 0.13.0 release > > Thanks for bringing this up Andy. > > I'm unemployed/on recovery leave, so I've had some surplus time to work on > Rust. > > There's a lot of features that I've wanted to work on, some which I've > spent some time attempting, but struggled with. A few block additional work > that I could contribute. > > In 0.13.0 and the release thereafter: I'd like to see: > > Date/time support. I've spent a lot of time trying to implement this, but I > get the feeling that my Rust isn't good enough yet to pull this together. > > More IO support. > I'm working on JSON reader, and want to work on JSON and CSV (continuing > where you left off) writers after this. > With date/time support, I can also work on date/time parsing so we can have > these in CSV and JSON. > Parquet support isn't on my radar at the moment. JSON and CSV are more > commonly used, so I'm hoping that with concrete support for these, more > people using Rust can choose to integrate Arrow. That could bring us more > hands to help. > > Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried > working on it but failed. Related to this would be array chunking. > I need these in order to be able to operate on "Tables" like CPP, Python > and others. I've got ChunkedArray, Column and Table roughly implemented in > my fork, but without zero-copy slicing, I can't upstream them. > > I've made good progress on scalar and array operations. I have trig > functions, some string operators and other functions that one can run on a > Spark-esque dataframe. > These will fit in well with DataFusion's SQL operations, but from a > decision-perspective, I think it would help if we join heads and think > about the direction we want to take on compute. > > SIMD is great, and when Paddy's hashed out how it works, more of us will be > able to contribute SIMD compatible compute operators. > > Thanks, > Neville > > On Tue, 12 Feb 2019 at 18:12, Andy Grove wrote: > > > I was curious what our Rust committers and contributors are excited about > > for 0.13.0. > > > > The feature I would most like to see is that ability for DataFusion to > run > > SQL against Parquet files again, as that would give me an excuse for a > PoC > > in my day job using Arrow. 
> > > > I know there were some efforts underway to build arrow array readers for > > Parquet and it would make sense for me to help there. > > > > I would also like to start building out some benchmarks. > > > > I think the SIMD work is exciting too. > > > > I'd like to hear thoughts from everyone else though since we're all > coming > > at this from different perspectives. > > > > Thanks, > > > > Andy. > > >
Re: [Rust] Rust 0.13.0 release
Hi All, The focus for me for 0.13.0 is SIMD. I would like to port all the "ops" in "array_ops" to the new "compute" module and leverage SIMD for them all. I have most of this done in various forks. Past 0.13.0 I would really like to work toward getting Rust running in the integration tests. The thing I am most excited about regarding Arrow is the concept of defining computational libraries in say Rust and being able to use them from any implementation, pyarrow probably for me. This all starts and ends with the integration tests. Also, Gandiva is fascinating I would love to have robust support for this in Rust (via bindings)... Regards, P From: Neville Dipale Sent: Tuesday, February 12, 2019 11:33 AM To: dev@arrow.apache.org Subject: Re: [Rust] Rust 0.13.0 release Thanks for bringing this up Andy. I'm unemployed/on recovery leave, so I've had some surplus time to work on Rust. There's a lot of features that I've wanted to work on, some which I've spent some time attempting, but struggled with. A few block additional work that I could contribute. In 0.13.0 and the release thereafter: I'd like to see: Date/time support. I've spent a lot of time trying to implement this, but I get the feeling that my Rust isn't good enough yet to pull this together. More IO support. I'm working on JSON reader, and want to work on JSON and CSV (continuing where you left off) writers after this. With date/time support, I can also work on date/time parsing so we can have these in CSV and JSON. Parquet support isn't on my radar at the moment. JSON and CSV are more commonly used, so I'm hoping that with concrete support for these, more people using Rust can choose to integrate Arrow. That could bring us more hands to help. Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried working on it but failed. Related to this would be array chunking. I need these in order to be able to operate on "Tables" like CPP, Python and others. I've got ChunkedArray, Column and Table roughly implemented in my fork, but without zero-copy slicing, I can't upstream them. I've made good progress on scalar and array operations. I have trig functions, some string operators and other functions that one can run on a Spark-esque dataframe. These will fit in well with DataFusion's SQL operations, but from a decision-perspective, I think it would help if we join heads and think about the direction we want to take on compute. SIMD is great, and when Paddy's hashed out how it works, more of us will be able to contribute SIMD compatible compute operators. Thanks, Neville On Tue, 12 Feb 2019 at 18:12, Andy Grove wrote: > I was curious what our Rust committers and contributors are excited about > for 0.13.0. > > The feature I would most like to see is that ability for DataFusion to run > SQL against Parquet files again, as that would give me an excuse for a PoC > in my day job using Arrow. > > I know there were some efforts underway to build arrow array readers for > Parquet and it would make sense for me to help there. > > I would also like to start building out some benchmarks. > > I think the SIMD work is exciting too. > > I'd like to hear thoughts from everyone else though since we're all coming > at this from different perspectives. > > Thanks, > > Andy. >
[jira] [Created] (ARROW-4549) [C++] Can't build benchmark code on CUDA enabled build
Kouhei Sutou created ARROW-4549: --- Summary: [C++] Can't build benchmark code on CUDA enabled build Key: ARROW-4549 URL: https://issues.apache.org/jira/browse/ARROW-4549 Project: Apache Arrow Issue Type: Bug Components: C++, GPU Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4548) [C++] run-clang-format.py is not supported on Windows
Wes McKinney created ARROW-4548: --- Summary: [C++] run-clang-format.py is not supported on Windows Key: ARROW-4548 URL: https://issues.apache.org/jira/browse/ARROW-4548 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Wes McKinney I tried to fix it but no matter what option I pass for {{--line-ending}} to {{cmake-format}} it converts LF line endings to CRLF. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4547) [Python][Documentation] Update python/development.rst with instructions for CUDA-enabled builds
Wes McKinney created ARROW-4547: --- Summary: [Python][Documentation] Update python/development.rst with instructions for CUDA-enabled builds Key: ARROW-4547 URL: https://issues.apache.org/jira/browse/ARROW-4547 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Reporter: Wes McKinney Fix For: 0.13.0 Building a CUDA-enabled install is not documented -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Arrow Flight protocol/API questions
On Tue, Feb 12, 2019 at 3:46 PM Antoine Pitrou wrote: > > > Le 12/02/2019 à 22:34, Wes McKinney a écrit : > > On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou wrote: > >> > >> > >> Hi David, > >> > >> I think allowing to send application-specific ancillary data in addition > >> to Arrow data makes sense. > >> > >> (I'm also wondering whether the choice of gRPC is appropriate at all - > >> the current C++ hacks around "zero-copy" are not pretty and they may not > >> translate to other languages either) > >> > > > > This is unrelated to the discussion of extending the Flight protocol, > > but I'm not sure I would describe the serialization optimizations that > > have been implemented as "hacks". gRPC exposes its message > > serialization layer among other things to permit extensibility and to > > not require the use of Protocol Buffers necessarily. > > One thing that surfaced is that the current implementation relies on C++ > undefined behaviour (the reinterpret_cast from pb::FlightData to the > unrelated struct FlightData). I don't know if there's a way to > reimplement the optimization without that cast, but otherwise it's cause > for worry, IMHO. Is there a JIRA about this? I spent some time looking around gRPC's C++ library (which is header-only) and AFAICT the only exposure of the template parameter to any relevant part of the code is at the SerializationTraits interface, so the two template types should be internally isomorphic (but I am not a C++ language lawyer). There may be a safer way to get the library to generate the code we are looking for. Note that the initial C++ implementation was written over a short period of a few days; my goal was to get something working and do more research later > > > The reason that we chose to use the Protobuf wire format for all > > message types, including data, is that there is excellent > > cross-language support for protobufs, and among production-ready RPC > > frameworks, gRPC has the most robust language support, covering pretty > > much all the languages we care about: > > https://github.com/grpc/grpc#to-start-using-grpc. The only one missing > > is Rust, and I reckon that will get rectified at some point (there is > > already https://github.com/stepancheg/grpc-rust, maybe it will be > > adopted into gRPC formally at some point). But to have C++, C#, Go, > > Java, and Node officially supported out of the box is not nothing. I > > think it would be unwise to go a different way unless you have some > > compelling reason that gRPC / HTTP/2 is fundamentally flawed this this > > intended use. > > Since our use case pretty much requires high-performance transmission > with as few copies as possible (ideally, data should be directly sent > from/received to Arrow buffers without any intermediate userspace > copies), I think we should evaluate whether gRPC can allow us to achieve > that (there are still copies currently, AFAICT), and at which cost. > > As a side note, the Flight C++ benchmark currently achieves a bit more > than 2 GB/s here. There may be ways to improve this number (does gRPC > enable TLS by default? does it compress by default?)... > One design question as we work on this project is how one could open a "side channel" of sorts for moving the dataset itself outside of gRPC but still using the flexible command layer > Regards > > Antoine.
Re: Arrow Flight protocol/API questions
Le 12/02/2019 à 22:34, Wes McKinney a écrit : > On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou wrote: >> >> >> Hi David, >> >> I think allowing to send application-specific ancillary data in addition >> to Arrow data makes sense. >> >> (I'm also wondering whether the choice of gRPC is appropriate at all - >> the current C++ hacks around "zero-copy" are not pretty and they may not >> translate to other languages either) >> > > This is unrelated to the discussion of extending the Flight protocol, > but I'm not sure I would describe the serialization optimizations that > have been implemented as "hacks". gRPC exposes its message > serialization layer among other things to permit extensibility and to > not require the use of Protocol Buffers necessarily. One thing that surfaced is that the current implementation relies on C++ undefined behaviour (the reinterpret_cast from pb::FlightData to the unrelated struct FlightData). I don't know if there's a way to reimplement the optimization without that cast, but otherwise it's cause for worry, IMHO. > The reason that we chose to use the Protobuf wire format for all > message types, including data, is that there is excellent > cross-language support for protobufs, and among production-ready RPC > frameworks, gRPC has the most robust language support, covering pretty > much all the languages we care about: > https://github.com/grpc/grpc#to-start-using-grpc. The only one missing > is Rust, and I reckon that will get rectified at some point (there is > already https://github.com/stepancheg/grpc-rust, maybe it will be > adopted into gRPC formally at some point). But to have C++, C#, Go, > Java, and Node officially supported out of the box is not nothing. I > think it would be unwise to go a different way unless you have some > compelling reason that gRPC / HTTP/2 is fundamentally flawed this this > intended use. Since our use case pretty much requires high-performance transmission with as few copies as possible (ideally, data should be directly sent from/received to Arrow buffers without any intermediate userspace copies), I think we should evaluate whether gRPC can allow us to achieve that (there are still copies currently, AFAICT), and at which cost. As a side note, the Flight C++ benchmark currently achieves a bit more than 2 GB/s here. There may be ways to improve this number (does gRPC enable TLS by default? does it compress by default?)... Regards Antoine.
Re: Arrow Flight protocol/API questions
Even if zeromq did make more sense, we couldn't take it on as a dependency because of non-ASF-compatible licenses Java zeromq: MPL 2.0 libzmq: GPL On Tue, Feb 12, 2019 at 3:33 PM Jonathan Chiang wrote: > > Would zeromq make more sense than gRPC? > > Thanks, > Jonathan > > > On Feb 12, 2019, at 12:48 PM, Antoine Pitrou wrote: > > > > > > Hi David, > > > > I think allowing to send application-specific ancillary data in addition > > to Arrow data makes sense. > > > > (I'm also wondering whether the choice of gRPC is appropriate at all - > > the current C++ hacks around "zero-copy" are not pretty and they may not > > translate to other languages either) > > > > Regards > > > > Antoine. > > > > > >> Le 12/02/2019 à 21:44, David Ming Li a écrit : > >> Hi all, > >> > >> > >> > >> We've been evaluating Flight for our use, and we're wondering if the > >> protocol is still open to extensions, as having a few application-defined > >> metadata fields would help our use cases a lot. > >> > >> > >> > >> (Apologies if this is a repost - was having issue with the spam filter.) > >> > >> > >> > >> Specifically, in DoGet, having a metadata binary blob in the > >> server->client messages would help implement resumable requests, > >> especially as we have non-monotonically-indexed data streams. This would > >> also help us reuse server-side state if we do have to resume a stream. > >> > >> > >> > >> In DoPut, we think making this call bidirectional would be useful to > >> support application-level ACKs, again to implement resumable uploads. The > >> server would thus have the option to send back an application-defined > >> binary blob at any point during an upload. This is less important, as you > >> could imagine starting a plain gRPC server-streaming call alongside the > >> Flight DoPut call to do the same. But as you can't bind a gRPC and Flight > >> service on the same port/channel, this is somewhat inconvenient. > >> > >> > >> > >> That leads me to the API-level niggles we have; it would be nice to be > >> able to bind gRPC services alongside a Flight service, and conversely be > >> able to reuse a gRPC channel across gRPC and Flight clients, though > >> breaking the hiding of gRPC isn't desirable. > >> > >> > >> > >> Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' > >> methods, so that we don't have to busy-wait ourselves (as in Java) or have > >> the option to not busy-wait taken away from us (as in C++). In particular, > >> when investigating Python bindings to C++ [0], the fact that > >> FlightServerBase::Run also calls grpc::Server::Wait for you means that > >> Ctrl-C no longer works in Python. > >> > >> > >> > >> Does what we're trying to accomplish make sense? Are there better ways to > >> achieve resumable uploads/downloads in the current protocol? > >> > >> > >> > >> [0]: https://github.com/apache/arrow/pull/3566 > >> > >> > >> > >> Thanks, > >> > >> David > >> > >>
Re: Arrow Flight protocol/API questions
On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou wrote: > > > Hi David, > > I think allowing to send application-specific ancillary data in addition > to Arrow data makes sense. > > (I'm also wondering whether the choice of gRPC is appropriate at all - > the current C++ hacks around "zero-copy" are not pretty and they may not > translate to other languages either) > This is unrelated to the discussion of extending the Flight protocol, but I'm not sure I would describe the serialization optimizations that have been implemented as "hacks". gRPC exposes its message serialization layer among other things to permit extensibility and to not require the use of Protocol Buffers necessarily. The reason that we chose to use the Protobuf wire format for all message types, including data, is that there is excellent cross-language support for protobufs, and among production-ready RPC frameworks, gRPC has the most robust language support, covering pretty much all the languages we care about: https://github.com/grpc/grpc#to-start-using-grpc. The only one missing is Rust, and I reckon that will get rectified at some point (there is already https://github.com/stepancheg/grpc-rust, maybe it will be adopted into gRPC formally at some point). But to have C++, C#, Go, Java, and Node officially supported out of the box is not nothing. I think it would be unwise to go a different way unless you have some compelling reason that gRPC / HTTP/2 is fundamentally flawed this this intended use. For the FlightData message in particular, if a particular Flight client is unconcerned with memory optimizations, they can not bother with it and simply leave the serialization to their Protocol Buffers implementation. This also means that Arrow-agnostic gRPC clients can interact with Flight services using only the Flight.proto and some knowledge about what commands the server provides. In speaking with others parties about Flight, there is some interest in supporting different underlying data movement schemes while preserving the gRPC command layer, e.g. optimizing for high-bandwidth networking like infiniband. - Wes > Regards > > Antoine. > > > Le 12/02/2019 à 21:44, David Ming Li a écrit : > > Hi all, > > > > > > > > We've been evaluating Flight for our use, and we're wondering if the > > protocol is still open to extensions, as having a few application-defined > > metadata fields would help our use cases a lot. > > > > > > > > (Apologies if this is a repost - was having issue with the spam filter.) > > > > > > > > Specifically, in DoGet, having a metadata binary blob in the server->client > > messages would help implement resumable requests, especially as we have > > non-monotonically-indexed data streams. This would also help us reuse > > server-side state if we do have to resume a stream. > > > > > > > > In DoPut, we think making this call bidirectional would be useful to > > support application-level ACKs, again to implement resumable uploads. The > > server would thus have the option to send back an application-defined > > binary blob at any point during an upload. This is less important, as you > > could imagine starting a plain gRPC server-streaming call alongside the > > Flight DoPut call to do the same. But as you can't bind a gRPC and Flight > > service on the same port/channel, this is somewhat inconvenient. 
> > > > > > > > That leads me to the API-level niggles we have; it would be nice to be able > > to bind gRPC services alongside a Flight service, and conversely be able to > > reuse a gRPC channel across gRPC and Flight clients, though breaking the > > hiding of gRPC isn't desirable. > > > > > > > > Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' > > methods, so that we don't have to busy-wait ourselves (as in Java) or have > > the option to not busy-wait taken away from us (as in C++). In particular, > > when investigating Python bindings to C++ [0], the fact that > > FlightServerBase::Run also calls grpc::Server::Wait for you means that > > Ctrl-C no longer works in Python. > > > > > > > > Does what we're trying to accomplish make sense? Are there better ways to > > achieve resumable uploads/downloads in the current protocol? > > > > > > > > [0]: https://github.com/apache/arrow/pull/3566 > > > > > > > > Thanks, > > > > David > > > >
Re: Arrow Flight protocol/API questions
Would zeromq make more sense than gRPC? Thanks, Jonathan > On Feb 12, 2019, at 12:48 PM, Antoine Pitrou wrote: > > > Hi David, > > I think allowing to send application-specific ancillary data in addition > to Arrow data makes sense. > > (I'm also wondering whether the choice of gRPC is appropriate at all - > the current C++ hacks around "zero-copy" are not pretty and they may not > translate to other languages either) > > Regards > > Antoine. > > >> Le 12/02/2019 à 21:44, David Ming Li a écrit : >> Hi all, >> >> >> >> We've been evaluating Flight for our use, and we're wondering if the >> protocol is still open to extensions, as having a few application-defined >> metadata fields would help our use cases a lot. >> >> >> >> (Apologies if this is a repost - was having issue with the spam filter.) >> >> >> >> Specifically, in DoGet, having a metadata binary blob in the server->client >> messages would help implement resumable requests, especially as we have >> non-monotonically-indexed data streams. This would also help us reuse >> server-side state if we do have to resume a stream. >> >> >> >> In DoPut, we think making this call bidirectional would be useful to support >> application-level ACKs, again to implement resumable uploads. The server >> would thus have the option to send back an application-defined binary blob >> at any point during an upload. This is less important, as you could imagine >> starting a plain gRPC server-streaming call alongside the Flight DoPut call >> to do the same. But as you can't bind a gRPC and Flight service on the same >> port/channel, this is somewhat inconvenient. >> >> >> >> That leads me to the API-level niggles we have; it would be nice to be able >> to bind gRPC services alongside a Flight service, and conversely be able to >> reuse a gRPC channel across gRPC and Flight clients, though breaking the >> hiding of gRPC isn't desirable. >> >> >> >> Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' >> methods, so that we don't have to busy-wait ourselves (as in Java) or have >> the option to not busy-wait taken away from us (as in C++). In particular, >> when investigating Python bindings to C++ [0], the fact that >> FlightServerBase::Run also calls grpc::Server::Wait for you means that >> Ctrl-C no longer works in Python. >> >> >> >> Does what we're trying to accomplish make sense? Are there better ways to >> achieve resumable uploads/downloads in the current protocol? >> >> >> >> [0]: https://github.com/apache/arrow/pull/3566 >> >> >> >> Thanks, >> >> David >> >>
Re: Arrow Flight protocol/API questions
Hi David, I think allowing to send application-specific ancillary data in addition to Arrow data makes sense. (I'm also wondering whether the choice of gRPC is appropriate at all - the current C++ hacks around "zero-copy" are not pretty and they may not translate to other languages either) Regards Antoine. Le 12/02/2019 à 21:44, David Ming Li a écrit : > Hi all, > > > > We've been evaluating Flight for our use, and we're wondering if the protocol > is still open to extensions, as having a few application-defined metadata > fields would help our use cases a lot. > > > > (Apologies if this is a repost - was having issue with the spam filter.) > > > > Specifically, in DoGet, having a metadata binary blob in the server->client > messages would help implement resumable requests, especially as we have > non-monotonically-indexed data streams. This would also help us reuse > server-side state if we do have to resume a stream. > > > > In DoPut, we think making this call bidirectional would be useful to support > application-level ACKs, again to implement resumable uploads. The server > would thus have the option to send back an application-defined binary blob at > any point during an upload. This is less important, as you could imagine > starting a plain gRPC server-streaming call alongside the Flight DoPut call > to do the same. But as you can't bind a gRPC and Flight service on the same > port/channel, this is somewhat inconvenient. > > > > That leads me to the API-level niggles we have; it would be nice to be able > to bind gRPC services alongside a Flight service, and conversely be able to > reuse a gRPC channel across gRPC and Flight clients, though breaking the > hiding of gRPC isn't desirable. > > > > Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' > methods, so that we don't have to busy-wait ourselves (as in Java) or have > the option to not busy-wait taken away from us (as in C++). In particular, > when investigating Python bindings to C++ [0], the fact that > FlightServerBase::Run also calls grpc::Server::Wait for you means that Ctrl-C > no longer works in Python. > > > > Does what we're trying to accomplish make sense? Are there better ways to > achieve resumable uploads/downloads in the current protocol? > > > > [0]: https://github.com/apache/arrow/pull/3566 > > > > Thanks, > > David > >
Arrow Flight protocol/API questions
Hi all, We've been evaluating Flight for our use, and we're wondering if the protocol is still open to extensions, as having a few application-defined metadata fields would help our use cases a lot. (Apologies if this is a repost - was having issue with the spam filter.) Specifically, in DoGet, having a metadata binary blob in the server->client messages would help implement resumable requests, especially as we have non-monotonically-indexed data streams. This would also help us reuse server-side state if we do have to resume a stream. In DoPut, we think making this call bidirectional would be useful to support application-level ACKs, again to implement resumable uploads. The server would thus have the option to send back an application-defined binary blob at any point during an upload. This is less important, as you could imagine starting a plain gRPC server-streaming call alongside the Flight DoPut call to do the same. But as you can't bind a gRPC and Flight service on the same port/channel, this is somewhat inconvenient. That leads me to the API-level niggles we have; it would be nice to be able to bind gRPC services alongside a Flight service, and conversely be able to reuse a gRPC channel across gRPC and Flight clients, though breaking the hiding of gRPC isn't desirable. Meanwhile, it would be nice to wrap the gRPC server 'awaitTermination' methods, so that we don't have to busy-wait ourselves (as in Java) or have the option to not busy-wait taken away from us (as in C++). In particular, when investigating Python bindings to C++ [0], the fact that FlightServerBase::Run also calls grpc::Server::Wait for you means that Ctrl-C no longer works in Python. Does what we're trying to accomplish make sense? Are there better ways to achieve resumable uploads/downloads in the current protocol? [0]: https://github.com/apache/arrow/pull/3566 Thanks, David
[jira] [Created] (ARROW-4546) LICENSE.txt should be updated.
Renat Valiullin created ARROW-4546: -- Summary: LICENSE.txt should be updated. Key: ARROW-4546 URL: https://issues.apache.org/jira/browse/ARROW-4546 Project: Apache Arrow Issue Type: Task Reporter: Renat Valiullin parquet-cpp/blob/master/LICENSE.txt is not mentioned there. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4545) [C#] Extend Append/AppendRange in BinaryArray to support building rows
Chris Hutchinson created ARROW-4545: --- Summary: [C#] Extend Append/AppendRange in BinaryArray to support building rows Key: ARROW-4545 URL: https://issues.apache.org/jira/browse/ARROW-4545 Project: Apache Arrow Issue Type: Improvement Components: C# Reporter: Chris Hutchinson Fix For: 0.13.0

This is a proposal to extend BinaryArray to provide the ability to call Append/AppendRange to grow individual rows during array building, and to expose values in ArrowBuffer.Builder through a property to facilitate algorithms that require introspecting the buffer data (sorting, filtering) when building an array.

*Example:*

{code:java}
var builder = new BinaryArray.Builder()
    .Append(10, false)
    .Append(20, false)
    .Mark();

builder.Append(builder.Values[0], true);

var array = builder.Build();

// General idea:
//
// 1. Append byte (10) to current element (0)
// 2. Append byte (20) to current element (0)
// 3. Mark end of the row
// 4. Append byte (10) to current element (1)

// Constructs a binary array with 2 elements:
//
// [0] 10, 20
// [1] 10
{code}

This proposed change would add the concept of a "current element" to the builder; in the specification, elements are separated by recording value offsets. Append(true) appends one or more bytes to the current element and then marks the element as completed. Append(false) appends one or more bytes to the current element; Mark is required to signal to the builder that the current element is complete.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [Rust] Rust 0.13.0 release
Thanks for bringing this up Andy. I'm unemployed/on recovery leave, so I've had some surplus time to work on Rust. There's a lot of features that I've wanted to work on, some which I've spent some time attempting, but struggled with. A few block additional work that I could contribute. In 0.13.0 and the release thereafter: I'd like to see: Date/time support. I've spent a lot of time trying to implement this, but I get the feeling that my Rust isn't good enough yet to pull this together. More IO support. I'm working on JSON reader, and want to work on JSON and CSV (continuing where you left off) writers after this. With date/time support, I can also work on date/time parsing so we can have these in CSV and JSON. Parquet support isn't on my radar at the moment. JSON and CSV are more commonly used, so I'm hoping that with concrete support for these, more people using Rust can choose to integrate Arrow. That could bring us more hands to help. Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I tried working on it but failed. Related to this would be array chunking. I need these in order to be able to operate on "Tables" like CPP, Python and others. I've got ChunkedArray, Column and Table roughly implemented in my fork, but without zero-copy slicing, I can't upstream them. I've made good progress on scalar and array operations. I have trig functions, some string operators and other functions that one can run on a Spark-esque dataframe. These will fit in well with DataFusion's SQL operations, but from a decision-perspective, I think it would help if we join heads and think about the direction we want to take on compute. SIMD is great, and when Paddy's hashed out how it works, more of us will be able to contribute SIMD compatible compute operators. Thanks, Neville On Tue, 12 Feb 2019 at 18:12, Andy Grove wrote: > I was curious what our Rust committers and contributors are excited about > for 0.13.0. > > The feature I would most like to see is that ability for DataFusion to run > SQL against Parquet files again, as that would give me an excuse for a PoC > in my day job using Arrow. > > I know there were some efforts underway to build arrow array readers for > Parquet and it would make sense for me to help there. > > I would also like to start building out some benchmarks. > > I think the SIMD work is exciting too. > > I'd like to hear thoughts from everyone else though since we're all coming > at this from different perspectives. > > Thanks, > > Andy. >
[Rust] Rust 0.13.0 release
I was curious what our Rust committers and contributors are excited about for 0.13.0. The feature I would most like to see is that ability for DataFusion to run SQL against Parquet files again, as that would give me an excuse for a PoC in my day job using Arrow. I know there were some efforts underway to build arrow array readers for Parquet and it would make sense for me to help there. I would also like to start building out some benchmarks. I think the SIMD work is exciting too. I'd like to hear thoughts from everyone else though since we're all coming at this from different perspectives. Thanks, Andy.
[jira] [Created] (ARROW-4544) [Rust] Read nested JSON structs into StructArrays
Neville Dipale created ARROW-4544: - Summary: [Rust] Read nested JSON structs into StructArrays Key: ARROW-4544 URL: https://issues.apache.org/jira/browse/ARROW-4544 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale _Adding this as a separate task as it's a bit involved._ Add the ability to read in JSON structs that are children of the JSON record being read. The main concern here is deeply nested structures, which will require a performant and reusable basic JSON reader before dealing with recursion. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4543) [C#] Update Flat Buffers code to latest version
Eric Erhardt created ARROW-4543: --- Summary: [C#] Update Flat Buffers code to latest version Key: ARROW-4543 URL: https://issues.apache.org/jira/browse/ARROW-4543 Project: Apache Arrow Issue Type: Improvement Components: C# Reporter: Eric Erhardt Assignee: Eric Erhardt In order to support zero-copy reads, we should update to the latest Google Flat Buffers code. A recent change now allows [C# support for directly reading and writing to memory other than byte[]|https://github.com/google/flatbuffers/pull/4886], which will make reading native memory using `Memory` possible. Along with this update, we should mark the flat buffers types as `internal`, since they are an implementation detail of the library. From an API perspective, it is confusing to see multiple public types named "Schema", "Field", "RecordBatch" etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4542) Denominate row group size in bytes (not in no of rows)
Remek Zajac created ARROW-4542: -- Summary: Denominate row group size in bytes (not in no of rows) Key: ARROW-4542 URL: https://issues.apache.org/jira/browse/ARROW-4542 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Remek Zajac

Both the C++ [implementation of the parquet writer for arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1174] and the [Python code bound to it|https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L911] appear to denominate the row group size in the *number of rows* (without making it very explicit). Whereas:

(1) [The Apache parquet documentation|https://parquet.apache.org/documentation/latest/] states: "_Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). *We recommend large row groups (512MB - 1GB)*. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file._"

(2) The reference Apache [parquet-mr implementation|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L146] for Java accepts the row group size expressed in bytes.

(3) The [low-level parquet read-write example|https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer2.cc#L88] also considers the row group size to be denominated in bytes.

These insights make me conclude that:
* Per the parquet design, and to take advantage of HDFS block-level operations, it only makes sense to work with row group sizes expressed in bytes - as that is the only consequential quantity the caller can express and want to influence.
* The Arrow implementation of ParquetWriter would benefit from re-denominating its `row_group_size` in bytes.

Now, my conclusions can be wrong and I may be blind to some alley of reasoning, so this ticket is more of a question than a bug: a question of whether the audience here agrees with my reasoning and, if not, what detail I have missed.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
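For Python users hitting this in the meantime, a rough sketch of working with the current rows-denominated parameter by converting a target byte budget into a row count. The 128 MB target is arbitrary, and the availability of {{Table.nbytes}} in the installed pyarrow version is an assumption:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array(range(1_000_000))], ['x'])

# Approximate a byte-denominated row group by converting a byte budget
# into a row count before writing.
target_row_group_bytes = 128 * 1024 * 1024
bytes_per_row = max(1, table.nbytes // table.num_rows)
rows_per_group = max(1, target_row_group_bytes // bytes_per_row)

# row_group_size is currently interpreted as a number of rows.
pq.write_table(table, '/tmp/example.parquet', row_group_size=rows_per_group)
{code}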
[jira] [Created] (ARROW-4541) [Gandiva] Enable timestamp tests on windows platform
shyam narayan singh created ARROW-4541: -- Summary: [Gandiva] Enable timestamp tests on windows platform Key: ARROW-4541 URL: https://issues.apache.org/jira/browse/ARROW-4541 Project: Apache Arrow Issue Type: Improvement Reporter: shyam narayan singh As the timezone database is not available on the Windows operating system, the cast timestamp test cases that use timezone APIs are failing. The tests are currently disabled on the Windows platform. We need to find a way to test the timezone APIs on Windows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4540) [Rust] Add basic JSON reader
Neville Dipale created ARROW-4540: - Summary: [Rust] Add basic JSON reader Key: ARROW-4540 URL: https://issues.apache.org/jira/browse/ARROW-4540 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale This is the first step in getting a JSON reader working in Rust -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4539) [Java]List vector child value count not set correctly
Praveen Kumar Desabandu created ARROW-4539: -- Summary: [Java] List vector child value count not set correctly Key: ARROW-4539 URL: https://issues.apache.org/jira/browse/ARROW-4539 Project: Apache Arrow Issue Type: Task Reporter: Praveen Kumar Desabandu Assignee: Praveen Kumar Desabandu Fix For: 0.14.0 We are not correctly processing list vectors that could have null values. The child value count would be off, thereby losing data in variable width vectors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4538) pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()
Christian Thiel created ARROW-4538: -- Summary: pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset() Key: ARROW-4538 URL: https://issues.apache.org/jira/browse/ARROW-4538 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.12.0 Reporter: Christian Thiel

When using {{pa.Table.from_pandas()}} with preserve_index=True and dataframe.index.name != None, the prefix {{__index_level_}} is not added to the respective schema name. This breaks {{write_to_dataset}} with active partition columns.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)
df.index.name = 'ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table,
                    root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'],
                    preserve_index=True)
{code}

Removing {{df.index.name='ID'}} works. Also disabling {{partition_cols}} in {{write_to_dataset}} works.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
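A possible workaround sketch until this is resolved, based on the observation above that removing the index name avoids the problem; hypothetical and untested against 0.12, and it loses the 'ID' label:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'partition_column': [0, 0, 1, 1]})
df.index.name = 'ID'

# Clear the index name before conversion so that from_pandas falls back to
# the default __index_level_0__ naming.
df_workaround = df.copy()
df_workaround.index.name = None

table = pa.Table.from_pandas(df_workaround, preserve_index=True)
print(table.schema.names)
{code}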
[jira] [Created] (ARROW-4537) [CI] Suppress shell warning on travis-ci
Kenta Murata created ARROW-4537: --- Summary: [CI] Suppress shell warning on travis-ci Key: ARROW-4537 URL: https://issues.apache.org/jira/browse/ARROW-4537 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Reporter: Kenta Murata Suppress shell warnings like: {{+'[' == 1 ']'}} {{/home/travis/build/apache/arrow/ci/travis_before_script_cpp.sh: line 81: [: ==: unary operator expected}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)