Re: Java: DefaultVectorComparators - invalid implementation
Hi Martin,

Thank you so much for reporting this problem. In the current implementation, we do not consider corner cases related to integer overflow, and this problem should be fixed. I have opened an issue to track it [1]. Would you like to provide a patch for it?

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-8392

On Fri, Apr 10, 2020 at 12:55 AM Martin Janda wrote:
> I took a first look at the Apache Arrow Java sources.
>
> I found an incorrect implementation in DefaultVectorComparators.LongComparator, and I suspect that other comparators may be wrong too.
>
> A simple test:
>
> long l1 = Long.MIN_VALUE + 1L;
> long l2 = Long.MAX_VALUE;
>
> System.out.println("Arrow: " + Long.signum(l1 - l2));
> System.out.println("Java : " + Long.compare(l1, l2));
>
> Result:
> Arrow: 1
> Java : -1
>
> I think tests should be added for these corner cases. And I suggest replacing the Arrow implementation with Long.compare, Integer.compare, etc., which produce correct results.
>
> Thank you
> Martin
>
> PS I have no access to JIRA to report this issue.
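The overflow can be reproduced with nothing but the JDK. The sketch below mirrors the subtraction-based comparison pattern described in the report (`compareBySubtraction` is a hypothetical helper for illustration, not the actual `DefaultVectorComparators` source):

```java
public class ComparatorOverflow {
    // Sketch of the subtraction-based comparison pattern from the report
    // (hypothetical helper; not the actual Arrow code).
    static int compareBySubtraction(long a, long b) {
        return Long.signum(a - b); // (a - b) can overflow and flip the sign
    }

    public static void main(String[] args) {
        long l1 = Long.MIN_VALUE + 1L; // == -Long.MAX_VALUE
        long l2 = Long.MAX_VALUE;
        // l1 - l2 == -2 * Long.MAX_VALUE, which wraps around to +2,
        // so the subtraction-based result has the wrong sign
        System.out.println("subtraction : " + compareBySubtraction(l1, l2));
        System.out.println("Long.compare: " + Long.compare(l1, l2));
    }
}
```

`Long.compare` (and `Integer.compare`, etc.) avoid the intermediate subtraction entirely, which is why the report suggests them as the fix.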
[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison
Liya Fan created ARROW-8392:
---

Summary: [Java] Fix overflow related corner cases for vector value comparison
Key: ARROW-8392
URL: https://issues.apache.org/jira/browse/ARROW-8392
Project: Apache Arrow
Issue Type: Bug
Components: Java
Reporter: Liya Fan

1. Fix corner cases related to overflow.
2. Provide test cases for the corner cases.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8391) [C++] Implement row range read API for IPC file (and Feather)
Wes McKinney created ARROW-8391:
---

Summary: [C++] Implement row range read API for IPC file (and Feather)
Key: ARROW-8391
URL: https://issues.apache.org/jira/browse/ARROW-8391
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney

The objective would be to be able to read a range of rows from the middle of a file. It's not as easy as it might sound, since all the record batch metadata must be examined to determine the start and end point of the row range.
Re: mutual TLS auth support with arrow flight
Hey David,

This isn't exposed right now. You'd have to expose the gRPC option on the client and server sides; right now, while Flight does set up SSL credentials when TLS is enabled, it's only to allow you to set the root certificate on the client [1] and the server certificate [2]. There is already support for custom authentication methods, though, if you're amenable to something other than mTLS (e.g. username/password or auth token).

If you're interested in contributing this, I think it should be fairly straightforward - you'd just need to add options that get passed through to gRPC - though you'd have to also expose the option to Python.

For Java servers, you can use the flight-grpc artifact [3] to obtain a "plain" gRPC service from your Flight Producer implementation, which you can then attach to a gRPC server that you've configured with mTLS. Unfortunately this convenience isn't (reasonably) possible to implement in Python with the way that gRPC-C++ and gRPC-Python are designed.

Best,
David

[1]: https://github.com/apache/arrow/blob/2914899326d50d3e2853f5ecbd165028d862378f/cpp/src/arrow/flight/client.cc#L538-L542
[2]: https://github.com/apache/arrow/blob/2914899326d50d3e2853f5ecbd165028d862378f/cpp/src/arrow/flight/server.cc#L674-L678
[3]: https://search.maven.org/search?q=g:org.apache.arrow%20AND%20a:flight-grpc

On 4/9/20, David Seapy wrote:
> gRPC supports connections using mutual TLS with client and server certificates. Is there an example of how to do this with the Arrow Flight libraries, or does one need to step down to the gRPC level when making requests?
>
> Specifically, we are working on having data scientists establish a connection with our Scala Flight server from their Python client. If it is not currently supported but is a feature that the community would benefit from, then maybe we can take a stab at adding support for this.
>
> Any advice or pointers would be appreciated.
>
> Thanks!
>
> - David Seapy
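For readers unfamiliar with what the mTLS configuration would need to carry: at the TLS level, "mutual" just means the server also demands and verifies a client certificate. A minimal illustration with Python's standard-library `ssl` module (not the Flight or gRPC API — those would take the equivalent key/cert/CA material via their credentials objects; function name and file paths here are placeholders):

```python
import ssl

def make_mtls_server_context(cert_file, key_file, client_ca_file):
    # Ordinary server-side TLS: present our own certificate chain.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    # The "mutual" part: trust a CA for client certificates and
    # refuse connections from clients that don't present one.
    ctx.load_verify_locations(cafile=client_ca_file)
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```

Exposing mTLS through Flight would amount to plumbing these three pieces of material (server key/cert, plus a client-certificate CA with "require client auth") through to the underlying gRPC credentials.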
[jira] [Created] (ARROW-8390) [R] Expose schema unification features
Neal Richardson created ARROW-8390:
---

Summary: [R] Expose schema unification features
Key: ARROW-8390
URL: https://issues.apache.org/jira/browse/ARROW-8390
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 0.17.0
mutual TLS auth support with arrow flight
gRPC supports connections using mutual TLS with client and server certificates. Is there an example of how to do this with the Arrow Flight libraries, or does one need to step down to the gRPC level when making requests?

Specifically, we are working on having data scientists establish a connection with our Scala Flight server from their Python client. If it is not currently supported but is a feature that the community would benefit from, then maybe we can take a stab at adding support for this.

Any advice or pointers would be appreciated.

Thanks!

- David Seapy
Re: [Rust] Dictionary encoding and Flight
Well, luckily we have some newly spruced-up documentation about how integration testing works (thanks Neal!):

https://github.com/apache/arrow/blob/master/docs/source/format/Integration.rst

The main task is writing a parser for the JSON format used for integration testing. The JSON is used to communicate the "truth" about the data between two implementations. There are implementations of this in C++, Go, JS, and Java, so any of those implementations could be used as your guide.

One small reminder for others reading: the current JSON integration testing format doesn't yet capture dictionary deltas/replacements. See ARROW-5338.

On Thu, Apr 9, 2020 at 5:55 PM Paul Dix wrote:
>
> I'd be happy to pitch in on getting the integration tests developed. It would certainly beat my current method of building and running my test project and switching over to a Jupyter notebook to manually check it.
>
> Is there any prior work in the Rust project that I could basically copy from? Or perhaps the C++ implementation?
>
> On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney wrote:
> >
> > hi Paul,
> >
> > Dictionary-encoded is not a nested type, so there shouldn't be any children -- the IPC layout of a dictionary-encoded field is the same as the type of the indices (we probably want to change the terminology in the Rust library from "keys" to "indices", which is what's used in the specification). And the dictionaries are sent separately in record batches.
> >
> > > The Rust implementation of DictionaryArray keeps the values as a child to the keys array.
> >
> > What we do in C++ is that the dictionary is a separate member of the ArrayData data structure. I don't know enough about the Rust library to know whether that's the right design choice or not.
> >
> > Implementing Rust support for integration tests would make it much easier to validate that things are implemented correctly. Does anyone know how far away the Rust library might be from having them? It's not the most fun thing to implement, but it's very important.
> >
> > - Wes
> >
> > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix wrote:
> > >
> > > I managed to get something up and running. I ended up creating a dictionary_batch.rs and adding that to convert.rs to translate dictionary fields in a schema over to the correct fb thing. I also added a method to writer.rs to convert that to bytes so it can be sent via IPC. However, when writing out the subsequent record batches that contain a dictionary field that references the dictionary sent in the dictionary batch, it runs into a problem. The Rust implementation of DictionaryArray keeps the values as a child to the keys array.
> > >
> > > You can see the chain of calls when calling finish on the StringDictionaryBuilder
> > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > > which calls out to finish the dictionary
> > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > > which then calls `from` to create the DictionaryArray
> > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > > And that last part validates that the child is there.
> > >
> > > And here in the writer is where it blindly writes out any children:
> > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> > >
> > > I found that when I sent this out, the Python reader on the other side would choke on it when I tried read_pandas. However, if I skip serializing the children for DictionaryArray only, then everything works exactly as expected.
> > >
> > > So I'm not sure what the right thing here is, because I don't know if DictionaryArrays should not include the children (i.e. values) when being written out or if there's some other thing that should be happening. Should the raw arrow data have the values in that dictionary? Or does it get reconstituted manually on the other side from the DictionaryBatch?
> > >
> > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney wrote:
> > > >
> > > > As another item for consideration -- in C++ at least, the dictionary id is dealt with as an internal detail of the IPC message production process. When serializing the Schema, ids are assigned to each dictionary-encoded field in the DictionaryMemo object, see
> > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > > >
> > > > When record batches are reconstructed, the dictionary corresponding to an id at the time of reconstruction is set in the Array's internal data -- that's the "dictionary" member of the ArrayData object (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > > >
> > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney wrote:
> > > > >
Re: [Rust] Dictionary encoding and Flight
I'd be happy to pitch in on getting the integration tests developed. It would certainly beat my current method of building and running my test project and switching over to a Jupyter notebook to manually check it.

Is there any prior work in the Rust project that I could basically copy from? Or perhaps the C++ implementation?

On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney wrote:
>
> hi Paul,
>
> Dictionary-encoded is not a nested type, so there shouldn't be any children -- the IPC layout of a dictionary-encoded field is the same as the type of the indices (we probably want to change the terminology in the Rust library from "keys" to "indices", which is what's used in the specification). And the dictionaries are sent separately in record batches.
>
> > The Rust implementation of DictionaryArray keeps the values as a child to the keys array.
>
> What we do in C++ is that the dictionary is a separate member of the ArrayData data structure. I don't know enough about the Rust library to know whether that's the right design choice or not.
>
> Implementing Rust support for integration tests would make it much easier to validate that things are implemented correctly. Does anyone know how far away the Rust library might be from having them? It's not the most fun thing to implement, but it's very important.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 5:08 PM Paul Dix wrote:
> >
> > I managed to get something up and running. I ended up creating a dictionary_batch.rs and adding that to convert.rs to translate dictionary fields in a schema over to the correct fb thing. I also added a method to writer.rs to convert that to bytes so it can be sent via IPC. However, when writing out the subsequent record batches that contain a dictionary field that references the dictionary sent in the dictionary batch, it runs into a problem. The Rust implementation of DictionaryArray keeps the values as a child to the keys array.
> >
> > You can see the chain of calls when calling finish on the StringDictionaryBuilder
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > which calls out to finish the dictionary
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > which then calls `from` to create the DictionaryArray
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > And that last part validates that the child is there.
> >
> > And here in the writer is where it blindly writes out any children:
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> >
> > I found that when I sent this out, the Python reader on the other side would choke on it when I tried read_pandas. However, if I skip serializing the children for DictionaryArray only, then everything works exactly as expected.
> >
> > So I'm not sure what the right thing here is, because I don't know if DictionaryArrays should not include the children (i.e. values) when being written out or if there's some other thing that should be happening. Should the raw arrow data have the values in that dictionary? Or does it get reconstituted manually on the other side from the DictionaryBatch?
> >
> > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney wrote:
> > >
> > > As another item for consideration -- in C++ at least, the dictionary id is dealt with as an internal detail of the IPC message production process. When serializing the Schema, ids are assigned to each dictionary-encoded field in the DictionaryMemo object, see
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > >
> > > When record batches are reconstructed, the dictionary corresponding to an id at the time of reconstruction is set in the Array's internal data -- that's the "dictionary" member of the ArrayData object (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > >
> > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney wrote:
> > > >
> > > > hey Paul,
> > > >
> > > > Take a look at how dictionaries work in the IPC protocol
> > > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > >
> > > > Dictionaries are sent as separate messages. When a field is tagged as dictionary encoded in the schema, the IPC reader must keep track of the dictionaries it's seen come across the protocol and then set them in the reconstructed record batches when a record batch comes through.
> > > >
> > > > Note that the protocol now supports dictionary deltas (dictionaries can be appended to by subsequent messages for the same dictionary id) and replacements (new dictionary for an id).
> > > >
> > > > I don't know what the status of handling dictionaries in the Rust IPC,
[jira] [Created] (ARROW-8389) [Integration] Run tests in parallel
Antoine Pitrou created ARROW-8389:
---

Summary: [Integration] Run tests in parallel
Key: ARROW-8389
URL: https://issues.apache.org/jira/browse/ARROW-8389
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, Integration
Reporter: Antoine Pitrou

This follows ARROW-8176.
Re: [Rust] Dictionary encoding and Flight
hi Paul,

Dictionary-encoded is not a nested type, so there shouldn't be any children -- the IPC layout of a dictionary-encoded field is the same as the type of the indices (we probably want to change the terminology in the Rust library from "keys" to "indices", which is what's used in the specification). And the dictionaries are sent separately in record batches.

> The Rust implementation of DictionaryArray keeps the values as a child to the keys array.

What we do in C++ is that the dictionary is a separate member of the ArrayData data structure. I don't know enough about the Rust library to know whether that's the right design choice or not.

Implementing Rust support for integration tests would make it much easier to validate that things are implemented correctly. Does anyone know how far away the Rust library might be from having them? It's not the most fun thing to implement, but it's very important.

- Wes

On Thu, Apr 9, 2020 at 5:08 PM Paul Dix wrote:
>
> I managed to get something up and running. I ended up creating a dictionary_batch.rs and adding that to convert.rs to translate dictionary fields in a schema over to the correct fb thing. I also added a method to writer.rs to convert that to bytes so it can be sent via IPC. However, when writing out the subsequent record batches that contain a dictionary field that references the dictionary sent in the dictionary batch, it runs into a problem. The Rust implementation of DictionaryArray keeps the values as a child to the keys array.
>
> You can see the chain of calls when calling finish on the StringDictionaryBuilder
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> which calls out to finish the dictionary
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> which then calls `from` to create the DictionaryArray
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> And that last part validates that the child is there.
>
> And here in the writer is where it blindly writes out any children:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
>
> I found that when I sent this out, the Python reader on the other side would choke on it when I tried read_pandas. However, if I skip serializing the children for DictionaryArray only, then everything works exactly as expected.
>
> So I'm not sure what the right thing here is, because I don't know if DictionaryArrays should not include the children (i.e. values) when being written out or if there's some other thing that should be happening. Should the raw arrow data have the values in that dictionary? Or does it get reconstituted manually on the other side from the DictionaryBatch?
>
> On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney wrote:
> >
> > As another item for consideration -- in C++ at least, the dictionary id is dealt with as an internal detail of the IPC message production process. When serializing the Schema, ids are assigned to each dictionary-encoded field in the DictionaryMemo object, see
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> >
> > When record batches are reconstructed, the dictionary corresponding to an id at the time of reconstruction is set in the Array's internal data -- that's the "dictionary" member of the ArrayData object (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> >
> > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney wrote:
> > >
> > > hey Paul,
> > >
> > > Take a look at how dictionaries work in the IPC protocol
> > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > >
> > > Dictionaries are sent as separate messages. When a field is tagged as dictionary encoded in the schema, the IPC reader must keep track of the dictionaries it's seen come across the protocol and then set them in the reconstructed record batches when a record batch comes through.
> > >
> > > Note that the protocol now supports dictionary deltas (dictionaries can be appended to by subsequent messages for the same dictionary id) and replacements (new dictionary for an id).
> > >
> > > I don't know what the status of handling dictionaries in the Rust IPC, but it would be a good idea to take time to take into account the above details.
> > >
> > > Finally, note that Rust is not participating in either the regular IPC nor Flight integration tests. This is an important milestone to being able to depend on the Rust library in production.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix wrote:
> > > >
> > > > Hello, I'm trying to build a Rust based Flight server and I'd like to use Dictionary encoding for a number of string columns in my data. I've seen
Re: [Rust] Dictionary encoding and Flight
I managed to get something up and running. I ended up creating a dictionary_batch.rs and adding that to convert.rs to translate dictionary fields in a schema over to the correct fb thing. I also added a method to writer.rs to convert that to bytes so it can be sent via IPC. However, when writing out the subsequent record batches that contain a dictionary field that references the dictionary sent in the dictionary batch, it runs into a problem. The Rust implementation of DictionaryArray keeps the values as a child to the keys array.

You can see the chain of calls when calling finish on the StringDictionaryBuilder
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
which calls out to finish the dictionary
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
which then calls `from` to create the DictionaryArray
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
And that last part validates that the child is there.

And here in the writer is where it blindly writes out any children:
https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378

I found that when I sent this out, the Python reader on the other side would choke on it when I tried read_pandas. However, if I skip serializing the children for DictionaryArray only, then everything works exactly as expected.

So I'm not sure what the right thing here is, because I don't know if DictionaryArrays should not include the children (i.e. values) when being written out or if there's some other thing that should be happening. Should the raw arrow data have the values in that dictionary? Or does it get reconstituted manually on the other side from the DictionaryBatch?

On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney wrote:
>
> As another item for consideration -- in C++ at least, the dictionary id is dealt with as an internal detail of the IPC message production process. When serializing the Schema, ids are assigned to each dictionary-encoded field in the DictionaryMemo object, see
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
>
> When record batches are reconstructed, the dictionary corresponding to an id at the time of reconstruction is set in the Array's internal data -- that's the "dictionary" member of the ArrayData object (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
>
> On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney wrote:
> >
> > hey Paul,
> >
> > Take a look at how dictionaries work in the IPC protocol
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> >
> > Dictionaries are sent as separate messages. When a field is tagged as dictionary encoded in the schema, the IPC reader must keep track of the dictionaries it's seen come across the protocol and then set them in the reconstructed record batches when a record batch comes through.
> >
> > Note that the protocol now supports dictionary deltas (dictionaries can be appended to by subsequent messages for the same dictionary id) and replacements (new dictionary for an id).
> >
> > I don't know what the status of handling dictionaries in the Rust IPC, but it would be a good idea to take time to take into account the above details.
> >
> > Finally, note that Rust is not participating in either the regular IPC nor Flight integration tests. This is an important milestone to being able to depend on the Rust library in production.
> >
> > Thanks
> > Wes
> >
> > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix wrote:
> > >
> > > Hello, I'm trying to build a Rust based Flight server and I'd like to use Dictionary encoding for a number of string columns in my data. I've seen that StringDictionary was recently added to Rust here:
> > > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > >
> > > However, that doesn't seem to reach down into Flight. When I attempt to send a schema through Flight that has a Dictionary it throws an error when attempting to convert from the Rust type to the Flatbuffer field type. I figured I'd take a swing at adding that to convert.rs here:
> > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > >
> > > However, when I look at the definitions in Schema.fbs and the related generated Rust file, Dictionary isn't a type there. Should I be sending this down as some other composed type? And if so, how does this look at the client side of things? In my test I'm connecting to the Flight server via PyArrow and working with it in Pandas, so I'm hoping that it will be able to consume Dictionary fields.
> > >
> > > Separately, the Rust field type doesn't have a spot for the dictionary ID, which I as
[jira] [Created] (ARROW-8388) [C++] GCC 4.8 fails to move on return
Ben Kietzman created ARROW-8388:
---

Summary: [C++] GCC 4.8 fails to move on return
Key: ARROW-8388
URL: https://issues.apache.org/jira/browse/ARROW-8388
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 0.17.0

See https://github.com/apache/arrow/pull/6883#issuecomment-611661733

This is a recurring problem which usually shows up as a broken nightly (the gandiva nightly jobs, specifically), along with similar issues due to GCC 4.8's incomplete handling of C++11. As long as someone depends on these compilers, we should probably have an every-commit CI job which checks that we haven't introduced such a breakage.
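For context, a minimal sketch of the kind of code this class of bug involves (illustrative, not the code from the linked PR). When a function returns an object whose type must be converted on return, modern compilers apply an implicit move, but GCC 4.8 predates that rule (CWG 1579) and attempts the deleted copy constructor instead, so an explicit `std::move` is the usual workaround:

```cpp
#include <memory>
#include <utility>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};

// Returning a unique_ptr<Derived> where unique_ptr<Base> is expected.
// With `return d;` modern compilers move implicitly; GCC 4.8 would try
// to copy the (move-only) unique_ptr and fail to compile. The explicit
// std::move below compiles everywhere.
std::unique_ptr<Base> MakeBase() {
  std::unique_ptr<Derived> d(new Derived());
  return std::move(d);
}
```

This is why the breakage is easy to introduce silently: the code is correct C++ on every compiler the contributors use, and only the old GCC 4.8 jobs catch it.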
[jira] [Created] (ARROW-8387) [rust] Make schema_to_fb public because it is very useful!
Max Burke created ARROW-8387:
---

Summary: [rust] Make schema_to_fb public because it is very useful!
Key: ARROW-8387
URL: https://issues.apache.org/jira/browse/ARROW-8387
Project: Apache Arrow
Issue Type: Improvement
Reporter: Max Burke

Make schema_to_fb public because it is very useful!
Re: [Python] black vs. autopep8
> > So autopep8 doesn't fix everything? Sounds inferior to me. That said, I'm in favor of any resolution that increases our automation of this and decreases the energy we expend debating it.
>
> It does fix everything, where "everything" is compliance with PEP8, which I think is the thing we are most interested in.

Note that autopep8 does not fix everything, even for PEP8 compliance. At least, it often cannot automatically fix "line too long" errors for code lines (that was actually the trigger for me to start the thread about adopting black).
Java: DefaultVectorComparators - invalid implementation
I took a first look at the Apache Arrow Java sources.

I found an incorrect implementation in DefaultVectorComparators.LongComparator, and I suspect that other comparators may be wrong too.

A simple test:

long l1 = Long.MIN_VALUE + 1L;
long l2 = Long.MAX_VALUE;

System.out.println("Arrow: " + Long.signum(l1 - l2));
System.out.println("Java : " + Long.compare(l1, l2));

Result:

Arrow: 1
Java : -1

I think tests should be added for these corner cases. And I suggest replacing the Arrow implementation with Long.compare, Integer.compare, etc., which produce correct results.

Thank you
Martin

PS I have no access to JIRA to report this issue.
Re: [Python] black vs. autopep8
+1 for autopep8

On Thu, Apr 9, 2020 at 4:45 PM Wes McKinney wrote:
>
> So to summarize, it seems that what we are agreeing in this thread is to not debate readability about otherwise PEP8-compliant Python code in code reviews, is that right? Absent a consensus about a change, the outcome is basically "no change". The addition of autopep8 as a tool for automated PEP8 fixes (as opposed to manual ones, which is what's required right now) seems reasonable in that it does not make invasive changes to the current codebase.
>
> We can also abandon this discussion and not use any tool. Either way it would be good to reach a conclusion and move on.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 3:05 AM Uwe L. Korn wrote:
> >
> > The non-configurability of black is one of the strongest arguments I see for black. The code style will always be subjective. From previous discussions I know that my personal preference of readability conflicts with that of Antoine and Wes, as it probably will with others'. We have the same issue with using the Google C++ style guide in C++. Not everyone agrees with it, but at least it is a guide that is adopted widely (and thus seen in other projects quite often) and has well-outlined reasonings for most of its choices.
> >
> > On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote:
> > > Another option that we can look into is yapf (https://github.com/google/yapf). It is similar to black but more tweakable. Also, it was recently adopted by the Apache Beam project. PR is here:
> > > https://github.com/apache/beam/pull/10684/files
> > >
> > > Bin
> > >
> > > On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney wrote:
> > > >
> > > > I don't think it's possible, unfortunately. From the README: "Black reformats entire files in place. It is not configurable."
> > > >
> > > > The main concern about Black is the impact that it has on readability. I share this concern, as the subjective style choices it makes are quite different from the way I've been writing Python code the last 12 years. That doesn't mean I'm right and it's wrong, of course. In this project, it doesn't seem like we're losing much energy to people reformatting argument lists or adding line breaks here and there, etc. FWIW, from reading the Black docs, it doesn't strike me that the tool is designed to maximize readability, but rather to make formatting deterministic and reduce code diffs caused by reformatting.
> > > >
> > > > My opinion is that we should simply avoid debating code style in code reviews as long as the code passes the PEP8 checks. Employing autopep8 is an improvement over the status quo (which is that developers must fix flake8 / PEP8 warnings by hand). We can always revisit the Black discussion in the future if we find that subjective code formatting (beyond PEP8 compliance) is taking up our energy inappropriately.
> > > >
> > > > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc wrote:
> > > > >
> > > > > Could we 'tone down' black to get the desired behavior? I'm ok with either tool.
> > > > >
> > > > > Rok
> > > > >
> > > > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney wrote:
> > > > > >
> > > > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson wrote:
> > > > > > >
> > > > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That said, I'm in favor of any resolution that increases our automation of this and decreases the energy we expend debating it.
> > > > > >
> > > > > > It does fix everything, where "everything" is compliance with PEP8, which I think is the thing we are most interested in.
> > > > > >
> > > > > > Black makes a bunch of other arbitrary (albeit consistent) reformattings that don't affect PEP8 compliance.
> > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Circling back on this, it seems there isn't consensus about switching to Black, and using autopep8 at least will give us an easy way to maintain PEP8 compliance and help contributors fix linting failures detected by flake8 (but not all, e.g. unused imports would need to be manually removed). Would everyone be on board with using autopep8?
> > > > > > > >
> > > > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I'm personally fine with the Black changes. After the one-time cost of reformatting the codebase, it will take any personal preferences out of code formatting (I admit that I have several myself, but I don't mi
[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays
Bryan Cutler created ARROW-8386: --- Summary: [Python] pyarrow.jvm raises error for empty Arrays Key: ARROW-8386 URL: https://issues.apache.org/jira/browse/ARROW-8386 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Bryan Cutler Assignee: Bryan Cutler In the pyarrow.jvm module, when there is an empty array in Java, trying to create it in Python raises a ValueError. This is because for an empty array, Java returns an empty list of buffers, and pyarrow.jvm then attempts to create the array by calling pa.Array.from_buffers with that empty list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
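A minimal sketch of one possible fix (the helper name and padding strategy here are assumptions for illustration, not the actual patch): pad the buffer list Java returns out to the count the array's layout expects before handing it to pa.Array.from_buffers, using None for absent buffers.

```python
# Hypothetical helper (name and logic assumed, not pyarrow's actual fix):
# when Java hands back no buffers for an empty array, substitute a list of
# the length the Arrow layout still requires, so from_buffers is never
# called with an empty list. None stands in for an absent buffer.
def buffers_for_empty_array(jvm_buffers, expected_buffer_count):
    """Return a buffer list of the length the array's layout requires."""
    if jvm_buffers:
        return jvm_buffers
    return [None] * expected_buffer_count

# A primitive array has 2 buffers (validity + data); Java gave us none.
assert buffers_for_empty_array([], 2) == [None, None]
# Non-empty buffer lists pass through untouched.
assert buffers_for_empty_array(["validity", "data"], 2) == ["validity", "data"]
```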
[jira] [Created] (ARROW-8385) Crash on parquet.read_table on Windows python 3.8.2
Geoff Quested-Joens created ARROW-8385: -- Summary: Crash on parquet.read_table on Windows python 3.8.2 Key: ARROW-8385 URL: https://issues.apache.org/jira/browse/ARROW-8385 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: Windows 10, python 3.8.2, pip 20.0.2; pip freeze -> numpy==1.18.2 pandas==1.0.3 pyarrow==0.16.0 python-dateutil==2.8.1 pytz==2019.3 six==1.14.0 Reporter: Geoff Quested-Joens Attachments: crash.parquet On reading a parquet file with pyarrow, the program spontaneously exits with no exception thrown, on Windows only. Testing the same setup on Linux (Debian 10 in Docker), the same parquet file reads without issue. The following reproduces the crash in a python 3.8.2 environment (env listed above, essentially pip install pandas and pyarrow):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_pandas_write_read():
    df_out = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    df_out.to_parquet("crash.parquet")
    df_in = pd.read_parquet("crash.parquet")
    print(df_in)

def test_arrow_write_read():
    df = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    table_out = pa.Table.from_pandas(df)
    pq.write_table(table_out, 'crash.parquet')
    table_in = pq.read_table('crash.parquet')
    print(table_in)

if __name__ == "__main__":
    test_pandas_write_read()
    test_arrow_write_read()
{code}

The interpreter never reaches the print statements, crashing somewhere in the call on line 252 of {{parquet.py}}; no error is thrown, just spontaneous program exit:

{code:python}
self.reader.read_all(...
{code}

In contrast, running the same code and Python environment on Debian 10, there is no error reading the parquet files generated by the same Windows code. The sha256 checksums compare equal for the crash.parquet generated on Debian and on Windows, so something appears to be wrong with the read. Attached is the crash.parquet file generated on my machine. 
Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on Windows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8384) [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket
Wes McKinney created ARROW-8384: --- Summary: [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket Key: ARROW-8384 URL: https://issues.apache.org/jira/browse/ARROW-8384 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney This feature seems to have been dropped. Is there a plan for migrating users to the new filesystem API? We have two different code paths now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [Python] black vs. autopep8
So to summarize, it seems that what we are agreeing in this thread is to not debate readability about otherwise PEP8-compliant Python code in code reviews, is that right? Absent a consensus about a change, the outcome is basically "no change". The addition of autopep8 as a tool for automated PEP8 fixes (as opposed to manual ones, which is what's required right now) seems reasonable in that it does not make invasive changes to the current codebase. We can also abandon this discussion and not use any tool. Either way it would be good to reach a conclusion and move on. - Wes On Thu, Apr 9, 2020 at 3:05 AM Uwe L. Korn wrote: > > The non-configurability of black is one of the strongest arguments I see for > black. The codestyle will always be subjective. From previous discussions I > know that my personal preference of readability conflicts with that of > Antoine and Wes, so will probably others. We have the same issue with using > the Google C++ style guide in C++. Not everyone agrees with it but at least > it is a guide that is adopted widely (and thus seen in other projects quite > often) and has well outlined reasonings for most of its choices. > > On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote: > > Another option that we can look into is yapf ( > > https://github.com/google/yapf). It is similar to black but more tweakable. > > Also, it is recently adopted by the Apache Beam project. PR is here > > https://github.com/apache/beam/pull/10684/files > > > > Bin > > > > On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney wrote: > > > > > I don't think it's possible unfortunately. From the README: "Black > > > reformats entire files in place. It is not configurable." > > > > > > The main concern about Black is the impact that it has on readability. > > > I share this concern as the subjective style choices it makes are > > > quite different from the way I've been writing Python code the last 12 > > > years. That doesn't mean I'm right and it's wrong, of course. 
In this > > > project, it doesn't seem like we're losing much energy to people > > > reformatting arguments lists or adding line breaks here and there, > > > etc. FWIW, from reading the Black docs, it doesn't strike me that the > > > tool is designed to maximize readability, but rather to make > > > formatting deterministic and reduce code diffs caused by reformatting. > > > > > > My opinion is that we should simply avoid debating code style in code > > > reviews as long as the code passes the PEP8 checks. Employing autopep8 > > > is an improvement over the status quo (which is that developers must > > > fix flake8 / PEP8 warnings by hand). We can always revisit the Black > > > discussion in the future if we find that subjective code formatting > > > (beyond PEP8 compliance) is taking up our energy inappropriately. > > > > > > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc wrote: > > > > > > > > Could we 'tone down' black to get the desired behavior? > > > > I'm ok with either tool. > > > > > > > > Rok > > > > > > > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney wrote: > > > > > > > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson > > > > > wrote: > > > > > > > > > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That > > > said, I'm > > > > > > in favor of any resolution that increases our automation of this and > > > > > > decreases the energy we expend debating it. > > > > > > > > > > It does fix everything, where "everything" is compliance with PEP8, > > > > > which I think is the thing we are most interested in. > > > > > > > > > > Black makes a bunch of other arbitrary (albeit consistent) > > > > > reformattings that don't affect PEP8 compliance. 
> > > > > > > > > > > Neal > > > > > > > > > > > > > > > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney > > > > > wrote: > > > > > > > > > > > > > Circling back on this, it seems there isn't consensus about > > > switching > > > > > > > to Black, and using autopep8 at least will give us an easy way to > > > > > > > maintain PEP8 compliance and help contributors fix linting > > > > > > > failures > > > > > > > detected by flake8 (but not all, e.g. unused imports would need to > > > be > > > > > > > manually removed). Would everyone be on board with using autopep8? > > > > > > > > > > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney > > > > > wrote: > > > > > > > > > > > > > > > > I'm personally fine with the Black changes. After the one-time > > > cost > > > > > of > > > > > > > > reformatting the codebase, it will take any personal preferences > > > out > > > > > > > > of code formatting (I admit that I have several myself, but I > > > don't > > > > > > > > mind the normalization provided by Black). I hope that Cython > > > support > > > > > > > > comes soon since a great deal of our code is Cython > > > > > > > > > > > > > > > > On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka < > > > > > jacek.plis...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi! > >
[jira] [Created] (ARROW-8383) [RUST] Easier random access to DictionaryArray keys and values
Jörn Horstmann created ARROW-8383: - Summary: [RUST] Easier random access to DictionaryArray keys and values Key: ARROW-8383 URL: https://issues.apache.org/jira/browse/ARROW-8383 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jörn Horstmann Currently it's not that clear how to access DictionaryArray keys and values using random indices. The `DictionaryArray::keys` method exposes an Iterator with an `nth` method, but this requires a mut reference and feels a little bit out of place compared to other methods of accessing arrow data. Another alternative seems to be to use the `From for PrimitiveArray` conversion, like so: `let keys: Int16Array = dictionary_array.data().into()`. This seems to work fine but is not easily discoverable and also needs to be done outside of any loops for performance reasons. I'd like methods on `DictionaryArray` to directly get the key at some index:

```
pub fn key(&self, i: usize) -> &K
```

Ideally I'd also like an easier way to directly access values at some index, at least when those are primitive or string types:

```
pub fn value(&self, i: usize) -> &T
```

I'm not sure how or if that would be possible to implement with Rust generics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
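The requested accessors amount to the following semantics, shown here as a pure-Python model rather than the proposed Rust code (class and field names are invented for illustration):

```python
# Toy model of a dictionary-encoded array: key(i) returns the dictionary
# index stored at position i, and value(i) follows that index into the
# dictionary of distinct values -- no mutable iterator needed.
class DictionaryArrayModel:
    def __init__(self, keys, values):
        self.keys = keys      # e.g. the Int16 index array
        self.values = values  # the dictionary of distinct values

    def key(self, i):
        return self.keys[i]

    def value(self, i):
        return self.values[self.keys[i]]

arr = DictionaryArrayModel(keys=[0, 1, 0, 2], values=["a", "b", "c"])
assert arr.key(3) == 2
assert arr.value(3) == "c"
# Decoding the whole array is then a simple random-access loop.
assert [arr.value(i) for i in range(4)] == ["a", "b", "a", "c"]
```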
[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes
Francois Saint-Jacques created ARROW-8382: - Summary: [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes Key: ARROW-8382 URL: https://issues.apache.org/jira/browse/ARROW-8382 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques WritePlan should look like the following.

{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result<FileSystemDataset> Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;
  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;

  using SourceAndReader = std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;
  std::vector<SourceAndReader> outputs;
};
{code}

* Refactor FileFormat::Write(FileSource destination, RecordBatchReader); not sure if it should take the output schema, or whether the RecordBatchReader should already be of the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, Partitioning and a base path, and remove any Write/Fragment logic from partition.cc.
* Move Write() out of FileSystemDataset into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it shouldn't be more complex than `for task in write_tasks: task()`. No path construction should happen there.

The effects are:
* Simplified WritePlan execution, abstracted away from path construction; it can write to multiple FileSystems and/or Buffers since it doesn't construct the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragments; it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan-related classes.
* Writing can be done with or without partitioning; this logic is given to whoever generates the SourceAndReader list.
* Should be simpler to test. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [C++] Compute: Datum and "ChunkedArray&" inputs
On Thu, Apr 9, 2020, 5:25 AM Antoine Pitrou wrote: > > It seems there are two different concerns here: > - the kernels' public API > - the ease with which kernels can be implemented. > > If we need a different public API, then IMO we should change it sooner > rather than later. > Yes, based on the problems I've listed I think some significant time needs to be invested in this after 0.17.0 is released. I can allocate some of my own time to it. As for the implementation, perhaps we should start by drawing > the main categories of kernels. For example: > - item-wise kernels (and/or vector-wise) > - aggregate kernels > > Some kernels will not fall in these categories (e.g. filter, join...), > but many of them do. > Yes, it's certainly important to distinguish between "operators" (which are data flow processing nodes like joins, aggregations, filters, projections) and functions / kernels (which perform granular units of work on concrete batches of data). Some operators can be used in "one-shot" mode (e.g. obtaining a result from a static unit of data) though. Regards > > Antoine. > > > On Wed, 8 Apr 2020 16:04:30 -0500 > Wes McKinney wrote: > > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney wrote: > > > > > > Another idea would be to have a variant with const-references instead > > > of shared_ptr. One potential issue with our Datum is that it plays the > > > dual role of transporting both input and output arguments. With > > > outputs it's necessary to be able to convey ownership while with > > > inputs this is less important. > > > > > > In general, I would say that some additional thought is needed about > > > our kernel APIs. Some scattered thoughts about this: > > > > > > * Kernel "binding" (selecting a kernel given input types) is a bit ad > > > hoc. 
Many kernels do not have the ability to tell you up front what > > > type of data will be returned (this is very important in the context > > > of a query engine runtime) > > > > Another observation is that many of our kernels are unable to write > > into preallocated memory. This is also a design issue that must be > > addressed (kernels having the same output type invoked in succession > > should be able to elide temporary allocations by writing into the same > > output memory) > > > > > * It's unclear that having a large collection of > > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable. > > > > > > Rather, it might be better to have something like: > > > > > > // We know the "schema" of each argument but don't have any data yet > > > std::string kernel_name = "$name"; > > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, > options); > > > auto out = bound_kernel->Call({arg0, arg1}); > > > > > > So here it would be > > > > > > ARROW_ASSIGN_OR_RAISE(auto take_kernel, > > > GetKernel("take", {shapes::chunked_array(int8()), > > > > shapes::chunked_array(int8())})); > > > > > > So now take_kernel->shape() should tell you that the expected output > > > is shapes::chunked_array(int8()) > > > > > > And Call should preferably accept reference arguments for its input > parameters. > > > > > > As far as kernel-specific options, we could create a variant that > > > includes the different kernel-specific options structures, so that > > > it's not necessary to have different dispatch functions for different > > > kernels. > > > > > > Anyway, just some out loud thoughts, but while we are in this > > > intermediate experimental state I'm supportive of you making the > > > decision that is most convenient for the immediate problem you want to > > > solve with the caveat that things may be refactored significantly in > > > the proximate future. > > > > > > - Wes > > > > > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. 
Korn > wrote: > > > > > > > > I did a bit more research on JIRA and we seem to have this open > topic there also in https://issues.apache.org/jira/browse/ARROW-6959 > which is the similar topic as my mail is about and in > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some > of the interfaces with reference-types. > > > > > > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote: > > > > > Hello all, > > > > > > > > > > I'm in the progress of changing the implementation of the Take > kernel > > > > > to work on ChunkedArrays without concatenating them into a single > Array > > > > > first. While working on the implementation, I realised that we > switch > > > > > often between Datum and the specific-typed parameters. This works > quite > > > > > well for the combination of Array& and > Datum(shared_ptr) as > > > > > here the reference object with type Array& always carries a shared > > > > > reference with it, so switching between Array& and its Datum is > quite > > > > > easy. > > > > > > > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum > > > > > requires a shared_ptr which cannot be constructed > from > > > > > the reference type. Thus to allow
[jira] [Created] (ARROW-8381) [C++][Dataset] Dataset writing should require a writer schema
Francois Saint-Jacques created ARROW-8381: - Summary: [C++][Dataset] Dataset writing should require a writer schema Key: ARROW-8381 URL: https://issues.apache.org/jira/browse/ARROW-8381 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Francois Saint-Jacques # Dataset writing should always take an explicit writer schema instead of the first fragment's schema. # The MakeWritePlanImpl should not try removing columns that are found in the partition, this is left to the caller by passing an explicit schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [C++] Compute: Datum and "ChunkedArray&" inputs
It seems there are two different concerns here: - the kernels' public API - the ease with which kernels can be implemented. If we need a different public API, then IMO we should change it sooner rather than later. As for the implementation, perhaps we should start by drawing the main categories of kernels. For example: - item-wise kernels (and/or vector-wise) - aggregate kernels Some kernels will not fall in these categories (e.g. filter, join...), but many of them do. Regards Antoine. On Wed, 8 Apr 2020 16:04:30 -0500 Wes McKinney wrote: > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney wrote: > > > > Another idea would be to have a variant with const-references instead > > of shared_ptr. One potential issue with our Datum is that it plays the > > dual role of transporting both input and output arguments. With > > outputs it's necessary to be able to convey ownership while with > > inputs this is less important. > > > > In general, I would say that some additional thought is needed about > > our kernel APIs. Some scattered thoughts about this: > > > > * Kernel "binding" (selecting a kernel given input types) is a bit ad > > hoc. Many kernels do not have the ability to tell you up front what > > type of data will be returned (this is very important in the context > > of a query engine runtime) > > Another observation is that many of our kernels are unable to write > into preallocated memory. This is also a design issue that must be > addressed (kernels having the same output type invoked in succession > should be able to elide temporary allocations by writing into the same > output memory) > > > * It's unclear that having a large collection of > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable. 
> > > > Rather, it might be better to have something like: > > > > // We know the "schema" of each argument but don't have any data yet > > std::string kernel_name = "$name"; > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, > > options); > > auto out = bound_kernel->Call({arg0, arg1}); > > > > So here it would be > > > > ARROW_ASSIGN_OR_RAISE(auto take_kernel, > > GetKernel("take", {shapes::chunked_array(int8()), > > shapes::chunked_array(int8())})); > > > > So now take_kernel->shape() should tell you that the expected output > > is shapes::chunked_array(int8()) > > > > And Call should preferably accept reference arguments for its input > > parameters. > > > > As far as kernel-specific options, we could create a variant that > > includes the different kernel-specific options structures, so that > > it's not necessary to have different dispatch functions for different > > kernels. > > > > Anyway, just some out loud thoughts, but while we are in this > > intermediate experimental state I'm supportive of you making the > > decision that is most convenient for the immediate problem you want to > > solve with the caveat that things may be refactored significantly in > > the proximate future. > > > > - Wes > > > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn wrote: > > > > > > I did a bit more research on JIRA and we seem to have this open topic > > > there also in https://issues.apache.org/jira/browse/ARROW-6959 which is > > > the similar topic as my mail is about and in > > > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some > > > of the interfaces with reference-types. > > > > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote: > > > > Hello all, > > > > > > > > I'm in the progress of changing the implementation of the Take kernel > > > > to work on ChunkedArrays without concatenating them into a single Array > > > > first. 
While working on the implementation, I realised that we switch > > > > often between Datum and the specific-typed parameters. This works quite > > > > well for the combination of Array& and Datum(shared_ptr<Array>) as > > > > here the reference object with type Array& always carries a shared > > > > reference with it, so switching between Array& and its Datum is quite > > > > easy. > > > > > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed from > > > > the reference type. Thus to allow interfaces like `Status > > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array& > > > > indices,` to pass successfully their arguments to the Kernel > > > > implementation, we have to do: > > > > > > > > a) Remove the references from the interface of the Take() function and > > > > use `shared_ptr` instances everywhere. > > > > b) Add interfaces to kernels like the TakeKernel that allow calling > > > > with specific references instead of Datum instances > > > > > > > > Personally I would prefer b) as this allows us to make more use of the > > > > C++ type system and would also avoid the shared_ptr overhead where not > > > > necessary. > > > > > > > > Cheers, > > > > Uw
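The two-phase GetKernel/Call design sketched in the quoted thread can be modeled in a few lines of plain Python (all names here are invented for illustration; the real API would be C++):

```python
# Toy model (invented names, not Arrow's actual API) of the bind-then-call
# kernel design: get_kernel resolves the output shape from the input shapes
# up front, so a query planner knows result types before any data flows;
# call then executes on concrete arguments.
class BoundKernel:
    def __init__(self, impl, out_shape):
        self._impl = impl
        self._out_shape = out_shape

    def shape(self):
        # Known at bind time, before any data is seen.
        return self._out_shape

    def call(self, args):
        return self._impl(*args)

# Registry: kernel name -> (output-shape rule, implementation)
KERNELS = {
    # 'take' yields the values' shape, whatever the index shape is
    "take": (lambda shapes: shapes[0],
             lambda values, indices: [values[i] for i in indices]),
}

def get_kernel(name, arg_shapes):
    shape_rule, impl = KERNELS[name]
    return BoundKernel(impl, shape_rule(arg_shapes))

k = get_kernel("take", ["chunked_array<int8>", "chunked_array<int8>"])
assert k.shape() == "chunked_array<int8>"          # output type known up front
assert k.call(([10, 20, 30], [2, 0])) == [30, 10]  # data supplied only at call time
```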
[jira] [Created] (ARROW-8380) [RUST] StringDictionaryBuilder not publicly exported from arrow::array
Jörn Horstmann created ARROW-8380: - Summary: [RUST] StringDictionaryBuilder not publicly exported from arrow::array Key: ARROW-8380 URL: https://issues.apache.org/jira/browse/ARROW-8380 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jörn Horstmann -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [Python] black vs. autopep8
The non-configurability of black is one of the strongest arguments I see for black. The codestyle will always be subjective. From previous discussions I know that my personal preference of readability conflicts with that of Antoine and Wes, so will probably others. We have the same issue with using the Google C++ style guide in C++. Not everyone agrees with it but at least it is a guide that is adopted widely (and thus seen in other projects quite often) and has well outlined reasonings for most of its choices. On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote: > Another option that we can look into is yapf ( > https://github.com/google/yapf). It is similar to black but more tweakable. > Also, it is recently adopted by the Apache Beam project. PR is here > https://github.com/apache/beam/pull/10684/files > > Bin > > On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney wrote: > > > I don't think it's possible unfortunately. From the README: "Black > > reformats entire files in place. It is not configurable." > > > > The main concern about Black is the impact that it has on readability. > > I share this concern as the subjective style choices it makes are > > quite different from the way I've been writing Python code the last 12 > > years. That doesn't mean I'm right and it's wrong, of course. In this > > project, it doesn't seem like we're losing much energy to people > > reformatting arguments lists or adding line breaks here and there, > > etc. FWIW, from reading the Black docs, it doesn't strike me that the > > tool is designed to maximize readability, but rather to make > > formatting deterministic and reduce code diffs caused by reformatting. > > > > My opinion is that we should simply avoid debating code style in code > > reviews as long as the code passes the PEP8 checks. Employing autopep8 > > is an improvement over the status quo (which is that developers must > > fix flake8 / PEP8 warnings by hand). 
We can always revisit the Black > > discussion in the future if we find that subjective code formatting > > (beyond PEP8 compliance) is taking up our energy inappropriately. > > > > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc wrote: > > > > > > Could we 'tone down' black to get the desired behavior? > > > I'm ok with either tool. > > > > > > Rok > > > > > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney wrote: > > > > > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson > > > > wrote: > > > > > > > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That > > said, I'm > > > > > in favor of any resolution that increases our automation of this and > > > > > decreases the energy we expend debating it. > > > > > > > > It does fix everything, where "everything" is compliance with PEP8, > > > > which I think is the thing we are most interested in. > > > > > > > > Black makes a bunch of other arbitrary (albeit consistent) > > > > reformattings that don't affect PEP8 compliance. > > > > > > > > > Neal > > > > > > > > > > > > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney > > > > wrote: > > > > > > > > > > > Circling back on this, it seems there isn't consensus about > > switching > > > > > > to Black, and using autopep8 at least will give us an easy way to > > > > > > maintain PEP8 compliance and help contributors fix linting failures > > > > > > detected by flake8 (but not all, e.g. unused imports would need to > > be > > > > > > manually removed). Would everyone be on board with using autopep8? > > > > > > > > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney > > > > wrote: > > > > > > > > > > > > > > I'm personally fine with the Black changes. After the one-time > > cost > > > > of > > > > > > > reformatting the codebase, it will take any personal preferences > > out > > > > > > > of code formatting (I admit that I have several myself, but I > > don't > > > > > > > mind the normalization provided by Black). 
I hope that Cython > > support > > > > > > > comes soon since a great deal of our code is Cython > > > > > > > > > > > > > > On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka < > > > > jacek.plis...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > Hi! > > > > > > > > > > > > > > > > I believe amount of changes is not that important. > > > > > > > > > > > > > > > > In my opinion, what matters is which format will allow > > reviewers > > > > to be > > > > > > > > more efficient. > > > > > > > > > > > > > > > > The committer can always reformat as they like. It is harder > > for > > > > the > > > > > > reviewer. > > > > > > > > > > > > > > > > BR, > > > > > > > > > > > > > > > > Jacek > > > > > > > > > > > > > > > > czw., 2 kwi 2020 o 15:32 Antoine Pitrou > > > > > > napisał(a): > > > > > > > > > > > > > > > > > > > > > > > > > > > PS: in both cases, Cython files are not processed. autopep8 > > is > > > > > > actually > > > > > > > > > able to process them, but the comparison wouldn't be > > > > > > apples-to-apples. > > > > > > > > > > > > > > > > > > (that said, autopep8