Re: Dense unions: monotonic or strictly monotonic offsets?
This is an interesting question. IMO, to support repeated values, we also need to design a "coherency protocol" to avoid the scenario where, once a value is written, the change is unexpectedly propagated to another slot. Best, Liya Fan On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield wrote: > Hmm, I also thought the intention was monotonically increasing. I can't > think of a strong reason one way or another. If the argument about code to > do random access is the same in all cases, is there any benefit to forcing > any order at all? Memory prefetching? > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney wrote: > > > hi Antoine, > > > > It's a good question. > > > > The intent when we wrote the specification was to be strictly > > monotonic, but there seems nothing especially harmful about relaxing > > the constraint to allow for repeated values or even non-monotonicity > > (strict or otherwise). For example, if we had the union > > > > ['a', 'a', 'a', 0, 1, 'b', 'b'] > > > > then this could be represented as > > > > type_ids: [0, 0, 0, 1, 1, 0, 0] > > offsets: [0, 0, 0, 0, 1, 1, 1] > > child[0]: ['a', 'b'] > > child[1]: [0, 1] > > > > or > > > > type_ids: [0, 0, 0, 1, 1, 0, 0] > > offsets: [1, 1, 1, 0, 1, 0, 0] > > child[0]: ['b', 'a'] > > child[1]: [0, 1] > > > > What do others think? Either way some clarification in the > > specification would be useful. Because the code used to do random > > access is the same in all cases, I feel weakly supportive of removing > > constraints on the offsets. > > > > - Wes > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou > wrote: > > > > > > > > > Hello, > > > > > > I'd like some clarification on the spec and intent for dense arrays. > > > > > > Currently, it is specified that offsets of a dense union are "in order > / > > > increasing" (*). However, it is not obvious whether repeated values > are > > > allowed or not. 
> > > > > > I suspect the intent is to avoid having people exploit unions as some > > > kind of poor man's dictionaries. Also, perhaps some optimizations are > > > possible if monotonic or strictly monotonic indices are assumed? But I > > > don't know the history behind the union type. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union > > >
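To make the aliasing concern in this reply concrete, here is a minimal sketch using plain Python lists rather than the Arrow API. The helper names `get` and `set_in_place` are hypothetical, and the sketch assumes type ids coincide with child indices. When several slots share one offset, a naive in-place write through one slot is visible through all of them.

```python
# Plain-Python model of a dense union with repeated offsets; not the Arrow API.
# Assumption: type ids are used directly as child indices.
type_ids = [0, 0, 0, 1, 1, 0, 0]
offsets = [0, 0, 0, 0, 1, 1, 1]
children = [["a", "b"], [0, 1]]


def get(i):
    """Read logical slot i through its type id and offset."""
    return children[type_ids[i]][offsets[i]]


def set_in_place(i, value):
    """Naive write: overwrite the child entry that slot i points at."""
    children[type_ids[i]][offsets[i]] = value


before = [get(i) for i in range(len(type_ids))]
set_in_place(0, "z")  # intended to change slot 0 only
after = [get(i) for i in range(len(type_ids))]

print(before)
print(after)
```

Slots 1 and 2 change together with slot 0 because all three point at the same child entry, which is the kind of unexpected propagation a coherency protocol would have to rule out.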
[jira] [Created] (ARROW-7240) [C++] Add Result to APIs to arrow/util
Micah Kornfield created ARROW-7240: -- Summary: [C++] Add Result to APIs to arrow/util Key: ARROW-7240 URL: https://issues.apache.org/jira/browse/ARROW-7240 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7239) [C++] Add Result to APIs to plasma
Micah Kornfield created ARROW-7239: -- Summary: [C++] Add Result to APIs to plasma Key: ARROW-7239 URL: https://issues.apache.org/jira/browse/ARROW-7239 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7237) [C++] Add Result to APIs to arrow/json
Micah Kornfield created ARROW-7237: -- Summary: [C++] Add Result to APIs to arrow/json Key: ARROW-7237 URL: https://issues.apache.org/jira/browse/ARROW-7237 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7238) [C++] Add Result to APIs to arrow/adapters
Micah Kornfield created ARROW-7238: -- Summary: [C++] Add Result to APIs to arrow/adapters Key: ARROW-7238 URL: https://issues.apache.org/jira/browse/ARROW-7238 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7236) [C++] Add Result to APIs to arrow/csv
Micah Kornfield created ARROW-7236: -- Summary: [C++] Add Result to APIs to arrow/csv Key: ARROW-7236 URL: https://issues.apache.org/jira/browse/ARROW-7236 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7235) [C++] Add Result to APIs to arrow/io
Micah Kornfield created ARROW-7235: -- Summary: [C++] Add Result to APIs to arrow/io Key: ARROW-7235 URL: https://issues.apache.org/jira/browse/ARROW-7235 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7234) [C++] Add Result to APIs to Gandiva
Micah Kornfield created ARROW-7234: -- Summary: [C++] Add Result to APIs to Gandiva Key: ARROW-7234 URL: https://issues.apache.org/jira/browse/ARROW-7234 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Buffers, Array builders (anything in the parent directory src/arrow root directory) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7232) [C++] Add Result to APIs to core vector structures
Micah Kornfield created ARROW-7232: -- Summary: [C++] Add Result to APIs to core vector structures Key: ARROW-7232 URL: https://issues.apache.org/jira/browse/ARROW-7232 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Buffers, Array builders (anything in the parent directory src/arrow root directory) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7233) [C++] Add Result APIs to IPC module
Micah Kornfield created ARROW-7233: -- Summary: [C++] Add Result APIs to IPC module Key: ARROW-7233 URL: https://issues.apache.org/jira/browse/ARROW-7233 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Buffers, Array builders (anything in the parent directory src/arrow root directory) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7231) [C++] Parent bug for tracking migration to Result
Micah Kornfield created ARROW-7231: -- Summary: [C++] Parent bug for tracking migration to Result Key: ARROW-7231 URL: https://issues.apache.org/jira/browse/ARROW-7231 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Micah Kornfield -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Dense unions: monotonic or strictly monotonic offsets?
Hmm, I also thought the intention was monotonically increasing. I can't think of a strong reason one way or another. If the argument about code to do random access is the same in all cases, is there any benefit to forcing any order at all? Memory prefetching? On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney wrote: > hi Antoine, > > It's a good question. > > The intent when we wrote the specification was to be strictly > monotonic, but there seems nothing especially harmful about relaxing > the constraint to allow for repeated values or even non-monotonicity > (strict or otherwise). For example, if we had the union > > ['a', 'a', 'a', 0, 1, 'b', 'b'] > > then this could be represented as > > type_ids: [0, 0, 0, 1, 1, 0, 0] > offsets: [0, 0, 0, 0, 1, 1, 1] > child[0]: ['a', 'b'] > child[1]: [0, 1] > > or > > type_ids: [0, 0, 0, 1, 1, 0, 0] > offsets: [1, 1, 1, 0, 1, 0, 0] > child[0]: ['b', 'a'] > child[1]: [0, 1] > > What do others think? Either way some clarification in the > specification would be useful. Because the code used to do random > access is the same in all cases, I feel weakly supportive of removing > constraints on the offsets. > > - Wes > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou wrote: > > > > > > Hello, > > > > I'd like some clarification on the spec and intent for dense arrays. > > > > Currently, it is specified that offsets of a dense union are "in order / > > increasing" (*). However, it is not obvious whether repeated values are > > allowed or not. > > > > I suspect the intent is to avoid having people exploit unions as some > > kind of poor man's dictionaries. Also, perhaps some optimizations are > > possible if monotonic or strictly monotonic indices are assumed? But I > > don't know the history behind the union type. > > > > Regards > > > > Antoine. > > > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union >
Re: [DISCUSS][C++] Pointer name aliasing
> > I think we should mostly be careful about public APIs. With public > APIs we should write out the types and avoid aliases. With > implementation details and private/protected class members, I think it > is fine to use aliases. My concern with this is that in general if the types are in the header files they have a way of leaking out (whether intentional or not). On Thu, Nov 21, 2019 at 12:06 PM Wes McKinney wrote: > I think we should mostly be careful about public APIs. With public > APIs we should write out the types and avoid aliases. With > implementation details and private/protected class members, I think it > is fine to use aliases. > > On Thu, Nov 21, 2019 at 11:06 AM Antoine Pitrou > wrote: > > > > On Thu, 21 Nov 2019 08:40:10 -0500 > > Francois Saint-Jacques wrote: > > > This notation is already used in some parts of the codebase [1]. I > > > think it was introduced when absorbing gandiva and then in a draft of > > > the logical operations in the compute module. I have no strong opinion > > > for/against. I find it convenient to reduce typing, but the style > > > guide argue against this. > > > > > > What about other aliases (Vector & Iterator)? If we revert this > > > change, we should do it uniformly, e.g. in gandiva and compute. > > > > Vector and Iterator sound ok to me (though Iterator could yield some > > confusion with STL iterators, and Iterator isn't really longer to > > type than TIterator). > > > > Regards > > > > Antoine. > > > > >
Re: Creating arrays from existing arrays in Cython
Hi Micah, I was trying to create an Int64Builder class but kept getting a type identifier error. So, I did a bit of digging and realized I was looking at the latest commit of libarrow.pxd on GitHub which wasn't actually released as part of 0.15.1. Thanks for your help anyways! Suhail On Sat, Nov 16, 2019 at 11:20 PM Micah Kornfield wrote: > Hi Suhail, > I'm not sure there are any convenience function to initialize an > ArrayBuilder class from an existing Array. But I imagine you should be > able to use the cython definitions in > "python//pyarrow/includes/libarrow.pxd" and use it in the way you > describe. It might help if you can provide a pointer to minimal code > sample. > > Thanks, > Micah > > > > > > On Fri, Nov 15, 2019 at 1:21 PM Suhail Razzak > wrote: > > > Hi, > > > > I'm trying to create arrays from an existing array but I'm not sure how > > exactly to do it. I tried using the ArrayBuilder class, but I keep > getting > > compiler errors when trying to instantiate one... > > > > So I have a couple questions then: > > > > 1. How would I instantiate and use an ArrayBuilder class? > > 2. Would I build it the same as the C++ way? I.e. builder.get().Append() > > and then builder.get().Finish(new_array)? > > 3. How can I access the underlying data of an Array? I keep getting an > > IndexError when trying array.get().data()[i] > > > > I'm kind of new to Cython too, sorry if this seems dumb. > > > > Thanks, > > Suhail > > > -- Regards, Suhail
[jira] [Created] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva
Wes McKinney created ARROW-7230: --- Summary: [C++] Use vendored std::optional instead of boost::optional in Gandiva Key: ARROW-7230 URL: https://issues.apache.org/jira/browse/ARROW-7230 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Gandiva Reporter: Wes McKinney This may help with overall codebase consistency -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7229) [C++] Unify ConcatenateTables APIs
Zhuo Peng created ARROW-7229: Summary: [C++] Unify ConcatenateTables APIs Key: ARROW-7229 URL: https://issues.apache.org/jira/browse/ARROW-7229 Project: Apache Arrow Issue Type: Improvement Reporter: Zhuo Peng Assignee: Zhuo Peng Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in C++. It's anticipated that they will allow more customization/tweaking. To avoid complicating the API surface, we should introduce a ConcatenateTableOption object, unify the two functions, and allow further customization to be expressed in that option object. Related discussion: [https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
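The options-object idea in this issue can be sketched in plain Python. This is only an illustration of the API shape, not the C++ implementation: `ConcatenateOptions`, its `promote_missing` field, and `concatenate_tables` are hypothetical names, tables are modeled as dicts of column name to list, and "promotion" is assumed to mean filling columns that a table lacks with nulls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcatenateOptions:
    """Hypothetical options object; further knobs could be added as fields."""
    promote_missing: bool = False  # fill absent columns with None when True


def concatenate_tables(tables, options=ConcatenateOptions()):
    """Concatenate dict-of-lists tables under the given options."""
    if not tables:
        raise ValueError("need at least one table")
    if not options.promote_missing:
        # Strict mode: every table must have the same columns in the same order.
        for table in tables[1:]:
            if list(table) != list(tables[0]):
                raise ValueError("schemas differ and promotion is disabled")
    # Collect the union of column names in first-seen order.
    names = []
    for table in tables:
        for name in table:
            if name not in names:
                names.append(name)
    result = {name: [] for name in names}
    for table in tables:
        num_rows = len(next(iter(table.values()))) if table else 0
        for name in names:
            result[name].extend(table.get(name, [None] * num_rows))
    return result


t1 = {"a": [1, 2]}
t2 = {"a": [3], "b": ["x"]}
merged = concatenate_tables([t1, t2], ConcatenateOptions(promote_missing=True))
print(merged)
```

With a single entry point, a new behavior becomes a new field on the options object instead of another function name.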
[jira] [Created] (ARROW-7228) [Python] Expose RecordBatch.FromStructArray in Python.
Zhuo Peng created ARROW-7228: Summary: [Python] Expose RecordBatch.FromStructArray in Python. Key: ARROW-7228 URL: https://issues.apache.org/jira/browse/ARROW-7228 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Zhuo Peng Assignee: Zhuo Peng Fix For: 1.0.0 This API was introduced in ARROW-6243. It will make converting from a list of Python dicts to a RecordBatch easier: struct_array = pa.array([{"column1": 1, "column2": 5}, {"column2": 6}]) record_batch = pa.RecordBatch.from_struct_array(struct_array) -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)
Forgot to say, my vote is +1 (binding). On Thu, Nov 21, 2019 at 12:09 PM Wes McKinney wrote: > +1 (binding). Thanks Micah > > On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield > wrote: > > > > Hello, > > As discussed on [1], I've proposed clarifications in a PR [2] that > > clarifies: > > > > 1. It is not required that all dictionary batches occur at the beginning > > of the IPC stream format (if the first record batch has an all-null > > dictionary-encoded column, the null column's dictionary might not be sent > > until later in the stream). > > > > 2. A second dictionary batch for the same ID that is not a "delta batch" > > in an IPC stream indicates the dictionary should be replaced. > > > > 3. Clarifies that the file format can only contain 1 "NON-delta" > > dictionary batch and multiple "delta" dictionary batches. Dictionary > > replacement is not supported in the file format. > > > > 4. Add an enum to dictionary metadata for possible future changes in > what > > format dictionary batches can be sent. (the most likely would be an array > > Map). An enum is needed as a placeholder to allow for > forward > > compatibility past the release 1.0.0. > > > > If accepted there will be work in all implementations to make sure that > > they cover the edge cases highlighted and additional integration testing > > will be needed. > > > > Please vote whether to accept these additions. The vote will be open for > at > > least 72 hours. > > > > [ ] +1 Accept these changes to the specification > > [ ] +0 > > [ ] -1 Do not accept the changes because... > > > > Thanks, > > Micah > > > > > > [1] > > > https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E > > [2] https://github.com/apache/arrow/pull/5585 >
Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)
+1 (binding). Thanks Micah On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield wrote: > > Hello, > As discussed on [1], I've proposed clarifications in a PR [2] that > clarifies: > > 1. It is not required that all dictionary batches occur at the beginning > of the IPC stream format (if the first record batch has an all-null > dictionary-encoded column, the null column's dictionary might not be sent > until later in the stream). > > 2. A second dictionary batch for the same ID that is not a "delta batch" > in an IPC stream indicates the dictionary should be replaced. > > 3. Clarifies that the file format can only contain 1 "NON-delta" > dictionary batch and multiple "delta" dictionary batches. Dictionary > replacement is not supported in the file format. > > 4. Add an enum to dictionary metadata for possible future changes in what > format dictionary batches can be sent. (the most likely would be an array > Map). An enum is needed as a placeholder to allow for forward > compatibility past the release 1.0.0. > > If accepted there will be work in all implementations to make sure that > they cover the edge cases highlighted and additional integration testing > will be needed. > > Please vote whether to accept these additions. The vote will be open for at > least 72 hours. > > [ ] +1 Accept these changes to the specification > [ ] +0 > [ ] -1 Do not accept the changes because... > > Thanks, > Micah > > > [1] > https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E > [2] https://github.com/apache/arrow/pull/5585
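Points 2 and 3 of the quoted proposal describe how a reader treats successive dictionary batches for one ID. The following is a minimal sketch of that bookkeeping in plain Python, not any Arrow implementation: `DictionaryTracker`, `allow_replacement`, and `on_dictionary_batch` are hypothetical names, and rejecting a delta that arrives before any base dictionary is an assumption of this sketch rather than something stated in the proposal.

```python
class DictionaryTracker:
    """Tracks dictionaries by ID as dictionary batches arrive (hypothetical)."""

    def __init__(self, allow_replacement):
        # True models the IPC stream format, False models the file format.
        self.allow_replacement = allow_replacement
        self.dictionaries = {}

    def on_dictionary_batch(self, dict_id, values, is_delta=False):
        if is_delta:
            # A delta batch appends to the existing dictionary.
            if dict_id not in self.dictionaries:
                raise ValueError("delta batch without a base dictionary")
            self.dictionaries[dict_id] = self.dictionaries[dict_id] + list(values)
        elif dict_id in self.dictionaries and not self.allow_replacement:
            # A second non-delta batch is a replacement, which the file
            # format does not support.
            raise ValueError("dictionary replacement is not allowed here")
        else:
            # First batch for this ID, or a replacement in a stream.
            self.dictionaries[dict_id] = list(values)


stream = DictionaryTracker(allow_replacement=True)
stream.on_dictionary_batch(0, ["a", "b"])
stream.on_dictionary_batch(0, ["c"], is_delta=True)
after_delta = list(stream.dictionaries[0])
stream.on_dictionary_batch(0, ["x"])  # non-delta: replaces the dictionary
after_replace = list(stream.dictionaries[0])
print(after_delta, after_replace)
```

Constructing the tracker with `allow_replacement=False` models the file format, where the second non-delta batch raises instead of replacing.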
Re: [DISCUSS][C++] Pointer name aliasing
I think we should mostly be careful about public APIs. With public APIs we should write out the types and avoid aliases. With implementation details and private/protected class members, I think it is fine to use aliases. On Thu, Nov 21, 2019 at 11:06 AM Antoine Pitrou wrote: > > On Thu, 21 Nov 2019 08:40:10 -0500 > Francois Saint-Jacques wrote: > > This notation is already used in some parts of the codebase [1]. I > > think it was introduced when absorbing gandiva and then in a draft of > > the logical operations in the compute module. I have no strong opinion > > for/against. I find it convenient to reduce typing, but the style > > guide argue against this. > > > > What about other aliases (Vector & Iterator)? If we revert this > > change, we should do it uniformly, e.g. in gandiva and compute. > > Vector and Iterator sound ok to me (though Iterator could yield some > confusion with STL iterators, and Iterator isn't really longer to > type than TIterator). > > Regards > > Antoine. > >
Re: Unions: storing type_ids or type_codes?
hi Antoine, The latter is correct, or at least what is intended in the specification. For example, if the type metadata indicates codes [0, 5, 10], then the "types" buffer should contain values selected from these codes rather than physical child indexes (which would be [0, 1, 2] in this case). Thanks On Thu, Nov 21, 2019 at 9:51 AM Antoine Pitrou wrote: > > > Hello, > > There's some ambiguity whether a union array's "types" buffer stores > physical child ids, or logical type codes. > > Some of our C++ tests assume the former: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123 > > Some of our C++ tests assume the latter: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326 > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955 > > Critically, no validation of union data is currently implemented in C++ > (ARROW-6157). I can't parse the Java source code. > > Regards > > Antoine. >
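The distinction between logical type codes and physical child indexes can be sketched as a lookup in plain Python. The function name `child_indices` and the sample buffer are illustrative assumptions and not part of any Arrow API.

```python
def child_indices(types_buffer, type_codes):
    """Map each logical type code in the buffer to its physical child index."""
    code_to_child = {code: index for index, code in enumerate(type_codes)}
    result = []
    for code in types_buffer:
        if code not in code_to_child:
            # A value outside the declared codes means the buffer is invalid
            # under the "logical type codes" reading.
            raise ValueError(f"unknown type code {code}")
        result.append(code_to_child[code])
    return result


type_codes = [0, 5, 10]        # codes declared in the type metadata
types_buffer = [5, 0, 10, 5]   # values stored in the "types" buffer
physical = child_indices(types_buffer, type_codes)
print(physical)
```

A buffer such as [0, 1, 2], which would be fine if the values were physical child indexes, is rejected here because 1 and 2 are not declared codes.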
Re: Dense unions: monotonic or strictly monotonic offsets?
hi Antoine, It's a good question. The intent when we wrote the specification was to be strictly monotonic, but there seems nothing especially harmful about relaxing the constraint to allow for repeated values or even non-monotonicity (strict or otherwise). For example, if we had the union ['a', 'a', 'a', 0, 1, 'b', 'b'] then this could be represented as type_ids: [0, 0, 0, 1, 1, 0, 0] offsets: [0, 0, 0, 0, 1, 1, 1] child[0]: ['a', 'b'] child[1]: [0, 1] or type_ids: [0, 0, 0, 1, 1, 0, 0] offsets: [1, 1, 1, 0, 1, 0, 0] child[0]: ['b', 'a'] child[1]: [0, 1] What do others think? Either way some clarification in the specification would be useful. Because the code used to do random access is the same in all cases, I feel weakly supportive of removing constraints on the offsets. - Wes On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou wrote: > > > Hello, > > I'd like some clarification on the spec and intent for dense arrays. > > Currently, it is specified that offsets of a dense union are "in order / > increasing" (*). However, it is not obvious whether repeated values are > allowed or not. > > I suspect the intent is to avoid having people exploit unions as some > kind of poor man's dictionaries. Also, perhaps some optimizations are > possible if monotonic or strictly monotonic indices are assumed? But I > don't know the history behind the union type. > > Regards > > Antoine. > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
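The point that random access is the same in all cases can be checked with a small plain-Python sketch. The function names `union_get` and `decode` are hypothetical, and type ids are assumed to coincide with child indices, as in the example above.

```python
def union_get(type_ids, offsets, children, i):
    """Random access: pick the child by type id, then index it by offset."""
    return children[type_ids[i]][offsets[i]]


def decode(type_ids, offsets, children):
    """Materialize every logical slot of the dense union."""
    return [union_get(type_ids, offsets, children, i) for i in range(len(type_ids))]


type_ids = [0, 0, 0, 1, 1, 0, 0]

# First representation: offsets are non-decreasing within each child.
repr_a = decode(type_ids, [0, 0, 0, 0, 1, 1, 1], [["a", "b"], [0, 1]])

# Second representation: offsets into child 0 are not ordered at all.
repr_b = decode(type_ids, [1, 1, 1, 0, 1, 0, 0], [["b", "a"], [0, 1]])

print(repr_a)
print(repr_a == repr_b)
```

Both layouts decode to the same logical values through the same lookup, so the ordering of offsets does not affect correctness of reads in this model.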
Re: [DISCUSS][C++] Pointer name aliasing
On Thu, 21 Nov 2019 08:40:10 -0500 Francois Saint-Jacques wrote: > This notation is already used in some parts of the codebase [1]. I > think it was introduced when absorbing gandiva and then in a draft of > the logical operations in the compute module. I have no strong opinion > for/against. I find it convenient to reduce typing, but the style > guide argue against this. > > What about other aliases (Vector & Iterator)? If we revert this > change, we should do it uniformly, e.g. in gandiva and compute. Vector and Iterator sound ok to me (though Iterator could yield some confusion with STL iterators, and Iterator<T> isn't really longer to type than TIterator). Regards Antoine.
Adding stronger warnings about pre-production Arrow IPC implementations (C#, Rust)
hi folks, We're accruing some bug reports relating to the C# library when it comes to interop with other languages. Nowhere in https://github.com/apache/arrow/blob/master/csharp/README.md is it clearly stated that such problems are to be anticipated. Until C# participates in the integration tests as a first-class citizen I think we should insert a highly visible warning to not build any production applications depending on IPC-level interoperability (unless you're prepared to roll up your sleeves and debug/fix problems in the libraries). To be clear, it's good to have the bug reports, but we should also set expectations appropriately. Note that this is not stated in the Rust README either, so it is probably a good idea to do this there, too. - Wes
[jira] [Created] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()
Zhuo Peng created ARROW-7227: Summary: [Python] Provide wrappers for ConcatenateWithPromotion() Key: ARROW-7227 URL: https://issues.apache.org/jira/browse/ARROW-7227 Project: Apache Arrow Issue Type: New Feature Reporter: Zhuo Peng Assignee: Zhuo Peng Fix For: 1.0.0 [https://github.com/apache/arrow/pull/5534] Introduced ConcatenateWithPromotion() to C++. Provide a Python wrapper for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: question about Columnar “Streaming Protocol” Change since 0.14.0
hi Andong, Yes. Here is the commit implementing these changes https://github.com/apache/arrow/commit/3eaceec8561d6b783d56f7b82e091c19e7fb043c#diff-32981a13284db7a021131df49e6cd203 - Wes On Thu, Nov 21, 2019 at 12:44 AM Andong Zhan wrote: > > Hi Arrow developers, > > We noticed that since 0.15.0 the columnar streaming protocol changed and > cannot be read by the older versions. My question is that is the recent JS > library compatible with this new change? > > Thanks, > Andong > > -- > Andong zhan > Software Engineer > > Snowflake Inc. > 450 Concar Drive, San Mateo, CA 94402 > M: +1 443-676-7381 | Email: andong.z...@snowflake.com >
Unions: storing type_ids or type_codes?
Hello, There's some ambiguity whether a union array's "types" buffer stores physical child ids, or logical type codes. Some of our C++ tests assume the former: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123 Some of our C++ tests assume the latter: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326 https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955 Critically, no validation of union data is currently implemented in C++ (ARROW-6157). I can't parse the Java source code. Regards Antoine.
Dense unions: monotonic or strictly monotonic offsets?
Hello, I'd like some clarification on the spec and intent for dense arrays. Currently, it is specified that offsets of a dense union are "in order / increasing" (*). However, it is not obvious whether repeated values are allowed or not. I suspect the intent is to avoid having people exploit unions as some kind of poor man's dictionaries. Also, perhaps some optimizations are possible if monotonic or strictly monotonic indices are assumed? But I don't know the history behind the union type. Regards Antoine. (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
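The difference between the two readings of "in order / increasing" can be made concrete with a plain-Python sketch. It assumes the ordering constraint applies to the offsets of each child separately; the function names and the sample arrays are illustrative and not taken from the spec.

```python
def offsets_per_child(type_ids, offsets):
    """Group offsets by the child they refer to, keeping array order."""
    per_child = {}
    for type_id, offset in zip(type_ids, offsets):
        per_child.setdefault(type_id, []).append(offset)
    return per_child


def classify(type_ids, offsets):
    """Report the strongest ordering property that holds for every child."""
    strict = True
    monotonic = True
    for child_offsets in offsets_per_child(type_ids, offsets).values():
        for prev, cur in zip(child_offsets, child_offsets[1:]):
            if cur <= prev:
                strict = False      # a repeat or a decrease breaks strictness
            if cur < prev:
                monotonic = False   # only a decrease breaks monotonicity
    if strict:
        return "strictly increasing"
    if monotonic:
        return "non-decreasing"
    return "unordered"


print(classify([0, 0, 1, 0], [0, 1, 0, 2]))
print(classify([0, 0, 1, 0], [0, 0, 0, 1]))
print(classify([0, 0, 1, 0], [1, 0, 0, 2]))
```

Repeated offsets pass the non-decreasing check but fail the strict one, which is exactly the case the current wording leaves open.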
[jira] [Created] (ARROW-7226) [JSON] Json loader fails on example in documentation.
Rinke Hoekstra created ARROW-7226: - Summary: [JSON] Json loader fails on example in documentation. Key: ARROW-7226 URL: https://issues.apache.org/jira/browse/ARROW-7226 Project: Apache Arrow Issue Type: Bug Reporter: Rinke Hoekstra I was just trying this with the example found in the pyarrow docs at [http://arrow.apache.org/docs/python/json.html] The documented example does not work. Is this related to this issue, or is it another matter? It says to load the following JSON file: {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}} {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}} I fixed this to make it valid (but that's another issue): [{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}, {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}] Then reading the JSON from a file called `my_data.json`: from pyarrow import json table = json.read_json("my_data.json") Gives the following error: {code:java} ArrowInvalid Traceback (most recent call last) in () 1 from pyarrow import json > 2 table = json.read_json('test.json') ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json() ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: JSON parse error: A column changed from object to array {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
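One possible reading of this report, offered as an assumption and not as a confirmed diagnosis, is that the reader expects one JSON object per line (as the documented example is laid out) and that wrapping the records in a top-level array is what triggers the "changed from object to array" error. Under that assumption, the sketch below uses only the standard library `json` module to turn the array form back into line-delimited form; it does not call pyarrow.

```python
import json

# The "fixed" file content from the report: a single top-level JSON array.
array_form = (
    '[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},'
    ' {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]'
)

records = json.loads(array_form)

# Line-delimited form: each record serialized as one object on its own line.
line_delimited = "".join(json.dumps(record) + "\n" for record in records)

print(line_delimited, end="")
```

Whether the file in the report fails for this reason would need to be checked against the reader itself.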
Re: [DISCUSS][C++] Pointer name aliasing
This notation is already used in some parts of the codebase [1]. I think it was introduced when absorbing gandiva and then in a draft of the logical operations in the compute module. I have no strong opinion for/against. I find it convenient to reduce typing, but the style guide argue against this. What about other aliases (Vector & Iterator)? If we revert this change, we should do it uniformly, e.g. in gandiva and compute. François [1] https://gist.github.com/fsaintjacques/18720eebd9de3bb7770586ed8ec0ef6f On Thu, Nov 21, 2019 at 6:10 AM Antoine Pitrou wrote: > > On Wed, 20 Nov 2019 20:50:12 -0800 > Micah Kornfield wrote: > > A recent PR for datasets [1] seems to have introduced the convention of > > aliasing "std::shared_ptr" with "TypePtr" for some type. I think in > > the past we had decided not to use a convention like this but I can't find > > the thread. IMO, I think this generally makes the code less understandable > > but this is a matter of taste. > > I agree this introduces ambiguity for casual readers of the code. The > question is whether the savings in typing are worth it. Personally, > I've become used to writing "shared_ptr" a lot :-) > > Regards > > Antoine. > >
[NIGHTLY] Arrow Build Report for Job nightly-2019-11-21-0
Arrow Build Report for Job nightly-2019-11-21-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0 Failed Tasks: - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py27 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py37 - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-homebrew-cpp - test-conda-python-2.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-2.7-pandas-master - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-dask-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-dask-master - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-spark-master - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-turbodbc-latest - 
test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-turbodbc-master - test-conda-python-3.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7 - test-ubuntu-14.04-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-ubuntu-14.04-cpp - test-ubuntu-fuzzit: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-ubuntu-fuzzit - wheel-manylinux1-cp27m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp27m - wheel-manylinux1-cp27mu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp27mu - wheel-manylinux1-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp35m - wheel-manylinux1-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp36m - wheel-manylinux1-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp37m - wheel-manylinux2010-cp27m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp27m - wheel-manylinux2010-cp27mu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp27mu - wheel-manylinux2010-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp35m - wheel-manylinux2010-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp36m - wheel-manylinux2010-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp37m 
Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-8 - conda-linux-gcc-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py27 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py37 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37:
[jira] [Created] (ARROW-7225) [C++] `*std::move(Result<T>)` calls T copy constructor
Antoine Pitrou created ARROW-7225: - Summary: [C++] `*std::move(Result<T>)` calls T copy constructor Key: ARROW-7225 URL: https://issues.apache.org/jira/browse/ARROW-7225 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.15.1 Reporter: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS][C++] Pointer name aliasing
On Wed, 20 Nov 2019 20:50:12 -0800 Micah Kornfield wrote: > A recent PR for datasets [1] seems to have introduced the convention of > aliasing "std::shared_ptr" with "TypePtr" for some type. I think in > the past we had decided not to use a convention like this but I can't find > the thread. IMO, I think this generally makes the code less understandable > but this is a matter of taste. I agree this introduces ambiguity for casual readers of the code. The question is whether the savings in typing are worth it. Personally, I've become used to writing "shared_ptr" a lot :-) Regards Antoine.
Re: MIME type
I found Apache Thrift registers the following MIME types: * application/vnd.apache.thrift.binary * application/vnd.apache.thrift.compact * application/vnd.apache.thrift.json https://www.iana.org/assignments/media-types/media-types.xhtml Thrift uses "vnd.apache." prefix[1]. [1] https://tools.ietf.org/html/rfc6838 > Vendor-tree registrations will be distinguished by the leading facet > "vnd.". That may be followed, at the discretion of the registrant, > by either a media subtype name from a well-known producer (e.g., > "vnd.mudpie") or by an IANA-approved designation of the producer's > name that is followed by a media type or product designation (e.g., > vnd.bigcompany.funnypictures). vnd.apache.thrift.binary was registered at 2014-09-09: https://www.iana.org/assignments/media-types/application/vnd.apache.thrift.binary Should we register our MIME types to IANA? It seems that Apache Thrift uses application/x-thift (typo?) before Apache Thrift registers these MIME types. > The application/x-thift media type is currently used to describe multiple > formats/protocols. Communications endpoints need to the format/protocol > used, so this media type should be used preferentially when it is > appropriate to do so. In "Re: MIME type" on Wed, 20 Nov 2019 12:01:54 +0100, Antoine Pitrou wrote: > > If it's not standardized, shouldn't it be prefixed with x-? > > e.g. application/x-apache-arrow-stream > > > Le 20/11/2019 à 08:29, Micah Kornfield a écrit : >> I would propose: >> application/apache-arrow-stream >> application/apache-arrow-file >> >> I'm not attached to those names but I think there should be two different >> mime-types, since the formats are not interchangeable. >> >> On Tue, Nov 19, 2019 at 10:31 PM Sutou Kouhei wrote: >> >>> Hi, >>> >>> What MIME type should be used for Apache Arrow data? >>> application/arrow? >>> >>> Should we use the same MIME type for IPC Streaming Format[1] >>> and IPC File Format[2]? Or should we use different MIME >>> types for them? 
>>> >>> [1] >>> https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format >>> [2] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format >>> >>> >>> Thanks, >>> -- >>> kou >>> >>