Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-21 Thread Fan Liya
This is an interesting question.
IMO, to support repeated values, we also need to design a "coherency
protocol", to avoid the scenario where once a value is written, the change
is propagated to another slot unexpectedly.
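
To make the concern concrete, here is a minimal sketch in plain Python (not Arrow code). It reuses the repeated-offsets layout from Wes's example; `read` and `naive_write` are hypothetical helpers, and the in-place write is an assumption about how a naive mutable implementation might behave.

```python
type_ids = [0, 0, 0, 1, 1, 0, 0]
offsets = [0, 0, 0, 0, 1, 1, 1]
children = [["a", "b"], [0, 1]]


def read(i):
    # Standard dense-union lookup: pick the child, then the slot in it.
    return children[type_ids[i]][offsets[i]]


def naive_write(i, value):
    # Hypothetical in-place write: overwrites the child slot that
    # logical position i happens to point at.
    children[type_ids[i]][offsets[i]] = value


naive_write(0, "z")
after = [read(i) for i in range(len(type_ids))]
print(after)  # positions 1 and 2 changed too, since they share child[0][0]
```

With strictly increasing offsets every logical position owns its own child slot, so this kind of aliasing cannot occur.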

Best,
Liya Fan

On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield 
wrote:

> Hmm, I also thought the intention was monotonically increasing. I can't
> think of a strong reason one way or another. If the argument about code to
> do random access is the same in all cases, is there any benefit to forcing
> any order at all?  Memory prefetching?
>
> On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney  wrote:
>
> > hi Antoine,
> >
> > It's a good question.
> >
> > The intent when we wrote the specification was to be strictly
> > monotonic, but there seems nothing especially harmful about relaxing
> > the constraint to allow for repeated values or even non-monotonicity
> > (strict or otherwise). For example, if we had the union
> >
> > ['a', 'a', 'a', 0, 1, 'b', 'b']
> >
> > then this could be represented as
> >
> > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > offsets: [0, 0, 0, 0, 1, 1, 1]
> > child[0]: ['a', 'b']
> > child[1]: [0, 1]
> >
> > or
> >
> > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > offsets: [1, 1, 1, 0, 1, 0, 0]
> > child[0]: ['b', 'a']
> > child[1]: [0, 1]
> >
> > What do others think? Either way some clarification in the
> > specification would be useful. Because the code used to do random
> > access is the same in all cases, I feel weakly supportive of removing
> > constraints on the offsets.
> >
> > - Wes
> >
> > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou wrote:
> > >
> > >
> > > Hello,
> > >
> > > I'd like some clarification on the spec and intent for dense arrays.
> > >
> > > Currently, it is specified that offsets of a dense union are "in order /
> > > increasing" (*).  However, it is not obvious whether repeated values are
> > > allowed or not.
> > >
> > > I suspect the intent is to avoid having people exploit unions as some
> > > kind of poor man's dictionaries.  Also, perhaps some optimizations are
> > > possible if monotonic or strictly monotonic indices are assumed?  But I
> > > don't know the history behind the union type.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> >
>


[jira] [Created] (ARROW-7240) [C++] Add Result to APIs to arrow/util

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7240:
--

 Summary: [C++] Add Result to APIs to arrow/util
 Key: ARROW-7240
 URL: https://issues.apache.org/jira/browse/ARROW-7240
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7239) [C++] Add Result to APIs to plasma

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7239:
--

 Summary: [C++] Add Result to APIs to plasma
 Key: ARROW-7239
 URL: https://issues.apache.org/jira/browse/ARROW-7239
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-7237) [C++] Add Result to APIs to arrow/json

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7237:
--

 Summary: [C++] Add Result to APIs to arrow/json
 Key: ARROW-7237
 URL: https://issues.apache.org/jira/browse/ARROW-7237
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-7238) [C++] Add Result to APIs to arrow/adapters

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7238:
--

 Summary: [C++] Add Result to APIs to arrow/adapters
 Key: ARROW-7238
 URL: https://issues.apache.org/jira/browse/ARROW-7238
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-7236) [C++] Add Result to APIs to arrow/csv

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7236:
--

 Summary: [C++] Add Result to APIs to arrow/csv
 Key: ARROW-7236
 URL: https://issues.apache.org/jira/browse/ARROW-7236
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-7235) [C++] Add Result to APIs to arrow/io

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7235:
--

 Summary: [C++] Add Result to APIs to arrow/io
 Key: ARROW-7235
 URL: https://issues.apache.org/jira/browse/ARROW-7235
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield








[jira] [Created] (ARROW-7234) [C++] Add Result to APIs to Gandiva

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7234:
--

 Summary: [C++] Add Result to APIs to Gandiva
 Key: ARROW-7234
 URL: https://issues.apache.org/jira/browse/ARROW-7234
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, Array builders (anything in the src/arrow root directory)





[jira] [Created] (ARROW-7232) [C++] Add Result to APIs to core vector structures

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7232:
--

 Summary: [C++] Add Result to APIs to core vector structures
 Key: ARROW-7232
 URL: https://issues.apache.org/jira/browse/ARROW-7232
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, Array builders (anything in the src/arrow root directory)





[jira] [Created] (ARROW-7233) [C++] Add Result APIs to IPC module

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7233:
--

 Summary: [C++] Add Result APIs to IPC module
 Key: ARROW-7233
 URL: https://issues.apache.org/jira/browse/ARROW-7233
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, Array builders (anything in the src/arrow root directory)





[jira] [Created] (ARROW-7231) [C++] Parent bug for tracking migration to Result

2019-11-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7231:
--

 Summary: [C++] Parent bug for tracking migration to Result
 Key: ARROW-7231
 URL: https://issues.apache.org/jira/browse/ARROW-7231
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield








Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-21 Thread Micah Kornfield
Hmm, I also thought the intention was monotonically increasing. I can't
think of a strong reason one way or another. If the argument about code to
do random access is the same in all cases, is there any benefit to forcing
any order at all?  Memory prefetching?

On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney  wrote:

> hi Antoine,
>
> It's a good question.
>
> The intent when we wrote the specification was to be strictly
> monotonic, but there seems nothing especially harmful about relaxing
> the constraint to allow for repeated values or even non-monotonicity
> (strict or otherwise). For example, if we had the union
>
> ['a', 'a', 'a', 0, 1, 'b', 'b']
>
> then this could be represented as
>
> type_ids: [0, 0, 0, 1, 1, 0, 0]
> offsets: [0, 0, 0, 0, 1, 1, 1]
> child[0]: ['a', 'b']
> child[1]: [0, 1]
>
> or
>
> type_ids: [0, 0, 0, 1, 1, 0, 0]
> offsets: [1, 1, 1, 0, 1, 0, 0]
> child[0]: ['b', 'a']
> child[1]: [0, 1]
>
> What do others think? Either way some clarification in the
> specification would be useful. Because the code used to do random
> access is the same in all cases, I feel weakly supportive of removing
> constraints on the offsets.
>
> - Wes
>
> On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > I'd like some clarification on the spec and intent for dense arrays.
> >
> > Currently, it is specified that offsets of a dense union are "in order /
> > increasing" (*).  However, it is not obvious whether repeated values are
> > allowed or not.
> >
> > I suspect the intent is to avoid having people exploit unions as some
> > kind of poor man's dictionaries.  Also, perhaps some optimizations are
> > possible if monotonic or strictly monotonic indices are assumed?  But I
> > don't know the history behind the union type.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
>


Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Micah Kornfield
>
> I think we should mostly be careful about public APIs. With public
> APIs we should write out the types and avoid aliases. With
> implementation details and private/protected class members, I think it
> is fine to use aliases.

My concern with this is that in general if the types are in the header
files they have a way of leaking out (whether intentional or not).



On Thu, Nov 21, 2019 at 12:06 PM Wes McKinney  wrote:

> I think we should mostly be careful about public APIs. With public
> APIs we should write out the types and avoid aliases. With
> implementation details and private/protected class members, I think it
> is fine to use aliases.
>
> On Thu, Nov 21, 2019 at 11:06 AM Antoine Pitrou 
> wrote:
> >
> > On Thu, 21 Nov 2019 08:40:10 -0500
> > Francois Saint-Jacques  wrote:
> > > This notation is already used in some parts of the codebase [1]. I
> > > think it was introduced when absorbing gandiva and then in a draft of
> > > the logical operations in the compute module. I have no strong opinion
> > > for/against. I find it convenient to reduce typing, but the style
> > > > guide argues against this.
> > >
> > > What about other aliases (Vector & Iterator)? If we revert this
> > > change, we should do it uniformly, e.g. in gandiva and compute.
> >
> > Vector and Iterator sound ok to me (though Iterator could yield some
> > confusion with STL iterators, and Iterator isn't really longer to
> > type than TIterator).
> >
> > Regards
> >
> > Antoine.
> >
> >
>


Re: Creating arrays from existing arrays in Cython

2019-11-21 Thread Suhail Razzak
Hi Micah,

I was trying to create an Int64Builder class but kept getting a type
identifier error. So, I did a bit of digging and realized I was looking at
the latest commit of libarrow.pxd on GitHub which wasn't actually released
as part of 0.15.1.

Thanks for your help anyways!

Suhail

On Sat, Nov 16, 2019 at 11:20 PM Micah Kornfield 
wrote:

> Hi Suhail,
> I'm not sure there are any convenience functions to initialize an
> ArrayBuilder class from an existing Array.  But I imagine you should be
> able to use the Cython definitions in
> "python/pyarrow/includes/libarrow.pxd" and use them in the way you
> describe.  It might help if you can provide a pointer to a minimal code
> sample.
>
> Thanks,
> Micah
>
>
>
>
>
> On Fri, Nov 15, 2019 at 1:21 PM Suhail Razzak 
> wrote:
>
> > Hi,
> >
> > I'm trying to create arrays from an existing array but I'm not sure how
> > exactly to do it. I tried using the ArrayBuilder class, but I keep getting
> > compiler errors when trying to instantiate one...
> >
> > So I have a couple questions then:
> >
> > 1. How would I instantiate and use an ArrayBuilder class?
> > 2. Would I build it the same as the C++ way? I.e. builder.get().Append()
> > and then builder.get().Finish(new_array)?
> > 3. How can I access the underlying data of an Array? I keep getting an
> > IndexError when trying array.get().data()[i]
> >
> > I'm kind of new to Cython too, sorry if this seems dumb.
> >
> > Thanks,
> > Suhail
> >
>


-- 
Regards,

Suhail


[jira] [Created] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva

2019-11-21 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7230:
---

 Summary: [C++] Use vendored std::optional instead of 
boost::optional in Gandiva
 Key: ARROW-7230
 URL: https://issues.apache.org/jira/browse/ARROW-7230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney


This may help with overall codebase consistency





[jira] [Created] (ARROW-7229) [C++] Unify ConcatenateTables APIs

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7229:


 Summary: [C++] Unify ConcatenateTables APIs
 Key: ARROW-7229
 URL: https://issues.apache.org/jira/browse/ARROW-7229
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in C++. 
It's anticipated that they will allow more customization/tweaking. To avoid 
complicating the API surface, we should introduce a ConcatenateTableOption 
object, unify the two functions, and allow further customization to be 
expressed in that option object.

Related discussion: 
[https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E]

 





[jira] [Created] (ARROW-7228) [Python] Expose RecordBatch.FromStructArray in Python.

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7228:


 Summary: [Python] Expose RecordBatch.FromStructArray in Python.
 Key: ARROW-7228
 URL: https://issues.apache.org/jira/browse/ARROW-7228
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


This API was introduced in ARROW-6243. It will make converting from a list of 
python dicts to a RecordBatch easier:

 

import pyarrow as pa

struct_array = pa.array([{"column1": 1, "column2": 5}, {"column2": 6}])

record_batch = pa.RecordBatch.from_struct_array(struct_array)





Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-21 Thread Micah Kornfield
Forgot to say,  My vote is +1 (binding).

On Thu, Nov 21, 2019 at 12:09 PM Wes McKinney  wrote:

> +1 (binding). Thanks Micah
>
> On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield 
> wrote:
> >
> > Hello,
> > As discussed on [1], I've proposed clarifications in a PR [2] that
> > clarifies:
> >
> > 1.  It is not required that all dictionary batches occur at the beginning
> > of the IPC stream format (if the first record batch has an all null
> > dictionary encoded column, the null column's dictionary might not be sent
> > until later in the stream).
> >
> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> > in an IPC stream indicates the dictionary should be replaced.
> >
> > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> > dictionary batch and multiple "delta" dictionary batches. Dictionary
> > replacement is not supported in the file format.
> >
> > 4.  Add an enum to dictionary metadata for possible future changes in what
> > format dictionary batches can be sent. (the most likely would be an array
> > Map).  An enum is needed as a place holder to allow for forward
> > compatibility past the release 1.0.0.
> >
> > If accepted there will be work in all implementations to make sure that
> > they cover the edge cases highlighted and additional integration testing
> > will be needed.
> >
> > Please vote whether to accept these additions. The vote will be open for at
> > least 72 hours.
> >
> > [ ] +1 Accept these changes to the specification
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> >
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> > [2] https://github.com/apache/arrow/pull/5585
>


Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-21 Thread Wes McKinney
+1 (binding). Thanks Micah

On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield  wrote:
>
> Hello,
> As discussed on [1], I've proposed clarifications in a PR [2] that
> clarifies:
>
> 1.  It is not required that all dictionary batches occur at the beginning
> of the IPC stream format (if the first record batch has an all null
> dictionary encoded column, the null column's dictionary might not be sent
> until later in the stream).
>
> 2.  A second dictionary batch for the same ID that is not a "delta batch"
> in an IPC stream indicates the dictionary should be replaced.
>
> 3.  Clarifies that the file format can only contain 1 "NON-delta"
> dictionary batch and multiple "delta" dictionary batches. Dictionary
> replacement is not supported in the file format.
>
> 4.  Add an enum to dictionary metadata for possible future changes in what
> format dictionary batches can be sent. (the most likely would be an array
> Map).  An enum is needed as a place holder to allow for forward
> compatibility past the release 1.0.0.
>
> If accepted there will be work in all implementations to make sure that
> they cover the edge cases highlighted and additional integration testing
> will be needed.
>
> Please vote whether to accept these additions. The vote will be open for at
> least 72 hours.
>
> [ ] +1 Accept these changes to the specification
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Thanks,
> Micah
>
>
> [1]
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/pull/5585
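
For readers following the vote, points 1 to 3 can be sketched as reader-side state handling. This is plain Python for illustration only, not the Arrow IPC reader; `DictionaryState`, its methods, and the `allow_replacement` flag are hypothetical names, and the sketch assumes a delta batch always follows an initial non-delta batch for the same ID.

```python
class DictionaryState:
    """Tracks dictionaries by ID while reading batches (illustrative only)."""

    def __init__(self, allow_replacement=True):
        # Stream format: replacement allowed. File format: pass False.
        self.allow_replacement = allow_replacement
        self.by_id = {}

    def on_dictionary_batch(self, dict_id, values, is_delta):
        if is_delta:
            # A delta batch appends to the dictionary already seen.
            self.by_id[dict_id] = self.by_id[dict_id] + list(values)
        elif dict_id in self.by_id and not self.allow_replacement:
            raise ValueError("dictionary replacement is not supported here")
        else:
            # A non-delta batch installs or replaces the dictionary.
            self.by_id[dict_id] = list(values)

    def decode(self, dict_id, indices):
        # An all-null column can be decoded before any dictionary arrives.
        dictionary = self.by_id.get(dict_id, [])
        return [None if i is None else dictionary[i] for i in indices]


state = DictionaryState()
batch1 = state.decode(0, [None, None])                     # point 1
state.on_dictionary_batch(0, ["x", "y"], is_delta=False)
batch2 = state.decode(0, [0, 1])
state.on_dictionary_batch(0, ["z"], is_delta=True)          # delta appends
batch3 = state.decode(0, [2, 0])
state.on_dictionary_batch(0, ["p", "q"], is_delta=False)    # point 2: replace
batch4 = state.decode(0, [0, 1])
print(batch1, batch2, batch3, batch4)
```

Constructing the state with `allow_replacement=False` models point 3: a second non-delta batch for the same ID raises instead of replacing.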


Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Wes McKinney
I think we should mostly be careful about public APIs. With public
APIs we should write out the types and avoid aliases. With
implementation details and private/protected class members, I think it
is fine to use aliases.

On Thu, Nov 21, 2019 at 11:06 AM Antoine Pitrou  wrote:
>
> On Thu, 21 Nov 2019 08:40:10 -0500
> Francois Saint-Jacques  wrote:
> > This notation is already used in some parts of the codebase [1]. I
> > think it was introduced when absorbing gandiva and then in a draft of
> > the logical operations in the compute module. I have no strong opinion
> > for/against. I find it convenient to reduce typing, but the style
> > guide argues against this.
> >
> > What about other aliases (Vector & Iterator)? If we revert this
> > change, we should do it uniformly, e.g. in gandiva and compute.
>
> Vector and Iterator sound ok to me (though Iterator could yield some
> confusion with STL iterators, and Iterator isn't really longer to
> type than TIterator).
>
> Regards
>
> Antoine.
>
>


Re: Unions: storing type_ids or type_codes?

2019-11-21 Thread Wes McKinney
hi Antoine,

The latter is correct, or at least what is intended in the specification.

For example, if the type metadata indicates codes [0, 5, 10], then the
"types" buffer should contain values selected from these values rather
than physical child indexes (which would be [0, 1, 2] in this case).
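
A small sketch of that mapping in plain Python (not the Arrow API). The contents of `types_buffer` are made up for illustration; only the codes [0, 5, 10] come from the example above.

```python
type_codes = [0, 5, 10]  # logical codes declared in the type metadata

# Map each logical type code to its physical child index.
code_to_child = {code: idx for idx, code in enumerate(type_codes)}

types_buffer = [5, 0, 10, 5]  # hypothetical "types" buffer contents

# A reader translates codes to child indexes before touching the children.
child_indexes = [code_to_child[code] for code in types_buffer]
print(child_indexes)
```

A value such as 1 or 2 in the "types" buffer would be rejected by this lookup, since it is a child index rather than one of the declared codes.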

Thanks

On Thu, Nov 21, 2019 at 9:51 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> There's some ambiguity whether a union array's "types" buffer stores
> physical child ids, or logical type codes.
>
> Some of our C++ tests assume the former:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123
>
> Some of our C++ tests assume the latter:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955
>
> Critically, no validation of union data is currently implemented in C++
> (ARROW-6157).  I can't parse the Java source code.
>
> Regards
>
> Antoine.
>


Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-21 Thread Wes McKinney
hi Antoine,

It's a good question.

The intent when we wrote the specification was to be strictly
monotonic, but there seems nothing especially harmful about relaxing
the constraint to allow for repeated values or even non-monotonicity
(strict or otherwise). For example, if we had the union

['a', 'a', 'a', 0, 1, 'b', 'b']

then this could be represented as

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [0, 0, 0, 0, 1, 1, 1]
child[0]: ['a', 'b']
child[1]: [0, 1]

or

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [1, 1, 1, 0, 1, 0, 0]
child[0]: ['b', 'a']
child[1]: [0, 1]

What do others think? Either way some clarification in the
specification would be useful. Because the code used to do random
access is the same in all cases, I feel weakly supportive of removing
constraints on the offsets.
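
As an illustration of the "same code in all cases" point, here is a sketch in plain Python (not Arrow code). `union_get` and `decode` are hypothetical helper names, and the sketch assumes the type ids equal the physical child indexes.

```python
def union_get(type_ids, offsets, children, i):
    """Return logical element i of a dense union."""
    # The lookup never inspects the ordering of the offsets.
    return children[type_ids[i]][offsets[i]]


def decode(type_ids, offsets, children):
    return [union_get(type_ids, offsets, children, i)
            for i in range(len(type_ids))]


type_ids = [0, 0, 0, 1, 1, 0, 0]

# First layout: offsets into each child never decrease, but repeat.
first = decode(type_ids, [0, 0, 0, 0, 1, 1, 1], [["a", "b"], [0, 1]])

# Second layout: offsets into child[0] go from 1 back down to 0.
second = decode(type_ids, [1, 1, 1, 0, 1, 0, 0], [["b", "a"], [0, 1]])

print(first)
print(second)
```

Both layouts decode to the same logical union, which is why the ordering constraint does not matter for random access.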

- Wes

On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> I'd like some clarification on the spec and intent for dense arrays.
>
> Currently, it is specified that offsets of a dense union are "in order /
> increasing" (*).  However, it is not obvious whether repeated values are
> allowed or not.
>
> I suspect the intent is to avoid having people exploit unions as some
> kind of poor man's dictionaries.  Also, perhaps some optimizations are
> possible if monotonic or strictly monotonic indices are assumed?  But I
> don't know the history behind the union type.
>
> Regards
>
> Antoine.
>
>
> (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union


Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Antoine Pitrou
On Thu, 21 Nov 2019 08:40:10 -0500
Francois Saint-Jacques  wrote:
> This notation is already used in some parts of the codebase [1]. I
> think it was introduced when absorbing gandiva and then in a draft of
> the logical operations in the compute module. I have no strong opinion
> for/against. I find it convenient to reduce typing, but the style
> guide argues against this.
> 
> What about other aliases (Vector & Iterator)? If we revert this
> change, we should do it uniformly, e.g. in gandiva and compute.

Vector and Iterator sound ok to me (though Iterator could yield some
confusion with STL iterators, and Iterator isn't really longer to
type than TIterator).

Regards

Antoine.




Adding stronger warnings about pre-production Arrow IPC implementations (C#, Rust)

2019-11-21 Thread Wes McKinney
hi folks,

We're accruing some bug reports relating to the C# library when it
comes to interop with other languages.

Nowhere in

https://github.com/apache/arrow/blob/master/csharp/README.md

is it clearly stated that such problems are to be anticipated.

Until C# participates in the integration tests as a first-class
citizen I think we should insert a highly visible warning to not build
any production applications depending on IPC-level interoperability
(unless you're prepared to roll up your sleeves and debug/fix problems
in the libraries). To be clear, it's good to have the bug reports, but
we should also set expectations appropriately.

Note that this is not stated in the Rust README either, so it is
probably a good idea to do this there, too.

- Wes


[jira] [Created] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7227:


 Summary: [Python] Provide wrappers for ConcatenateWithPromotion()
 Key: ARROW-7227
 URL: https://issues.apache.org/jira/browse/ARROW-7227
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


[https://github.com/apache/arrow/pull/5534] introduced
ConcatenateWithPromotion() to C++. Provide a Python wrapper for it.

 





Re: question about Columnar “Streaming Protocol” Change since 0.14.0

2019-11-21 Thread Wes McKinney
hi Andong,

Yes. Here is the commit implementing these changes

https://github.com/apache/arrow/commit/3eaceec8561d6b783d56f7b82e091c19e7fb043c#diff-32981a13284db7a021131df49e6cd203


- Wes


On Thu, Nov 21, 2019 at 12:44 AM Andong Zhan  wrote:
>
> Hi Arrow developers,
>
> We noticed that since 0.15.0 the columnar streaming protocol changed and
> cannot be read by the older versions. My question is: is the recent JS
> library compatible with this new change?
>
> Thanks,
> Andong
>
> --
> Andong zhan
> Software Engineer
>
> Snowflake Inc.
> 450 Concar Drive, San Mateo, CA 94402
> M: +1 443-676-7381 | Email: andong.z...@snowflake.com
>


Unions: storing type_ids or type_codes?

2019-11-21 Thread Antoine Pitrou


Hello,

There's some ambiguity whether a union array's "types" buffer stores
physical child ids, or logical type codes.

Some of our C++ tests assume the former:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123

Some of our C++ tests assume the latter:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955

Critically, no validation of union data is currently implemented in C++
(ARROW-6157).  I can't parse the Java source code.

Regards

Antoine.



Dense unions: monotonic or strictly monotonic offsets?

2019-11-21 Thread Antoine Pitrou


Hello,

I'd like some clarification on the spec and intent for dense arrays.

Currently, it is specified that offsets of a dense union are "in order /
increasing" (*).  However, it is not obvious whether repeated values are
allowed or not.

I suspect the intent is to avoid having people exploit unions as some
kind of poor man's dictionaries.  Also, perhaps some optimizations are
possible if monotonic or strictly monotonic indices are assumed?  But I
don't know the history behind the union type.
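
To pin down the three possible readings of "in order / increasing", here is a sketch of a checker in plain Python. `classify_offsets` is a hypothetical helper, not part of Arrow, and the sample arrays are for illustration only.

```python
def classify_offsets(type_ids, offsets):
    """Classify each child's offsets as 'strict', 'monotonic' or 'unordered'."""
    rank = {"strict": 0, "monotonic": 1, "unordered": 2}
    last = {}    # child id -> most recent offset seen for that child
    result = {}  # child id -> weakest property observed so far
    for tid, off in zip(type_ids, offsets):
        if tid not in last:
            result[tid] = "strict"
        else:
            if off > last[tid]:
                step = "strict"
            elif off == last[tid]:
                step = "monotonic"   # repeated value
            else:
                step = "unordered"   # offset went backwards
            if rank[step] > rank[result[tid]]:
                result[tid] = step
        last[tid] = off
    return result


type_ids = [0, 0, 0, 1, 1, 0, 0]
repeated = classify_offsets(type_ids, [0, 0, 0, 0, 1, 1, 1])
backwards = classify_offsets(type_ids, [1, 1, 1, 0, 1, 0, 0])
strict = classify_offsets(type_ids, [0, 1, 2, 0, 1, 3, 4])
print(repeated, backwards, strict)
```

Repeated values are exactly what separates "monotonic" from "strictly monotonic" here, which is the case the current wording leaves open.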

Regards

Antoine.


(*) https://arrow.apache.org/docs/format/Columnar.html#dense-union


[jira] [Created] (ARROW-7226) [JSON] Json loader fails on example in documentation.

2019-11-21 Thread Rinke Hoekstra (Jira)
Rinke Hoekstra created ARROW-7226:
-

 Summary: [JSON] Json loader fails on example in documentation.
 Key: ARROW-7226
 URL: https://issues.apache.org/jira/browse/ARROW-7226
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Rinke Hoekstra


I was just trying this with the example found in the pyarrow docs at 
[http://arrow.apache.org/docs/python/json.html]

The documented example does not work. Is this related to this issue, or is it 
another matter?

It says to load the following JSON file:

{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

I fixed this to make it valid (but that's another issue):

[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
 {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]

Then reading the JSON from a file called `my_data.json`:

from pyarrow import json
table = json.read_json("my_data.json")

Gives the following error:
{code:java}
---
 ArrowInvalid Traceback (most recent call last)
  in ()
 1 from pyarrow import json
 > 2 table = json.read_json('test.json')
~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx
 in pyarrow._json.read_json()
~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()
ArrowInvalid: JSON parse error: A column changed from object to array
{code}
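
One possible reading, offered as an assumption rather than a diagnosis: if read_json expects line-delimited JSON (one object per line), the documented file is valid input as written, and wrapping it in an array changes its shape. The sketch below uses only the standard library `json` module, not pyarrow, to show the difference between the two interpretations.

```python
import json

documented = (
    '{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}\n'
    '{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}\n'
)

# Line-delimited JSON: every non-empty line is parsed as its own object.
rows = [json.loads(line) for line in documented.splitlines() if line.strip()]
print(len(rows), sorted(rows[0]))

# The text as a whole is not a single JSON document, which is why a
# generic JSON validator rejects it.
try:
    json.loads(documented)
    single_document = True
except json.JSONDecodeError:
    single_document = False
print(single_document)
```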





Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Francois Saint-Jacques
This notation is already used in some parts of the codebase [1]. I
think it was introduced when absorbing gandiva and then in a draft of
the logical operations in the compute module. I have no strong opinion
for/against. I find it convenient to reduce typing, but the style
guide argues against this.

What about other aliases (Vector & Iterator)? If we revert this
change, we should do it uniformly, e.g. in gandiva and compute.

François

[1] https://gist.github.com/fsaintjacques/18720eebd9de3bb7770586ed8ec0ef6f

On Thu, Nov 21, 2019 at 6:10 AM Antoine Pitrou  wrote:
>
> On Wed, 20 Nov 2019 20:50:12 -0800
> Micah Kornfield  wrote:
> > A recent PR for datasets  [1] seems to have introduced the convention of
> > aliasing "std::shared_ptr" with "TypePtr" for some type.  I think in
> > the past we had decided not to use a convention like this but I can't find
> > the thread.  IMO, I think this generally makes the code less understandable
> > but this is a matter of taste.
>
> I agree this introduces ambiguity for casual readers of the code.  The
> question is whether the savings in typing are worth it.  Personally,
> I've become used to writing "shared_ptr" a lot :-)
>
> Regards
>
> Antoine.
>
>


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-21-0

2019-11-21 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-21-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0

Failed Tasks:
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-osx-clang-py37
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-homebrew-cpp
- test-conda-python-2.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-2.7-pandas-master
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-dask-master
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-conda-python-3.7
- test-ubuntu-14.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-ubuntu-14.04-cpp
- test-ubuntu-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-circle-test-ubuntu-fuzzit
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp35m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp27m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-travis-wheel-manylinux2010-cp37m

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-linux-gcc-py37
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-21-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  

[jira] [Created] (ARROW-7225) [C++] `*std::move(Result)` calls T copy constructor

2019-11-21 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7225:
-

 Summary: [C++] `*std::move(Result)` calls T copy constructor
 Key: ARROW-7225
 URL: https://issues.apache.org/jira/browse/ARROW-7225
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.1
Reporter: Antoine Pitrou
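
A minimal sketch of how this kind of behavior can arise, since the issue
description is empty here. It assumes a hypothetical wrapper whose operator*
has no rvalue overload, so dereferencing a moved wrapper binds to the const&
overload and the caller ends up copying. None of the names below
(LvalueOnlyResult, RefQualifiedResult, Tracked, Counts, CountDereference) come
from Arrow; they are made up for illustration and this is not Arrow's actual
Result class.

```cpp
#include <utility>

// Tallies copy and move constructions of the payload.
struct Counts {
  int copies = 0;
  int moves = 0;
};

// Payload that records how it was constructed.
struct Tracked {
  explicit Tracked(Counts* c) : counts(c) {}
  Tracked(const Tracked& other) : counts(other.counts) { ++counts->copies; }
  Tracked(Tracked&& other) noexcept : counts(other.counts) { ++counts->moves; }
  Counts* counts;
};

// Hypothetical wrapper (NOT Arrow's Result): no rvalue overload of operator*.
template <typename T>
class LvalueOnlyResult {
 public:
  explicit LvalueOnlyResult(T value) : value_(std::move(value)) {}
  T& operator*() & { return value_; }
  const T& operator*() const& { return value_; }

 private:
  T value_;
};

// Same wrapper plus an rvalue overload that hands the value out as T&&.
template <typename T>
class RefQualifiedResult {
 public:
  explicit RefQualifiedResult(T value) : value_(std::move(value)) {}
  T& operator*() & { return value_; }
  const T& operator*() const& { return value_; }
  T&& operator*() && { return std::move(value_); }

 private:
  T value_;
};

// Counts the constructions caused by `T out = *std::move(result)` alone.
template <template <typename> class ResultT>
Counts CountDereference() {
  Counts counts;
  ResultT<Tracked> result{Tracked(&counts)};
  counts = Counts{};  // discard constructions made while building `result`
  Tracked out = *std::move(result);
  (void)out;
  return counts;
}
```

With these definitions, CountDereference<LvalueOnlyResult>() reports one copy
and no moves, while CountDereference<RefQualifiedResult>() reports one move
and no copies. Whether the real issue has the same cause is an assumption.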



Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Antoine Pitrou
On Wed, 20 Nov 2019 20:50:12 -0800
Micah Kornfield  wrote:
> A recent PR for datasets [1] seems to have introduced the convention of
> aliasing "std::shared_ptr<Type>" with "TypePtr" for some type.  I think in
> the past we had decided not to use a convention like this, but I can't find
> the thread.  IMO, this generally makes the code less understandable,
> but this is a matter of taste.

I agree this introduces ambiguity for casual readers of the code.  The
question is whether the savings in typing are worth it.  Personally,
I've become used to writing "shared_ptr" a lot :-)
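
For readers who have not seen the convention, here is a small sketch of the
two spellings side by side. Fragment, FragmentPtr, MakeWithAlias and
MakeSpelledOut are hypothetical names chosen for illustration, not taken from
the PR in question.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical type, for illustration only.
struct Fragment {
  std::string path;
};

// The aliasing convention under discussion.
using FragmentPtr = std::shared_ptr<Fragment>;

// Signature written with the alias: shorter, but the ownership model
// (shared vs. unique vs. raw) is hidden behind the name.
FragmentPtr MakeWithAlias(std::string path) {
  return std::make_shared<Fragment>(Fragment{std::move(path)});
}

// Same signature spelled out: more typing, nothing to look up.
std::shared_ptr<Fragment> MakeSpelledOut(std::string path) {
  return std::make_shared<Fragment>(Fragment{std::move(path)});
}

// The alias introduces no new type; both spellings are interchangeable.
static_assert(std::is_same<FragmentPtr, std::shared_ptr<Fragment>>::value,
              "alias and spelled-out form name the same type");
```

Since both names denote the same type, the choice only affects how the code
reads, which is the trade-off described above.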

Regards

Antoine.




Re: MIME type

2019-11-21 Thread Sutou Kouhei
I found that Apache Thrift has registered the following MIME types:

  * application/vnd.apache.thrift.binary
  * application/vnd.apache.thrift.compact
  * application/vnd.apache.thrift.json

https://www.iana.org/assignments/media-types/media-types.xhtml

Thrift uses the "vnd.apache." prefix [1].

[1] https://tools.ietf.org/html/rfc6838
> Vendor-tree registrations will be distinguished by the leading facet
> "vnd.".  That may be followed, at the discretion of the registrant,
> by either a media subtype name from a well-known producer (e.g.,
> "vnd.mudpie") or by an IANA-approved designation of the producer's
> name that is followed by a media type or product designation (e.g.,
> vnd.bigcompany.funnypictures).
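
As a rough sketch, the prefix rule quoted above can be checked
programmatically. The function and enum names are hypothetical; the "prs."
and "x." facets are taken from RFC 6838 rather than from this thread, and the
code assumes the media type is already lower-case.

```cpp
#include <string>

// Registration trees distinguished by the leading facet of the subtype.
enum class RegistrationTree {
  kStandards,     // no facet, e.g. application/json
  kVendor,        // "vnd."
  kPersonal,      // "prs."
  kUnregistered,  // "x."
  kLegacyX,       // "x-" prefix, not a registration tree
  kInvalid        // not of the form type/subtype
};

// Hypothetical helper: classify a media type by the prefix of its subtype.
RegistrationTree ClassifyMediaType(const std::string& media_type) {
  const std::string::size_type slash = media_type.find('/');
  if (slash == std::string::npos || slash == 0 ||
      slash + 1 == media_type.size()) {
    return RegistrationTree::kInvalid;
  }
  const std::string subtype = media_type.substr(slash + 1);
  // rfind(prefix, 0) == 0 is true only when subtype begins with prefix.
  const auto starts_with = [&subtype](const char* prefix) {
    return subtype.rfind(prefix, 0) == 0;
  };
  if (starts_with("vnd.")) return RegistrationTree::kVendor;
  if (starts_with("prs.")) return RegistrationTree::kPersonal;
  if (starts_with("x.")) return RegistrationTree::kUnregistered;
  if (starts_with("x-")) return RegistrationTree::kLegacyX;
  return RegistrationTree::kStandards;
}
```

Under this sketch, application/vnd.apache.thrift.binary falls in the vendor
tree, application/x-apache-arrow-stream is flagged as a legacy "x-" name, and
application/apache-arrow-stream would have to go through the standards tree.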

vnd.apache.thrift.binary was registered on 2014-09-09:

https://www.iana.org/assignments/media-types/application/vnd.apache.thrift.binary

Should we register our MIME types with IANA?

It seems that Apache Thrift used application/x-thift (typo?)
before it registered these MIME types.

> The application/x-thift media type is currently used to describe multiple
> formats/protocols. Communications endpoints need to the format/protocol
> used, so this media type should be used preferentially when it is
> appropriate to do so.

In 
  "Re: MIME type" on Wed, 20 Nov 2019 12:01:54 +0100,
  Antoine Pitrou  wrote:

> 
> If it's not standardized, shouldn't it be prefixed with x-?
> 
> e.g. application/x-apache-arrow-stream
> 
> 
> On 20/11/2019 at 08:29, Micah Kornfield wrote:
>> I would propose:
>> application/apache-arrow-stream
>> application/apache-arrow-file
>> 
>> I'm not attached to those names but I think there should be two different
>> mime-types, since the formats are not interchangeable.
>> 
>> On Tue, Nov 19, 2019 at 10:31 PM Sutou Kouhei  wrote:
>> 
>>> Hi,
>>>
>>> What MIME type should be used for Apache Arrow data?
>>> application/arrow?
>>>
>>> Should we use the same MIME type for IPC Streaming Format[1]
>>> and IPC File Format[2]? Or should we use different MIME
>>> types for them?
>>>
>>> [1]
>>> https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
>>> [2] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>