[jira] [Created] (ARROW-9098) RecordBatch::ToStructArray cannot handle record batches with 0 columns

2020-06-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9098:


 Summary: RecordBatch::ToStructArray cannot handle record batches 
with 0 columns
 Key: ARROW-9098
 URL: https://issues.apache.org/jira/browse/ARROW-9098
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If RecordBatch::ToStructArray is called on a record batch with 0 columns, the 
following error is raised:

Invalid: Can't infer struct array length with 0 child arrays
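A minimal Python sketch of the failure mode (illustrative only; ToStructArray 
itself is a C++ API, and this assumes RecordBatch.from_arrays accepts empty 
inputs):

import pyarrow as pa

# A StructArray normally infers its length from its children, so zero
# children leave the length undetermined:
s = pa.StructArray.from_arrays([pa.array([1, 2])], names=["a"])
print(len(s))  # 2 -- inferred from the child array

# A zero-column RecordBatch, by contrast, carries num_rows explicitly, so
# RecordBatch::ToStructArray could pass that length through instead of
# inferring it from the (empty) set of children:
batch = pa.RecordBatch.from_arrays([], names=[])
print(batch.num_columns, batch.num_rows)  # 0 0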





[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray

2020-06-08 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9071:


 Summary: [C++] MakeArrayOfNull makes invalid ListArray
 Key: ARROW-9071
 URL: https://issues.apache.org/jira/browse/ARROW-9071
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Zhuo Peng


One way to reproduce this bug is:

 

>>> a = pa.array([[1, 2]])
>>> b = pa.array([None, None], type=pa.null())
>>> t1 = pa.Table.from_arrays([a], ["a"])
>>> t2 = pa.Table.from_arrays([b], ["b"])
>>> pa.concat_tables([t1, t2], promote=True)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables
 File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array 
invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: 
expected at least 16 byte(s), got 12

(because concat_tables(promote=True) calls MakeArrayOfNull: 
https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647)

 

The code here seems incorrect:

https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218

the length of the child array of a ListArray may not equal the length of the 
ListArray.
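A concrete illustration of the length mismatch (plain pyarrow, using only the 
public ListArray.values accessor):

import pyarrow as pa

a = pa.array([[1, 2], [3], [4]])
print(len(a))         # 3 -- number of lists
print(len(a.values))  # 4 -- length of the child (values) array
# List boundaries come from the offsets buffer, not from the child length,
# so code that equates the two lengths produces invalid arrays.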





[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)

2020-06-04 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9037:


 Summary: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported)
 Key: ARROW-9037
 URL: https://issues.apache.org/jira/browse/ARROW-9037
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If an Array is created with null_count == -1 but without any nulls (and thus 
no null bitmap buffer), ArrayData.null_count will remain -1 at export time if 
the null count is never computed, and the exported C struct will also have 
null_count == -1 [1]. But when importing an array that has no null bitmap 
buffer, an error [2] is raised if null_count != 0.

[1] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560

[2] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
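A Python-level repro sketch via pyarrow's C data interface hooks (this leans 
on the private _export_to_c/_import_from_c methods and the pyarrow.cffi 
module as used in pyarrow's own tests, so treat the exact names as 
assumptions):

import struct
import pyarrow as pa
from pyarrow.cffi import ffi

# An int64 array with no validity bitmap; from_buffers leaves null_count
# at its default of -1 (unknown). We never touch arr.null_count, which
# would compute it and mask the bug.
values = pa.py_buffer(struct.pack("<3q", 1, 2, 3))
arr = pa.Array.from_buffers(pa.int64(), 3, [None, values])

c_array = ffi.new("struct ArrowArray*")
ptr = int(ffi.cast("uintptr_t", c_array))
arr._export_to_c(ptr)
print(c_array.null_count)                 # -1, as exported per [1]
pa.Array._import_from_c(ptr, pa.int64())  # raises ArrowInvalid per [2]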

 





[jira] [Created] (ARROW-8277) [Python] RecordBatch interface improvements

2020-03-30 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-8277:


 Summary: [Python] RecordBatch interface improvements
 Key: ARROW-8277
 URL: https://issues.apache.org/jira/browse/ARROW-8277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently, RecordBatch does not implement __eq__ or __repr__.

compute::Take also supports RecordBatch inputs, but there is no Python 
wrapper for it.
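A sketch of the desired surface (hypothetical; none of these methods exist 
yet, and the names are only illustrative):

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["a"])
batch == batch                        # __eq__ could delegate to RecordBatch.equals()
repr(batch)                           # could summarize schema and num_rows
taken = batch.take(pa.array([2, 0]))  # thin wrapper over compute::Take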





[jira] [Created] (ARROW-7806) [Python] {Array,Table,RecordBatch}.to_pandas() do not support Large variants of ListArray, BinaryArray and StringArray

2020-02-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7806:


 Summary: [Python] {Array,Table,RecordBatch}.to_pandas() do not 
support Large variants of ListArray, BinaryArray and StringArray
 Key: ARROW-7806
 URL: https://issues.apache.org/jira/browse/ARROW-7806
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


For example:

 

>>> a = pa.array([['a']], type=pa.list_(pa.large_binary()))
>>> a.to_pandas()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/array.pxi", line 468, in pyarrow.lib._PandasConvertible.to_pandas
 File "pyarrow/array.pxi", line 902, in pyarrow.lib.Array._to_pandas
 File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: 
large_binary
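For contrast, the 32-bit-offset counterpart converts fine, which narrows the 
gap to the Large (64-bit-offset) variants; a sketch:

import pyarrow as pa

pa.array([['a']], type=pa.list_(pa.binary())).to_pandas()  # works
# Swapping in large_list / large_binary / large_string for the corresponding
# 32-bit-offset type hits the "Not implemented" path shown above.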





[jira] [Created] (ARROW-7802) [C++] Support for LargeBinary and LargeString in the hash kernel

2020-02-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7802:


 Summary: [C++] Support for LargeBinary and LargeString in the hash 
kernel
 Key: ARROW-7802
 URL: https://issues.apache.org/jira/browse/ARROW-7802
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Currently they are not supported:

https://github.com/apache/arrow/blob/a76e277213e166dbeb148260498995ba053566fb/cpp/src/arrow/compute/kernels/hash.cc#L456
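The gap surfaces through the Python-exposed consumers of the hash kernel; a 
sketch of calls expected to fail today (assuming large_string / large_binary 
arrays can already be constructed from Python in this version):

import pyarrow as pa

pa.array(["a", "b", "a"], type=pa.large_string()).unique()
pa.array([b"x", b"y", b"x"], type=pa.large_binary()).dictionary_encode()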





[jira] [Created] (ARROW-7510) [C++] Array::null_count() is not thread-compatible

2020-01-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7510:


 Summary: [C++] Array::null_count() is not thread-compatible
 Key: ARROW-7510
 URL: https://issues.apache.org/jira/browse/ARROW-7510
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Zhuo Peng


ArrayData has a mutable member, null_count, which can be updated from a const 
member function. However, null_count is not atomic, so concurrent updates are 
a data race.

I guess Arrays are not meant to be thread-safe (which is reasonable), but at 
least they should be thread-compatible, so that concurrent access through 
const member functions is fine.

(The race looks "benign", but see [1][2])

https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/cpp/src/arrow/array.cc#L123

 

[1] https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

[2] https://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/





[jira] [Created] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7362:


 Summary: [Python] ListArray.flatten() should take care of slicing 
offsets
 Key: ARROW-7362
 URL: https://issues.apache.org/jira/browse/ARROW-7362
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently ListArray.flatten() simply returns the child array. If a ListArray 
is a slice of another ListArray, the two share the same child array; however, 
the expected behavior (I think) is for flatten() to return an Array that is 
the concatenation of all the sub-lists in the ListArray, so the slicing 
offset should be taken into account.

 

For example:

a = pa.array([[1], [2], [3]])
assert a.flatten().equals(pa.array([1, 2, 3]))
# expected:
assert a.slice(1).flatten().equals(pa.array([2, 3]))
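Concretely, the current behavior ignores the slice; a short sketch of today's 
output versus the expected one:

import pyarrow as pa

a = pa.array([[1], [2], [3]])
print(a.slice(1).flatten())  # today: [1, 2, 3] (the shared child array);
                             # expected: [2, 3]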





[jira] [Created] (ARROW-7229) [C++] Unify ConcatenateTables APIs

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7229:


 Summary: [C++] Unify ConcatenateTables APIs
 Key: ARROW-7229
 URL: https://issues.apache.org/jira/browse/ARROW-7229
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in C++, 
and it's anticipated that they will grow more customization/tweaking options. 
To avoid complicating the API surface, we should introduce a 
ConcatenateTableOption object, unify the two functions, and allow further 
customization to be expressed in that option object.

Related discussion: 
https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E
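For reference, the unified shape later surfaced in Python as a single 
function with a flag; a usage sketch (this reflects pyarrow after the 
ARROW-7227 wrapper landed, not the API at the time of writing):

import pyarrow as pa

t1 = pa.table({"a": [1, 2]})
t2 = pa.table({"b": [3]})
pa.concat_tables([t1, t1])                # requires identical schemas
pa.concat_tables([t1, t2], promote=True)  # unifies schemas, filling nulls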

 





[jira] [Created] (ARROW-7228) [Python] Expose RecordBatch.FromStructArray in Python.

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7228:


 Summary: [Python] Expose RecordBatch.FromStructArray in Python.
 Key: ARROW-7228
 URL: https://issues.apache.org/jira/browse/ARROW-7228
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


This API was introduced in ARROW-6243. It will make converting from a list of 
Python dicts to a RecordBatch easier:

struct_array = pa.array([{"column1": 1, "column2": 5}, {"column2": 6}])
record_batch = pa.RecordBatch.from_struct_array(struct_array)





[jira] [Created] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7227:


 Summary: [Python] Provide wrappers for ConcatenateWithPromotion()
 Key: ARROW-7227
 URL: https://issues.apache.org/jira/browse/ARROW-7227
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


https://github.com/apache/arrow/pull/5534 introduced 
ConcatenateTablesWithPromotion() in C++. Provide a Python wrapper for it.

 





ConcatenateTables APIs

2019-11-08 Thread Zhuo Peng
Hi,

https://github.com/apache/arrow/pull/5534 introduced 
ConcatenateTablesWithPromotion(). And there is already a ConcatenateTables() 
function which behaves differently (it requires the tables to have the same 
schema). Wes raised a concern in that PR [1] that we might end up having many 
ConcatenateTables*() variants, as there are various things that can be 
tweaked, and he suggested introducing a ConcatenateOptions so that there is 
only one ConcatenateTables() function.

While I'm on board with that idea, I wanted to double-check that there is a 
consensus that we should (as of today) merge ConcatenateTables() and 
ConcatenateTablesWithPromotion() and have an option to do promotion or not 
(in an earlier comment in the PR, @bkietz advised otherwise, but maybe at 
that point we didn't realize there were potentially many variants).

[1] https://github.com/apache/arrow/pull/5534#discussion_r343745573


Thanks,

Zhuo


[jira] [Created] (ARROW-6848) [C++] Specify -std=c++11 instead of -std=gnu++11 when building

2019-10-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6848:


 Summary: [C++] Specify -std=c++11 instead of -std=gnu++11 when 
building
 Key: ARROW-6848
 URL: https://issues.apache.org/jira/browse/ARROW-6848
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Relevant discussion:

https://lists.apache.org/thread.html/5807e65d865c1736b3a7a32653ca8bb405d719eb13b8a10b6fe0e904@%3Cdev.arrow.apache.org%3E

In addition to

set(CMAKE_CXX_STANDARD 11)

we also need

set(CMAKE_CXX_EXTENSIONS OFF)

in order to turn off compiler-specific extensions (with GCC, the default is 
-std=gnu++11).

This is supposed to be a no-op, because Arrow already builds fine with other 
compilers (Clang/LLVM, MSVC). But I'm opening this bug to track any issues 
with flipping the switch.

 





Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 19:43:04, Wes McKinney  wrote: 
> On Fri, Oct 4, 2019 at 12:45 PM Zhuo Peng  wrote:
> >
> >
> >
> > On 2019/10/04 17:05:00, Antoine Pitrou  wrote:
> > >
> > > Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > > >
> > > > backports are cool for internal use, but probably not so if a public 
> > > > API accepts it? (because you vendor the headers in (i.e. namespace, 
> > > > symbol names unchanged), they might clash with headers that a client 
> > > > uses).
> > >
> > > This is true unfortunately.
> > >
> > > >>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions 
> > > >>> does the library rely on?
> > > >>
> > > >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 
> > > >> added?
> > > > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > > >
> > > > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> > >
> > > Right, so this is a CMake decision.  I think we require only plain C++11
> > > (but we may enable additional features on some compilers, provided
> > > there's a fallback).
> > Extensions can be disabled through:
> > set(CMAKE_CXX_EXTENSIONS OFF)
> >
> > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html
> >
> > Is that something more desirable than the current state?
> 
> Yes, I think so, I don't think we need to be relying on GNU gcc
> extensions, but we should open a JIRA issue about disabling it in case
> some tests break because of something we didn't realize we were
> depending on.
Sounds good. I'll create one then.
> 
> As far as C++14/17 upgrading, it seems like it will be at least 2
> years before we could upgrade to C++17 given the state of compiler
> support across the spectrum. Using C++17 would mean requiring at least
> VS 2017 on Windows, since at least in the Python world I think
> everything is on VS 2015.
> 
> Are there ways we could create defines to switch between backports and
> STL things (like string_view, optional, etc.) so that developers using
> the Arrow library in a C++17 application can use the built-in types?
This is dangerous unless they build the Arrow library from source with C++17: 
if libarrow takes an arrow::string_view but the user gives it a 
std::string_view, it's UB.

If we are talking about allowing users to build Arrow with C++17 and 
transparently supporting the new STL types in the public APIs, the Abseil 
(absl) [1] library could be something to consider: 
absl::{string_view,optional,variant} become aliases for their std:: 
counterparts when compiled under C++17, e.g. [2].

And inline namespaces are used [3] to make sure different libraries can 
depend on different versions of absl.

[1] https://abseil.io/ 
[2] 
https://github.com/abseil/abseil-cpp/blob/25597bdfc148e91e27678ec30efa52f4fc8c164f/absl/strings/string_view.h#L38
[3] 
https://github.com/abseil/abseil-cpp/blob/aa844899c937bde5d2b24f276b59997e5b668bde/absl/strings/string_view.h#L38
> 
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> 


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 17:05:00, Antoine Pitrou  wrote: 
> 
> Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > 
> > backports are cool for internal use, but probably not so if a public API 
> > accepts it? (because you vendor the headers in (i.e. namespace, symbol 
> > names unchanged), they might clash with headers that a client uses).
> 
> This is true unfortunately.
> 
> >>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
> >>> the library rely on?
> >>
> >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
> > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > 
> > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> 
> Right, so this is a CMake decision.  I think we require only plain C++11
> (but we may enable additional features on some compilers, provided
> there's a fallback).
Extensions can be disabled through:
set(CMAKE_CXX_EXTENSIONS OFF)

https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html

Is that something more desirable than the current state? 
> 
> Regards
> 
> Antoine.
> 


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 16:53:59, Antoine Pitrou  wrote: 
> 
> Le 04/10/2019 à 18:05, Zhuo Peng a écrit :
> > Dear Arrow maintainers,
> > 
> > Sorry if this was raised before. I did search the mailing list but "C++" 
> > matched too many results..
> > 
> > With manylinux1 (GCC4.8) being sunset, both Conda and Pypa are providing a 
> > modern enough toolchain (Conda Forge - GCC7; Pypa manylinux2010 docker - 
> > devtoolset-8(GCC8)). And full C++17 support has been included in GCC7 [1]. 
> > I wonder what are the concerns of adopting a newer standard?
> >  
> > C++14 might not bring a whole lot of interesting features, but C++17 brings:
> > 
> > std::string_view
> > std::optional
> > std::variant (the newly added Result class is based on some form of variant 
> > implementation I suppose?)
> 
> We already have `string_view` and `variant` backports.  We could
> reasonably add a `optional` backport.
> 
backports are cool for internal use, but probably not so much if a public API 
accepts them? (Because the headers are vendored in, i.e. with namespace and 
symbol names unchanged, they might clash with the headers that a client uses.)

> > And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
> > the library rely on?
> 
> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33

https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> 
> Regards
> 
> Antoine.
> 


Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng
Dear Arrow maintainers,

Sorry if this was raised before. I did search the mailing list, but "C++" 
matched too many results.

With manylinux1 (GCC 4.8) being sunset, both Conda and PyPA are providing a 
modern enough toolchain (Conda Forge: GCC 7; the PyPA manylinux2010 docker 
image: devtoolset-8 (GCC 8)). And full C++17 support has been available since 
GCC 7 [1]. I wonder what the concerns are about adopting a newer standard?
 
C++14 might not bring a whole lot of interesting features, but C++17 brings:

std::string_view
std::optional
std::variant (the newly added Result class is based on some form of variant 
implementation, I suppose?)

and a lot of syntactic sugar (like emplace_back() returning a reference to 
the inserted element, so you can write 
RETURN_NOT_OK(CreateArray(my_array_sp_vector.emplace_back()))).

And btw, was -std=gnu++11 an intentional choice? What GNU extensions does the 
library rely on?

[1] https://gcc.gnu.org/projects/cxx-status.html



[jira] [Created] (ARROW-6775) Proposal for several Array utility functions

2019-10-02 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6775:


 Summary: Proposal for several Array utility functions
 Key: ARROW-6775
 URL: https://issues.apache.org/jira/browse/ARROW-6775
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Zhuo Peng


Hi,

We developed several utilities that compute / access certain properties of 
Arrays, and we wonder whether it makes sense to get them into upstream (both 
the C++ API and pyarrow) and, assuming yes, where the best place to put them 
would be.

Maybe I have overlooked existing APIs that already do the same; in that case, 
please point them out.

 

1/ ListLengthFromListArray(ListArray&)

Returns the lengths of the lists in a ListArray, as an Int32Array (or an 
Int64Array for large lists). For example:

[[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned 
array can be converted to numpy)

 

2/ GetBinaryArrayTotalByteSize(BinaryArray&)

Returns the total byte size of a BinaryArray (basically offsets[length] - 
offsets[0]).

Alternatively, a BinaryArray::Flatten() -> UInt8Array would work.

 

3/ GetArrayNullBitmapAsByteArray(Array&)

Returns the array's null bitmap as a UInt8Array (which can be efficiently 
converted to a bool numpy array)

 

4/ GetFlattenedArrayParentIndices(ListArray&)

Makes an int32 array of the same length as the flattened ListArray. 
returned_array[i] == j means the i-th element in the flattened ListArray came 
from the j-th list in the ListArray.

For example: [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]
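For 1/ and 4/, here is a minimal sketch of how they can be built on the 
offsets buffer (the function names are mine, and it assumes a ListArray that 
is not a slice, with offsets exposed via pyarrow's ListArray.offsets):

import numpy as np
import pyarrow as pa

def list_lengths(list_array):
    # lengths[i] = offsets[i + 1] - offsets[i]; a null list repeats its
    # predecessor's offset, so its length comes out as 0.
    offsets = np.asarray(list_array.offsets)
    return pa.array(np.diff(offsets))

def flattened_parent_indices(list_array):
    # repeat each list index by that list's length
    lengths = np.diff(np.asarray(list_array.offsets))
    return pa.array(np.repeat(np.arange(len(list_array)), lengths))

a = pa.array([[1, 2, 3], [], None, [4, 5]])
print(list_lengths(a))              # [3, 0, 0, 2]
print(flattened_parent_indices(a))  # [0, 0, 0, 3, 3]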

 





Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:56 Antoine Pitrou  wrote:

>
> Le 19/09/2019 à 19:52, Zhuo Peng a écrit :
> >
> > The problems are only potential and theoretical, and won't bite anyone
> > until it occurs though, and it's more likely to happen with pip/wheel
> than
> > with conda.
> >
> > But anyways, this idea is still nice. I could imagine at least in arrow's
> > Python-C-API, there would be a
> >
> > PyObject* pyarrow_array_from_c_protocol(ArrowArray*);
> >
> > this way the C++ APIs can be avoided while still allowing arrays to be
> > created in C/C++ and used in python.
>
> Adding a Python C API function is a nice idea.
> However, I *still* don't understand how it will solve your problem.  The
> Cython modules comprising PyArrow will still call the C++ APIs, with the
> ABI problems that entails.

Those calls are internal to libarrow.so and libarrow_python.so, which always
agree on the ABI.

It’s different from the client library having to create an arrow::Array,
which may contain, say, a std::vector from gcc5, and then pass it to an
Arrow C++ API exposed by a libarrow.so whose definition of std::vector
comes from gcc7.

>
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:18 AM Antoine Pitrou  wrote:

>
> No, the plan for this proposal is to avoid providing a C API.  Each
> Arrow implementation could produce and consume the C data protocol, for
> example the C++ Array class could add these methods:
>
> class Array {
>   // ...
>
>  public:
>   // Export array to the C data protocol
>   void Share(ArrowArray* out);
>   // Import a C data protocol array
>   static Status FromShared(ArrowArray* input,
>std::shared_ptr* out);
> };
>
> Also, I don't know why a C API exposed by the C++ library would solve
> your problem.  You would still have a problem with bundling the .so,
> symbol conflicts if several libraries load libarrow.so, etc.

The problem is mainly about C++ not being able to provide a stable ABI for
templates (thus the STL). If the Arrow C++ library's public headers contain
templates or definitions from the STL, the only way to guarantee safety is to
force the client library to use the same toolchain and the same flags with
which the Arrow DSO was built. (Yes, distribution methods like Conda help
mitigate that issue by enforcing an almost uniform toolchain, but problems
can still occur if, say, a client is built with -std=c++17 while libarrow.so
is built with -std=gnu++11; example at [1].)

The problems are only potential and theoretical, and won't bite anyone until
they occur, and they're more likely to happen with pip/wheel than with conda.

But anyway, this idea is still nice. I could imagine that, at least in
Arrow's Python C API, there would be a

PyObject* pyarrow_array_from_c_protocol(ArrowArray*);

this way the C++ APIs can be avoided while still allowing arrays to be
created in C/C++ and used in Python.

[1] https://github.com/tensorflow/tensorflow/issues/23561

> Regards
>
> Antoine.
>
>
> Le 19/09/2019 à 18:21, Zhuo Peng a écrit :
> > Hi Antoine,
> >
> > I'm also interested in a stable ABI (previously I posted on this mailing
> > list about the ABI issues I had [1]). Does having such an ABI-stable
> > C-struct imply that there will be a set of C-APIs exposed by the Arrow
> > (C++) library (which I think would lead to a solution to all the inherent
> > ABI issues caused by C++)?
> >
> > [1]
> >
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> >
> > On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou wrote:
> >
> >>
> >> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> >>> I like the idea of a stable ABI for in-processing that can be used for
> >>> in-process communication.  For instance, there was a recent question on
> >>> stack-overflow on how to solve this [1].
> >>>
> >>> A couple of thoughts/questions:
> >>> * Would ArrowArray also need a self reference for children arrays?
> >>
> >> Yes, I forgot that.  I also think we don't need a separate Buffer
> >> struct, instead the Array struct should own all its buffers.
> >>
> >>> * Should transferring key-value metadata be in scope?
> >>
> >> Yes.  It could either be in the format string or a separate string.  The
> >> upside of a separate string is that a consumer may ignore it trivially
> >> if it doesn't need the information.
> >>
> >> Another open question is for nested types: does the format string
> >> represent the entire type including children?  Or must child types be
> >> read in the child arrays?  If we mimic ArrayData, then the format
> >> string should represent the entire type; it will then be more complex to
> >> parse.
> >>
> >> We should also make sure that extension types fit in the protocol.
> >>
> >>> * Should the API more closely align the IPC spec (pass a schema
> >> separately
> >>> and list of buffers instead of individual arrays)?
> >>
> >> Then you have something that's not immediately usable (you have to do some
> >> processing to reconstitute the individual arrays).  One goal here is to
> >> minimize implementation costs for producers and consumers.  The
> >> assumption is a data model similar to the C++ ArrayData model; do we
> >> have implementations that use an entirely different model?  Perhaps I
> >> should take a look :-)
> >>
> >> Note that the draft I posted only concerns arrays.  We may also want to
> >> have a C struct for batches or tables.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Thanks,
> >>> Micah
> >>>
>

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
Hi Antoine,

I'm also interested in a stable ABI (previously I posted on this mailing
list about the ABI issues I had [1]). Does having such an ABI-stable
C-struct imply that there will be a set of C-APIs exposed by the Arrow
(C++) library (which I think would lead to a solution to all the inherent
ABI issues caused by C++)?

[1]
https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E

On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:

>
> Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > I like the idea of a stable ABI for in-processing that can be used for
> > in-process communication.  For instance, there was a recent question on
> > stack-overflow on how to solve this [1].
> >
> > A couple of thoughts/questions:
> > * Would ArrowArray also need a self reference for children arrays?
>
> Yes, I forgot that.  I also think we don't need a separate Buffer
> struct, instead the Array struct should own all its buffers.
>
> > * Should transferring key-value metadata be in scope?
>
> Yes.  It could either be in the format string or a separate string.  The
> upside of a separate string is that a consumer may ignore it trivially
> if it doesn't need the information.
>
> Another open question is for nested types: does the format string
> represent the entire type including children?  Or must child types be
> read in the child arrays?  If we mimic ArrayData, then the format
> string should represent the entire type; it will then be more complex to
> parse.
>
> We should also make sure that extension types fit in the protocol.
>
> > * Should the API more closely align the IPC spec (pass a schema
> separately
> > and list of buffers instead of individual arrays)?
>
> Then you have something that's not immediately usable (you have to do some
> processing to reconstitute the individual arrays).  One goal here is to
> minimize implementation costs for producers and consumers.  The
> assumption is a data model similar to the C++ ArrayData model; do we
> have implementations that use an entirely different model?  Perhaps I
> should take a look :-)
>
> Note that the draft I posted only concerns arrays.  We may also want to
> have a C struct for batches or tables.
>
> Regards
>
> Antoine.
>
>
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >
> > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou wrote:
> >
> >>
> >> Hello,
> >>
> >> One thing that was discussed in the sync call is the ability to easily
> >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >> libraries in the same process, without bearing the cost of linking to
> >> e.g. the C++ Arrow library.
> >>
> >> (for example: "Duckdb wants to provide an option to return Arrow data of
> >> result sets, but they don't like having Arrow as a dependency")
> >>
> >> One possibility would be to define a C-level protocol similar in spirit
> >> to the Python buffer protocol, which some people may be familiar with
> (*).
> >>
> >> The basic idea is to define a simple C struct, which is ABI-stable and
> >> describes an Arrow array adequately.  The struct can be stack-allocated.
> >> Its definition can also be copied in another project (or interfaced with
> >> using a C FFI layer, depending on the language).
> >>
> >> There is no formal proposal, this message is meant to stir the
> discussion.
> >>
> >> Issues to work out:
> >>
> >> * Memory lifetime issues: where Python simply associates the Py_buffer
> >> with a PyObject owner (a garbage-collected Python object), we need
> >> another means to control lifetime of pointed areas.  One simple
> >> possibility is to include a destructor function pointer in the protocol
> >> struct.
> >>
> >> * Arrow type representation.  We probably need some kind of "format"
> >> mini-language to represent Arrow types, so that a type can be described
> >> using a `const char*`.  Ideally, primitives types at least should be
> >> trivially parsable.  We may take inspiration from Python here (`struct`
> >> module format characters, PEP 3118 format additions).
> >>
> >> Example C struct definition (not a formal proposal!):
> >>
> >> struct ArrowBuffer {
> >>   void* data;
> >>   int64_t nbytes;
> >>   // Called by the consumer when it doesn't need the buffer anymore
> >>   void (*release)(struct ArrowBuffer*);
> >>   // Opaque user data (for e.g. the release callback)
> >>   void* user_data;
> >> };
> >>
> >> struct ArrowArray {
> >>   // Type description
> >>   const char* format;
> >>   // Data description
> >>   int64_t length;
> >>   int64_t null_count;
> >>   int64_t n_buffers;
> >>   // Note: these pointers are probably owned by the ArrowArray struct
> >>   // and will be released and free()ed by the release callback.
> >>   struct BufferDescriptor* buffers;
> >>   struct ArrowDescriptor* dictionary;
> >>   // Called by 

Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-17 Thread Zhuo Peng
Hi Krisztián,

Sorry if it's too late, but is it possible to also include
https://github.com/apache/arrow/pull/4883 in the release? This would help
resolve https://github.com/apache/arrow/issues/4472 .

Thanks,

Zhuo

On Wed, Jul 17, 2019 at 3:00 AM Antoine Pitrou  wrote:

>
> +1 (binding).
>
> Tested on Ubuntu 18.04.2 (x86-64) with CUDA enabled:
>
> - binaries verification worked fine
> - source verification worked until the npm step, which failed (I don't
> have npm installed)
>
> Regards
>
> Antoine.
>
>
> Le 17/07/2019 à 04:54, Krisztián Szűcs a écrit :
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 0.14.1. This is a patch release consisting of 47 resolved
> > JIRA issues[1].
> >
> > This release candidate is based on commit:
> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 0.14.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> >
> > [1]:
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> > [2]:
> >
> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> > [8]:
> >
> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> > [9]:
> >
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
>


[jira] [Created] (ARROW-5894) libgandiva.so.14 is exporting libstdc++ symbols

2019-07-09 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5894:


 Summary: libgandiva.so.14 is exporting libstdc++ symbols
 Key: ARROW-5894
 URL: https://issues.apache.org/jira/browse/ARROW-5894
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Affects Versions: 0.14.0
Reporter: Zhuo Peng


For example:

$ nm libgandiva.so.14 | grep "once_proxy"
018c0a10 T __once_proxy

 

Many other symbols are also exported that I guess shouldn't be (e.g. LLVM 
symbols).

 

There seems to be no linker script for libgandiva.so (there was one, but it 
was never used and got deleted? 
https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map).

 





Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

2019-06-28 Thread Zhuo Peng
Thanks everyone. I think there are two issues being discussed here and I'd
like to keep them separate:

1. the ABI compatibility of Arrow's pip binary release.
It's true that there is no ABI standard and the topic is messy, but
as Antoine pointed out:

> If you'd like to benefit from the PyArrow binary
> packages, including the C++ API, then you need to use the same toolchain
> (or an ABI-compatible toolchain, but I'm afraid there's no clear
> specification of ABI compatibility in g++ / libstdc++ land).

we should be safe. And I think manylinux (which says everyone should use
GCC/libstdc++ and should not use a GNU ABI version newer than X) together
with the GNU ABI Policy and Guidelines [1] (which say that binaries with
equivalent DT_SONAMEs are forward-compatible, and IIUC the SONAME has been
libstdc++.so.6 for quite a while, since GCC 3.4) give us that guarantee.

2. the ODR (one definition rule) violation caused by template classes,
specifically STL classes.

Strictly speaking, this is not about ABI compatibility, and sticking to
manylinux does not prevent this problem. The problem arises essentially
because the STL headers shipped with GCC change over versions: there's no
guarantee that those STL classes will keep the same layout forever, and the
layout has changed without notice (see the example in my original post).

Again, note that manylinux does not specify which toolchain everyone should
use. It merely specifies the maximum version of those fundamental
libraries. And with manylinux2010, people might have more choices in
compiler versions. For example, devtoolset-6 and devtoolset-7 both qualify.

I guess I was asking for a policy or guideline regarding how to correctly
build things that depend on Arrow's pip release. Even if the guideline says
"you need to build your library in this docker image", it's still an
improvement over the current situation. It might greatly limit developers'
choices, though, if they also want to depend on some other library, or if
they want to use a newer / older GCC version.

Or maybe we could disallow STL classes in arrow's public headers. This
might not be feasible, because std::shared_ptr and std::vector are used
everywhere.

Or maybe we only allow some "safe" STL classes in the public headers. But
there is no guarantee that they are safe; it's purely empirical.

On Thu, Jun 20, 2019 at 3:47 PM Zhuo Peng  wrote:

> Dear Arrow maintainers,
>
> I work on several TFX (TensorFlow eXtended) [1] projects (e.g. TensorFlow
> Data Validation [2]) and am trying to use Arrow in them. These projects are
> mostly written in Python but has C++ code as Python extension modules,
> therefore we use both Arrow’s C++ and Python APIs. Our projects are
> distributed through PyPI as binary packages.
>
> The python extension modules are compiled with the headers shipped within
> pyarrow PyPI binary package and are linked with libarrow.so and
> libarrow_python.so in the same package. So far we’ve seen two major
> problems:
>
> * There are STL container definitions in public headers.
>
> It causes problems because the binary for template classes is generated at
> compilation time. And the definition of those template classes might differ
> from compiler to compiler. This might happen if we use a different GCC
> version than the one that compiled pyarrow (for example, the layout of
> std::unordered_map<> has changed in GCC 5.2 [3], and arrow::Schema used to
> contain an std::unordered_map<> member [4].)
>
> One might argue that everyone releasing manylinux1 packages should use
> exactly the same compiler, as provided by the pypa docker image, however
> the standard only specifies the maximum versions of corresponding
> fundamental libraries [5]. Newer GCC versions could be backported to work
> with older libraries [6].
>
> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> members in publicly accessible class declarations and will resolve our
> immediate problem, but I wonder if there is, or should be, an explicit
> policy on ABI compatibility, especially regarding the usage of template
> functions / classes in public interfaces?
>
> * Our wheel cannot pass “auditwheel repair”
>
> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> our wheel and have the user’s Python load both our libarrow.so and pyarrow’s,
> but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> auditwheel to do so, it refuses to stamp our wheel because it has
> “external” dependencies.
>
> This seems not an Arrow problem, but I wonder if others in the community
> have had to deal with similar issues and what the resolution is. Our
> current workaround is to manually stamp the wheel.
>
>
> Thanks,
> Zhuo
>
>
> References:
>
> [1] https://github.com/tensorflow/tfx
> [2] http

[jira] [Created] (ARROW-5749) [Python] Add Python binding for Table::CombineChunks()

2019-06-26 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5749:


 Summary: [Python] Add Python binding for Table::CombineChunks()
 Key: ARROW-5749
 URL: https://issues.apache.org/jira/browse/ARROW-5749
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 0.14.0








[jira] [Created] (ARROW-5635) Support "compacting" a table

2019-06-17 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5635:


 Summary: Support "compacting" a table
 Key: ARROW-5635
 URL: https://issues.apache.org/jira/browse/ARROW-5635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Zhuo Peng


A column in a table might consist of multiple chunks. I'm proposing a 
Table.Compact() method that returns a table in which each column has just one 
chunk: the concatenation of the corresponding column's chunks.
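For illustration: the C++ side of this proposal landed as 
Table::CombineChunks() (see ARROW-5749 for the Python binding); a usage 
sketch against current pyarrow, where it is exposed as Table.combine_chunks():

import pyarrow as pa

t = pa.concat_tables([pa.table({"a": [1, 2]}), pa.table({"a": [3]})])
print(t.column("a").num_chunks)                   # 2
print(t.combine_chunks().column("a").num_chunks)  # 1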

 

 





[jira] [Created] (ARROW-5554) Add a python wrapper for arrow::Concatenate

2019-06-11 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5554:


 Summary: Add a python wrapper for arrow::Concatenate
 Key: ARROW-5554
 URL: https://issues.apache.org/jira/browse/ARROW-5554
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Zhuo Peng








[jira] [Created] (ARROW-5528) Concatenate() crashes when concatenating empty binary arrays.

2019-06-07 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5528:


 Summary: Concatenate() crashes when concatenating empty binary 
arrays.
 Key: ARROW-5528
 URL: https://issues.apache.org/jira/browse/ARROW-5528
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Zhuo Peng
 Fix For: 0.14.0


https://github.com/brills/arrow/commit/42063bb5297f34d9b98e831264c47add2da68591


