[jira] [Created] (ARROW-9098) RecordBatch::ToStructArray cannot handle record batches with 0 column

2020-06-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9098:


 Summary: RecordBatch::ToStructArray cannot handle record batches 
with 0 column
 Key: ARROW-9098
 URL: https://issues.apache.org/jira/browse/ARROW-9098
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If RecordBatch::ToStructArray is called on a record batch with 0 columns, 
the following error is raised:

Invalid: Can't infer struct array length with 0 child arrays
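
A minimal plain-Python sketch of why the length cannot be inferred. It assumes
the struct length is normally taken from the first child array;
infer_struct_length and its length parameter are hypothetical names used for
illustration, not Arrow APIs.

```python
def infer_struct_length(children, length=None):
    """Model of struct-length inference (hypothetical helper, not an Arrow API).

    children: list of child columns, each a plain Python list.
    length:   optional explicit row count supplied by the caller.
    """
    if length is not None:
        # An explicit row count makes the 0-column case well defined.
        return length
    if not children:
        # With no children there is nothing to take the length from.
        raise ValueError("Can't infer struct array length with 0 child arrays")
    return len(children[0])


# A batch with one 3-row column: the length comes from the child.
one_column = infer_struct_length([[1, 2, 3]])

# A batch with 0 columns but a known row count of 5.
zero_columns = infer_struct_length([], length=5)
```

Under this model, passing the record batch's own row count explicitly would
avoid the error, since a record batch knows its number of rows even when it
has no columns.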



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9071) [C++] MakeArrayOfNull makes invalid ListArray

2020-06-08 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9071:


 Summary: [C++] MakeArrayOfNull makes invalid ListArray
 Key: ARROW-9071
 URL: https://issues.apache.org/jira/browse/ARROW-9071
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Zhuo Peng


One way to reproduce this bug is:

 

>>> a = pa.array([[1, 2]])

>>> b = pa.array([None, None], type=pa.null())

>>> t1 = pa.Table.from_arrays([a], ["a"])
>>> t2 = pa.Table.from_arrays([b], ["b"])

 

>>> pa.concat_tables([t1, t2], promote=True)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/table.pxi", line 2138, in pyarrow.lib.concat_tables
 File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1: Invalid: List child array 
invalid: Invalid: Buffer #1 too small in array of type int64 and length 2: 
expected at least 16 byte(s), got 12

(because concat_tables(promote=True) calls MakeArrayOfNull: 
https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/table.cc#L647)

 

The code here seems incorrect:

[https://github.com/apache/arrow/blob/ec3bae18157723411bb772fca628cbd02eea5c25/cpp/src/arrow/array/util.cc#L218]

The length of a ListArray's child array is not necessarily equal to the length 
of the ListArray itself.
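
To illustrate the point, here is a plain-Python model of the list layout
(offsets plus a child array). It is a sketch only: make_null_list_layout and
child_length_from_offsets are hypothetical helpers, and the dict stands in for
Arrow's buffers.

```python
def child_length_from_offsets(offsets):
    """Number of child elements referenced by a list array's offsets."""
    return offsets[-1] - offsets[0]


def make_null_list_layout(length):
    """Plain-Python model of an all-null list array (not an Arrow API).

    A list array of `length` entries needs `length + 1` offsets. When every
    entry is null, all offsets can be 0 and the child array can be empty.
    """
    offsets = [0] * (length + 1)
    return {
        "length": length,
        "validity": [False] * length,
        "offsets": offsets,
        "child_length": child_length_from_offsets(offsets),
    }


# [[1, 2]]: one list holding two values, so length 1 but child length 2.
one_list_offsets = [0, 2]
one_list_child_length = child_length_from_offsets(one_list_offsets)

# [None, None] as a list array: length 2 but child length 0.
null_layout = make_null_list_layout(2)
```

In both cases the child length differs from the list array's length, which is
why sizing the child from the parent length can produce an invalid array.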





[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)

2020-06-04 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9037:


 Summary: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported)
 Key: ARROW-9037
 URL: https://issues.apache.org/jira/browse/ARROW-9037
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If an Array is created with null_count == -1 but without any nulls (and thus no 
null bitmap buffer), then ArrayData.null_count remains -1 at export time if 
null_count was never computed. The exported C struct then also has 
null_count == -1 [1]. But when importing, if null_count != 0, an error [2] is 
raised.

[1] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560

[2] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
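
The mismatch can be modelled in a few lines of plain Python. This is a sketch
under assumptions: strict_import_check approximates the behavior described
above (the "no validity buffer" condition is inferred from the description),
and lenient_import_check is one hypothetical way to accept the exported value.
Neither function is an Arrow API.

```python
UNKNOWN_NULL_COUNT = -1  # sentinel meaning "not computed yet"


def strict_import_check(null_count, has_validity_buffer):
    """Assumed model of the reported import behavior."""
    if null_count != 0 and not has_validity_buffer:
        raise ValueError("non-zero null count without a validity buffer")
    return null_count


def lenient_import_check(null_count, has_validity_buffer):
    """Hypothetical alternative: resolve an unknown count when it is implied."""
    if null_count == UNKNOWN_NULL_COUNT and not has_validity_buffer:
        # No validity buffer means no nulls, so the count must be 0.
        return 0
    return strict_import_check(null_count, has_validity_buffer)


# An array exported with an uncomputed null count and no null bitmap.
exported_null_count = UNKNOWN_NULL_COUNT

try:
    strict_import_check(exported_null_count, has_validity_buffer=False)
    strict_result = "accepted"
except ValueError:
    strict_result = "rejected"

lenient_result = lenient_import_check(exported_null_count, has_validity_buffer=False)
```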

 





[jira] [Created] (ARROW-8277) [Python] RecordBatch interface improvements

2020-03-30 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-8277:


 Summary: [Python] RecordBatch interface improvements
 Key: ARROW-8277
 URL: https://issues.apache.org/jira/browse/ARROW-8277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently, __eq__ and __repr__ are not implemented for RecordBatch.

compute::Take also supports RecordBatch inputs, but there is no Python wrapper 
for it.





[jira] [Created] (ARROW-7806) [Python] {Array,Table,RecordBatch}.to_pandas() do not support Large variants of ListArray, BinaryArray and StringArray

2020-02-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7806:


 Summary: [Python] {Array,Table,RecordBatch}.to_pandas() do not 
support Large variants of ListArray, BinaryArray and StringArray
 Key: ARROW-7806
 URL: https://issues.apache.org/jira/browse/ARROW-7806
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


For example:

 

>>> a = pa.array([['a']], type=pa.list_(pa.large_binary()))
>>> a.to_pandas()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/array.pxi", line 468, in pyarrow.lib._PandasConvertible.to_pandas
 File "pyarrow/array.pxi", line 902, in pyarrow.lib.Array._to_pandas
 File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: 
large_binary





[jira] [Created] (ARROW-7802) [C++] Support for LargeBinary and LargeString in the hash kernel

2020-02-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7802:


 Summary: [C++] Support for LargeBinary and LargeString in the hash 
kernel
 Key: ARROW-7802
 URL: https://issues.apache.org/jira/browse/ARROW-7802
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Currently they are not supported:

https://github.com/apache/arrow/blob/a76e277213e166dbeb148260498995ba053566fb/cpp/src/arrow/compute/kernels/hash.cc#L456





[jira] [Created] (ARROW-7510) [C++] Array::null_count() is not thread-compatible

2020-01-07 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7510:


 Summary: [C++] Array::null_count() is not thread-compatible
 Key: ARROW-7510
 URL: https://issues.apache.org/jira/browse/ARROW-7510
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Zhuo Peng


ArrayData has a mutable member, null_count, that can be updated in a const 
function. However, null_count is not atomic, so it is subject to a data race.

I guess Arrays are not thread-safe (which is reasonable), but they should at 
least be thread-compatible, so that concurrent calls to const member functions 
are fine.

(The race looks "benign", but see [1][2])

[https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/cpp/src/arrow/array.cc#L123]

 

[1] https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

[2] https://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/
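
As a rough illustration of the pattern in question (a lazily computed, cached
count read from several threads), here is a Python sketch that guards the
cache with a lock. It only models the idea: LazyNullCount is a hypothetical
class, and a C++ fix could equally use an atomic member instead of a lock.

```python
import threading


class LazyNullCount:
    """Caches a null count on first use (hypothetical model, not Arrow code)."""

    def __init__(self, validity):
        self._validity = list(validity)
        self._null_count = -1  # -1 means "not computed yet"
        self._lock = threading.Lock()

    def null_count(self):
        # The read-check-write sequence is serialized, so concurrent callers
        # never observe or write a partially updated cache.
        with self._lock:
            if self._null_count < 0:
                self._null_count = sum(1 for v in self._validity if not v)
            return self._null_count


arr = LazyNullCount([True, False, True, False, False])
results = []


def worker():
    results.append(arr.null_count())


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread sees the same count, and the logically read-only accessor stays
safe to call concurrently.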





[jira] [Created] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-09 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7362:


 Summary: [Python] ListArray.flatten() should take care of slicing 
offsets
 Key: ARROW-7362
 URL: https://issues.apache.org/jira/browse/ARROW-7362
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Currently ListArray.flatten() simply returns the child array. If a ListArray is 
a slice of another ListArray, the two share the same child array. However, the 
expected behavior of flatten() (I think) is to return an Array that is the 
concatenation of all the sub-lists in the ListArray, so the slicing offset 
should be taken into account.

 

For example:

import pyarrow as pa

a = pa.array([[1], [2], [3]])

assert a.flatten().equals(pa.array([1, 2, 3]))

# expected:

assert a.slice(1).flatten().equals(pa.array([2, 3]))
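
The expected behavior can be sketched with plain lists standing in for the
offsets and child buffers. flatten_slice is a hypothetical helper for
illustration, not the pyarrow implementation.

```python
def flatten_slice(offsets, values, start, length):
    """Flatten the list entries [start, start + length) of a list array.

    offsets, values: the parent array's offsets and child values, shared by
                     every slice of that array.
    start, length:   the slice's offset into the parent and its length.
    """
    first = offsets[start]            # first child element used by the slice
    last = offsets[start + length]    # one past the last child element
    return values[first:last]


# Layout of [[1], [2], [3]].
offsets = [0, 1, 2, 3]
values = [1, 2, 3]

full = flatten_slice(offsets, values, start=0, length=3)
sliced = flatten_slice(offsets, values, start=1, length=2)  # models a.slice(1)
```

Returning the whole child array corresponds to ignoring start and length,
which is only correct for an unsliced array.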





[jira] [Created] (ARROW-7229) [C++] Unify ConcatenateTables APIs

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7229:


 Summary: [C++] Unify ConcatenateTables APIs
 Key: ARROW-7229
 URL: https://issues.apache.org/jira/browse/ARROW-7229
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhuo Peng
Assignee: Zhuo Peng


Today we have ConcatenateTables() and ConcatenateTablesWithPromotion() in C++. 
It's anticipated that they will allow more customization/tweaking. To avoid 
complicating the API surface, we should introduce a ConcatenateTableOption 
object, unify the two functions, and allow further customization to be 
expressed in that option object.

Related discussion: 
[https://lists.apache.org/thread.html/1fa85b078dae09639de04afcf948aad1bfabd48ea8a38e33969495c5@%3Cdev.arrow.apache.org%3E]

 





[jira] [Created] (ARROW-7228) [Python] Expose RecordBatch.FromStructArray in Python.

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7228:


 Summary: [Python] Expose RecordBatch.FromStructArray in Python.
 Key: ARROW-7228
 URL: https://issues.apache.org/jira/browse/ARROW-7228
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


This API was introduced in ARROW-6243. It will make converting a list of 
Python dicts to a RecordBatch easier:

 

struct_array = pa.array([{"column1": 1, "column2": 5}, {"column2": 6}])

record_batch = pa.RecordBatch.from_struct_array(struct_array)
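
For intuition, the row-to-column conversion this enables can be modelled in
plain Python. dicts_to_columns is a hypothetical helper; it assumes that a key
missing from a dict becomes a null (None) in that column.

```python
def dicts_to_columns(rows):
    """Turn a list of dicts (rows) into a dict of columns.

    Column order follows the first appearance of each key; a key missing
    from a row is filled with None.
    """
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in rows] for name in names}


rows = [{"column1": 1, "column2": 5}, {"column2": 6}]
columns = dicts_to_columns(rows)
```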





[jira] [Created] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-11-21 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-7227:


 Summary: [Python] Provide wrappers for ConcatenateWithPromotion()
 Key: ARROW-7227
 URL: https://issues.apache.org/jira/browse/ARROW-7227
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 1.0.0


[https://github.com/apache/arrow/pull/5534] introduced 
ConcatenateWithPromotion() to C++. We should provide a Python wrapper for it.

 





[jira] [Created] (ARROW-6848) [C++] Specify -std=c++11 instead of -std=gnu++11 when building

2019-10-10 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6848:


 Summary: [C++] Specify -std=c++11 instead of -std=gnu++11 when 
building
 Key: ARROW-6848
 URL: https://issues.apache.org/jira/browse/ARROW-6848
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zhuo Peng


Relevant discussion:

[https://lists.apache.org/thread.html/5807e65d865c1736b3a7a32653ca8bb405d719eb13b8a10b6fe0e904@%3Cdev.arrow.apache.org%3E]

In addition to

set(CMAKE_CXX_STANDARD 11)

we also need to

set(CMAKE_CXX_EXTENSIONS OFF)

in order to turn off compiler-specific extensions (with GCC, the default is 
-std=gnu++11).

 

This is supposed to be a no-op, because Arrow builds fine with other compilers 
(Clang-LLVM / MSVC), but I'm opening this bug to track any issues with flipping 
the switch.

 





[jira] [Created] (ARROW-6775) Proposal for several Array utility functions

2019-10-02 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-6775:


 Summary: Proposal for several Array utility functions
 Key: ARROW-6775
 URL: https://issues.apache.org/jira/browse/ARROW-6775
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Zhuo Peng


Hi,

We developed several utilities that compute / access certain properties of 
Arrays, and we wonder whether it makes sense to get them upstream (into both 
the C++ API and pyarrow) and, assuming yes, where the best place to put them is.

Maybe I have overlooked existing APIs that already do the same. In that case, 
please point them out.

 

1/ ListLengthFromListArray(ListArray&)

Returns the lengths of the lists in a ListArray, as an Int32Array (or an 
Int64Array for large lists). For example:

[[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned 
array can be converted to numpy)

 

2/ GetBinaryArrayTotalByteSize(BinaryArray&)

Returns the total byte size of a BinaryArray (basically offset[len - 1] - 
offset[0]).

Alternatively, a BinaryArray::Flatten() -> UInt8Array would work.

 

3/ GetArrayNullBitmapAsByteArray(Array&)

Returns the array's null bitmap as a UInt8Array (which can be efficiently 
converted to a bool numpy array)

 

4/ GetFlattenedArrayParentIndices(ListArray&)

Makes an int32 array of the same length as the flattened ListArray. 
returned_array[i] == j means that the i-th element in the flattened ListArray 
came from the j-th list in the ListArray.


For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]
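
The four proposed utilities can be sketched in plain Python over the
underlying layout (offsets, validity flags, and an LSB-first validity bitmap).
The function names below are hypothetical stand-ins for the proposals, the
inputs are ordinary lists rather than Arrow arrays, and null lists are
reported with length 0 (the first of the two options mentioned above).

```python
def list_lengths(offsets, validity):
    """1/ Length of each list; a null list is reported as 0."""
    return [
        offsets[i + 1] - offsets[i] if validity[i] else 0
        for i in range(len(validity))
    ]


def total_byte_size(offsets):
    """2/ Total bytes referenced by a binary array's offsets (last minus first)."""
    return offsets[-1] - offsets[0]


def bitmap_to_bytes(bitmap, length):
    """3/ Expand a validity bitmap (least-significant bit first) to one 0/1 per slot."""
    return [(bitmap[i // 8] >> (i % 8)) & 1 for i in range(length)]


def parent_indices(offsets):
    """4/ For each flattened element, the index of the list it came from."""
    out = []
    for j in range(len(offsets) - 1):
        out.extend([j] * (offsets[j + 1] - offsets[j]))
    return out


# Layout of [[1, 2, 3], [], None, [4, 5]].
offsets = [0, 3, 3, 3, 5]
validity = [True, True, False, True]
bitmap = [0b00001011]  # slots 0, 1 and 3 are valid

lengths = list_lengths(offsets, validity)
parents = parent_indices(offsets)
valid_bytes = bitmap_to_bytes(bitmap, 4)
binary_size = total_byte_size([0, 3, 3, 8])  # e.g. values b"abc", b"", b"defgh"
```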

 





[jira] [Created] (ARROW-5894) libgandiva.so.14 is exporting libstdc++ symbols

2019-07-09 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5894:


 Summary: libgandiva.so.14 is exporting libstdc++ symbols
 Key: ARROW-5894
 URL: https://issues.apache.org/jira/browse/ARROW-5894
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Affects Versions: 0.14.0
Reporter: Zhuo Peng


For example:

$ nm libgandiva.so.14 | grep "once_proxy"
018c0a10 T __once_proxy

 

Many other symbols are also exported that I guess shouldn't be (e.g. LLVM 
symbols).

 

There seems to be no linker script for libgandiva.so (there was one, but it was 
never used and got deleted? 
[https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5749) [Python] Add Python binding for Table::CombineChunks()

2019-06-26 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5749:


 Summary: [Python] Add Python binding for Table::CombineChunks()
 Key: ARROW-5749
 URL: https://issues.apache.org/jira/browse/ARROW-5749
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Zhuo Peng
Assignee: Zhuo Peng
 Fix For: 0.14.0








[jira] [Created] (ARROW-5635) Support "compacting" a table

2019-06-17 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5635:


 Summary: Support "compacting" a table
 Key: ARROW-5635
 URL: https://issues.apache.org/jira/browse/ARROW-5635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Zhuo Peng


A column in a table might consist of multiple chunks. I'm proposing a 
Table.Compact() method that returns a table in which each column has just one 
chunk, namely the concatenation of the corresponding column's chunks.
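
A minimal sketch of the proposed semantics, modelling a table as a dict that
maps column names to lists of chunks. compact is a hypothetical function used
only to illustrate the idea; it is not the proposed C++ signature.

```python
def compact(table):
    """Return a table whose columns each hold exactly one chunk.

    table: dict mapping column name -> list of chunks (each chunk a list).
    """
    return {
        name: [[value for chunk in chunks for value in chunk]]
        for name, chunks in table.items()
    }


table = {
    "a": [[1, 2], [3]],
    "b": [["x"], ["y", "z"]],
}
compacted = compact(table)
```

The input table is left unchanged, and every column of the result has a single
chunk holding the same values in the same order.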

 

 





[jira] [Created] (ARROW-5554) Add a python wrapper for arrow::Concatenate

2019-06-11 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5554:


 Summary: Add a python wrapper for arrow::Concatenate
 Key: ARROW-5554
 URL: https://issues.apache.org/jira/browse/ARROW-5554
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Zhuo Peng








[jira] [Created] (ARROW-5528) Concatenate() crashes when concatenating empty binary arrays.

2019-06-07 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5528:


 Summary: Concatenate() crashes when concatenating empty binary 
arrays.
 Key: ARROW-5528
 URL: https://issues.apache.org/jira/browse/ARROW-5528
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Zhuo Peng
 Fix For: 0.14.0


[https://github.com/brills/arrow/commit/42063bb5297f34d9b98e831264c47add2da68591]


