Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-11 Thread Ying Zhou
Congrats Weston!

> On Jul 9, 2021, at 8:47 AM, Wes McKinney  wrote:
> 
> On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
> 
> Wes



[Python] ascii_trim bug & documentation

2021-06-30 Thread Ying Zhou
Hi,

It seems that pyarrow.compute.ascii_trim cannot be used without a TrimOptions. 
However, a TrimOptions cannot be given as a keyword-only argument either. This 
looks like a bug, since utf8_trim does not have this problem. Is my 
understanding correct?
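
For reference, a minimal C++ sketch of the analogous call through the generic 
compute entry point, assuming TrimOptions and CallFunction as declared in 
arrow/compute/api.h (the Python bindings wrap the same kernels):

#include "arrow/compute/api.h"

// Hedged sketch: both "ascii_trim" and "utf8_trim" take a TrimOptions giving
// the set of characters to strip; calling them without options should fail.
arrow::Result<arrow::Datum> TrimBoth(const arrow::Datum& input) {
  arrow::compute::TrimOptions options(" \t\r\n");
  return arrow::compute::CallFunction("ascii_trim", {input}, &options);
}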

Also, it seems that there is a lot of Python documentation we need to write, 
especially in compute, ORC and CSV (writer). I would like to do my part. May I 
ask whether the C++ and Python API documentation on the site is regularly 
generated from the docstrings and Doxygen comment blocks in released code? 
Thanks!

Re: [C++] Maximum type code for union types

2021-06-23 Thread Ying Zhou
Thanks! Issue filed. I will work on it.

https://issues.apache.org/jira/browse/ARROW-13154

> On Jun 20, 2021, at 5:26 PM, Wes McKinney  wrote:
> 
> UnionType::kMaxTypeCode is 127, so we intend to have codes from 0 to
> 127. If there is code preventing things from going up to and including
> 127 it's a bug.
> 
> On Sun, Jun 20, 2021 at 3:06 PM Ying Zhou  wrote:
>> 
>> Moreover, it seems that negative type_codes are banned due to type.cc:622. In 
>> type_test.cc and array_union_test.cc, type_codes are always 
>> nonnegative. However, maybe negative type_codes should be allowed, since 
>> type_codes are of type int8_t. Is this also intended?
>> 
>>> On Jun 20, 2021, at 4:01 PM, Ying Zhou  wrote:
>>> 
>>> Hi,
>>> 
>>> Due to the following in builder_union.cc (lines 67-70)
>>> 
>>>   type_id_to_children_.resize(union_type.max_type_code() + 1, nullptr);
>>>   DCHECK_LT(
>>>       type_id_to_children_.size(),
>>>       static_cast<decltype(type_id_to_children_)::size_type>(UnionType::kMaxTypeCode));
>>> 
>>> and type.cc (lines 640-644)
>>> 
>>> uint8_t UnionType::max_type_code() const {
>>>   return type_codes_.size() == 0
>>>              ? 0
>>>              : *std::max_element(type_codes_.begin(), type_codes_.end());
>>> }
>>> 
>>> in practice type codes of a union type must always be less than or equal to 
>>> 125. Is this intended behavior?
>>> 
>> 
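
For reference, a minimal sketch of constructing a union type whose codes reach 
kMaxTypeCode, assuming the dense_union factory in arrow/api.h; per Wes's reply 
above, codes from 0 up to and including 127 are intended to be valid:

#include "arrow/api.h"

// Hedged sketch: type codes up to UnionType::kMaxTypeCode (127) are meant to
// be legal, so constructing this type should succeed; the 125 ceiling only
// shows up when the builder's DCHECK fires.
std::shared_ptr<arrow::DataType> MakeUnionWithMaxCode() {
  return arrow::dense_union(
      {arrow::field("ints", arrow::int32()), arrow::field("strs", arrow::utf8())},
      /*type_codes=*/{0, 127});
}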



[Format] Bounded numbers?

2021-06-21 Thread Ying Zhou
Hi,

The data people use often contains bounded numbers: mostly integers with clear, 
fixed upper and lower bounds, but also decimals and floats, e.g. test 
scores, numerous codes in older databases, the maximum temperature of a city, 
latitudes, longitudes, numerous IDs, etc. I wonder whether we should include 
such types in Arrow (and, more importantly, in Parquet and Avro, where size 
matters a lot more).

P.S. An implementation of bounded integers in C++ is here: 
https://github.com/davidstone/bounded-integer
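
For readers unfamiliar with the linked library, a rough illustrative sketch of 
the idea (illustration only, neither an Arrow proposal nor the library's actual API):

#include <cstdint>
#include <stdexcept>

// Illustration only: a bounded integer carries its inclusive bounds in the
// type, so a container format could pick the narrowest physical storage.
template <int64_t Min, int64_t Max>
class BoundedInt {
 public:
  explicit BoundedInt(int64_t value) : value_(value) {
    if (value < Min || value > Max) throw std::out_of_range("out of bounds");
  }
  int64_t value() const { return value_; }

 private:
  int64_t value_;  // a real implementation would narrow this based on Min/Max
};

using Latitude = BoundedInt<-90, 90>;  // whole-degree latitudes, for example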

[C++] Maximum type code for union types

2021-06-20 Thread Ying Zhou
Hi,

Due to the following in builder_union.cc (lines 67-70)

  type_id_to_children_.resize(union_type.max_type_code() + 1, nullptr);
  DCHECK_LT(
      type_id_to_children_.size(),
      static_cast<decltype(type_id_to_children_)::size_type>(UnionType::kMaxTypeCode));

and type.cc (lines 640-644)

uint8_t UnionType::max_type_code() const {
  return type_codes_.size() == 0
             ? 0
             : *std::max_element(type_codes_.begin(), type_codes_.end());
}

in practice type codes of a union type must always be less than or equal to 125. 
Is this intended behavior?



Re: [C++] Maximum type code for union types

2021-06-20 Thread Ying Zhou
Moreover, it seems that negative type_codes are banned due to type.cc:622. In 
type_test.cc and array_union_test.cc, type_codes are always 
nonnegative. However, maybe negative type_codes should be allowed, since 
type_codes are of type int8_t. Is this also intended?

> On Jun 20, 2021, at 4:01 PM, Ying Zhou  wrote:
> 
> Hi,
> 
> Due to the following in builder_union.cc (lines 67-70)
> 
>   type_id_to_children_.resize(union_type.max_type_code() + 1, nullptr);
>   DCHECK_LT(
>       type_id_to_children_.size(),
>       static_cast<decltype(type_id_to_children_)::size_type>(UnionType::kMaxTypeCode));
> 
> and type.cc (lines 640-644)
> 
> uint8_t UnionType::max_type_code() const {
>   return type_codes_.size() == 0
>              ? 0
>              : *std::max_element(type_codes_.begin(), type_codes_.end());
> }
> 
> in practice type codes of a union type must always be less than or equal to 
> 125. Is this intended behavior?
> 



Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-04 Thread Ying Zhou


Congrats Dominik!

> On Jun 2, 2021, at 5:19 PM, Wes McKinney  wrote:
> 
> On behalf of the Arrow PMC, I'm happy to announce that Dominik has accepted an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
> 
> Wes



Re: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman

2021-05-07 Thread Ying Zhou
Congrats Ben!

> On May 7, 2021, at 5:50 PM, Benjamin Kietzman  wrote:
> 
> Thanks, all!
> 
> On Thu, May 6, 2021, 22:23 Fan Liya  wrote:
> 
>> Congratulations, Ben!
>> 
>> Best,
>> Liya Fan
>> 
>> On Fri, May 7, 2021 at 4:23 AM Bryan Cutler  wrote:
>> 
>>> Congrats Ben!
>>> 
>>> On Thu, May 6, 2021 at 12:05 PM Antoine Pitrou 
>> wrote:
>>> 
 
 Congratulations Ben :-)
 
 
 Le 06/05/2021 à 21:02, Rok Mihevc a écrit :
> Congrats!
> 
> On Thu, May 6, 2021 at 10:49 AM Krisztián Szűcs <
 szucs.kriszt...@gmail.com>
> wrote:
> 
>> Congrats Ben!
>> 
>> On Thu, May 6, 2021 at 9:20 AM Joris Van den Bossche
>>  wrote:
>>> 
>>> Congrats!
>>> 
>>> On Thu, 6 May 2021 at 07:03, Weston Pace 
 wrote:
>>> 
 Congratulations Ben!
 
 On Wed, May 5, 2021 at 6:48 PM Micah Kornfield <
>>> emkornfi...@gmail.com
> 
 wrote:
 
> Congrats!
> 
> On Wed, May 5, 2021 at 4:33 PM David Li 
>>> wrote:
> 
>> Congrats Ben! Well deserved.
>> 
>> Best,
>> David
>> 
>> On Wed, May 5, 2021, at 19:22, Neal Richardson wrote:
>>> Congrats Ben!
>>> 
>>> Neal
>>> 
>>> On Wed, May 5, 2021 at 4:16 PM Eduardo Ponce <
>> edponc...@gmail.com
>> > wrote:
>>> 
 Great news! Congratulations Ben.
 
 ~Eduardo
 
 
 From: Wes McKinney <wesmckinn@gmail.com>
 Sent: Wednesday, May 5, 2021, 7:10 PM
 To: dev
 Subject: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman
 
 The Project Management Committee (PMC) for Apache Arrow has
>> invited
 Benjamin Kietzman to become a PMC member and we are pleased to
> announce
 that Benjamin has accepted.
 
 Congratulations and welcome!
 
 
>>> 
>> 
> 
 
>> 
> 
 
>>> 
>> 



Re: [ANNOUNCE] New Arrow committer: Jonathan Keane

2021-04-29 Thread Ying Zhou
Congrats Jonathan!

> On Apr 28, 2021, at 5:20 PM, David Li  wrote:
> 
> Congrats Jonathan!
> 
> -David
> 
> On Wed, Apr 28, 2021, at 16:55, Jorge Cardoso Leitão wrote:
>> Congratulations and thank you for your contributions :)
>> 
>> On Wed, Apr 28, 2021 at 10:37 PM Neal Richardson <
>> neal.p.richard...@gmail.com > wrote:
>> 
>>> On behalf of the Arrow PMC, I'm happy to announce that Jonathan has
>>> accepted an invitation to become a committer on Apache Arrow. Welcome, and
>>> thank you for your contributions!
>>> 
>>> Neal
>>> 
>> 



Re: [ANNOUNCE] New Arrow committer: Ian Cook

2021-04-29 Thread Ying Zhou
Congrats Ian!

> On Apr 28, 2021, at 7:01 PM, paddy horan  wrote:
> 
> Congrats Ian!
> 
> 
> 
> From: Jorge Cardoso Leitão 
> Sent: Wednesday, April 28, 2021 4:56:12 PM
> To: dev@arrow.apache.org 
> Subject: Re: [ANNOUNCE] New Arrow committer: Ian Cook
> 
> Congratulations and thank you for your contributions :)
> 
> On Wed, Apr 28, 2021 at 10:37 PM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
> 
>> On behalf of the Arrow PMC, I'm happy to announce that Ian has accepted an
>> invitation to become a committer on Apache Arrow. Welcome, and thank you
>> for your contributions!
>> 
>> Neal
>> 



Re: [ANNOUNCE] New Arrow committer: Daniël Heres

2021-04-28 Thread Ying Zhou
Congrats Daniël! 

> On Apr 28, 2021, at 9:24 AM, Andy Grove  wrote:
> 
> On behalf of the Arrow PMC, I'm happy to announce that Daniël has
> 
> accepted an invitation to become a committer on Apache Arrow.
> 
> Welcome, and thank you for your contributions!



Re: [Python] Who has been able to use PyArrow 4.0.0?

2021-04-28 Thread Ying Zhou
Well, I do have my own dev version of libarrow (with my own modifications) 
manually installed. I can verify that the pip install went smoothly on my work 
computer, which has none of the Arrow development I do after work. Moreover, I 
did find that ORC has been re-enabled in the wheel and have used both the reader 
and the writer without issues.

As for Conda, I did manage to get pyarrow 4.0.0, but there is no ORC 
functionality, since any attempt to import from pyarrow.orc led to an error 
caused by 'pyarrow._orc isn't found'.

Ying



> On Apr 28, 2021, at 5:06 AM, Alessandro Molina  
> wrote:
> 
> Are you sure you haven't installed `libarrow` (the CPP one) manually
> independently from pyarrow?
> 
> In your traceback you have that the symbol has not been found in
> "/usr/local/lib/libarrow.400.dylib"
> 
> But that smells like an independently installed libarrow, as the libarrow
> provided by pyarrow should exist in the Python environment (in my case, for
> example, /usr/local/lib/python3.9/site-packages/pyarrow/libarrow.400.dylib).
> I suspect your system-installed libarrow is taking precedence over the
> one provided by pyarrow and the two might not match.
> 
> On Wed, Apr 28, 2021 at 10:05 AM Ying Zhou  wrote:
> 
>> Hi,
>> 
>> It turns out that I haven't been able to use PyArrow 4.0.0 either in Conda
>> environments or Python venvs. PyArrow does install using pip. However, this
>> is what I get whenever I try to use it:
>> 
>>>>> import pyarrow as pa
>> Traceback (most recent call last):
>>  File "", line 1, in 
>>  File
>> "/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/__init__.py",
>> line 63, in <module>
>>import pyarrow.lib as _lib
>> ImportError:
>> dlopen(/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
>> lib.cpython-38-darwin.so, 2): Symbol not found:
>> __ZN5arrow10StopSource5tokenEv
>>  Referenced from:
>> /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
>> lib.cpython-38-darwin.so
>>  Expected in: /usr/local/lib/libarrow.400.dylib
>> in /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/
>> lib.cpython-38-darwin.so
>>>>> pa
>> Traceback (most recent call last):
>>  File "", line 1, in 
>> NameError: name 'pa' is not defined
>> 
>> On the other hand a Conda installation is not even possible. Does anyone
>> know what’s going on?
>> 
>> Ying



Re: [Python] Who has been able to use PyArrow 4.0.0?

2021-04-28 Thread Ying Zhou
In case you are wondering, I'm on macOS 10.15.7. Because my environment is 
pretty dirty, I didn't announce it when my verification attempt failed back then.

> On Apr 28, 2021, at 4:04 AM, Ying Zhou  wrote:
> 
> Hi,
> 
> It turns out that I haven't been able to use PyArrow 4.0.0 either in Conda 
> environments or Python venvs. PyArrow does install using pip. However, this is 
> what I get whenever I try to use it:
> 
> >>> import pyarrow as pa
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/__init__.py",
>  line 63, in <module>
> import pyarrow.lib as _lib
> ImportError: 
> dlopen(/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so,
>  2): Symbol not found: __ZN5arrow10StopSource5tokenEv
>   Referenced from: 
> /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so
>   Expected in: /usr/local/lib/libarrow.400.dylib
>  in 
> /Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so
> >>> pa
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'pa' is not defined
> 
> On the other hand a Conda installation is not even possible. Does anyone know 
> what’s going on?
> 
> Ying 



[Python] Who has been able to use PyArrow 4.0.0?

2021-04-28 Thread Ying Zhou
Hi,

It turns out that I haven't been able to use PyArrow 4.0.0 either in Conda 
environments or Python venvs. PyArrow does install using pip. However, this is 
what I get whenever I try to use it:

>>> import pyarrow as pa
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/__init__.py", 
line 63, in <module>
import pyarrow.lib as _lib
ImportError: 
dlopen(/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so,
 2): Symbol not found: __ZN5arrow10StopSource5tokenEv
  Referenced from: 
/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so
  Expected in: /usr/local/lib/libarrow.400.dylib
 in 
/Users/karlkatzen/anaconda3/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so
>>> pa
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'pa' is not defined

On the other hand a Conda installation is not even possible. Does anyone know 
what’s going on?

Ying 

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-27 Thread Ying Zhou
Hmm, it seems that the PyArrow wheel doesn't actually work on my Mac. Sorry I 
didn't report any source testing issues earlier; my environment is pretty messed 
up.

(arrowvenv) (base) karlkatzen@chloes venv % python3
Python 3.8.3 (default, Jul  2 2020, 11:26:31) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/karlkatzen/Documents/code/venv/arrowvenv/lib/python3.8/site-packages/pyarrow/__init__.py",
 line 63, in <module>
import pyarrow.lib as _lib
ImportError: 
dlopen(/Users/karlkatzen/Documents/code/venv/arrowvenv/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so,
 2): Symbol not found: __ZN5arrow10StopSource5tokenEv
  Referenced from: 
/Users/karlkatzen/Documents/code/venv/arrowvenv/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so
  Expected in: /usr/local/lib/libarrow.400.dylib
 in 
/Users/karlkatzen/Documents/code/venv/arrowvenv/lib/python3.8/site-packages/pyarrow/lib.cpython-38-darwin.so

What is __ZN5arrow10StopSource5tokenEv?
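
For reference: this is an Itanium-ABI mangled C++ symbol (the extra leading 
underscore is added by the Mach-O symbol table), and it demangles to 
arrow::StopSource::token(). A minimal sketch of checking that programmatically 
with the cxxabi helper:

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
  int status = 0;
  // Pass the name without the Mach-O leading underscore.
  char* demangled = abi::__cxa_demangle("_ZN5arrow10StopSource5tokenEv",
                                        nullptr, nullptr, &status);
  if (status == 0 && demangled != nullptr) {
    std::printf("%s\n", demangled);  // prints: arrow::StopSource::token()
    std::free(demangled);
  }
  return 0;
}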

> On Apr 27, 2021, at 11:23 PM, Micah Kornfield  wrote:
> 
> Oh, nice, I thought they just missed the cutoff.
> 
> On Tue, Apr 27, 2021 at 8:19 PM Ying Zhou  wrote:
> 
>> They actually did.
>> 
>> Ying
>> 
>>> On Apr 27, 2021, at 11:11 PM, Micah Kornfield 
>> wrote:
>>> 
>>> Did the ORC additions actually make it into 4.0?
>>> 
>>> On Tue, Apr 27, 2021 at 7:55 PM Ying Zhou  wrote:
>>> 
>>>> Sure. I just added some info about the ORC writer. I think we need to
>>>> update the documentation in both C++ and Python as well to include ORC.
>> I
>>>> will do it.
>>>> 
>>>> Ying
>>>> 
>>>>> On Apr 27, 2021, at 5:28 PM, Neal Richardson <
>>>> neal.p.richard...@gmail.com> wrote:
>>>>> 
>>>>> 4.0 blog post is still pretty bare and could use some help filling in:
>>>>> https://github.com/apache/arrow-site/pull/104
>>>>> 
>>>>> Thanks,
>>>>> Neal
>>>>> 
>>>>> On Tue, Apr 27, 2021 at 1:55 PM Sutou Kouhei 
>> wrote:
>>>>> 
>>>>>> The remaining tasks:
>>>>>> 
>>>>>> 3.  [in-pr|Kou] upload binaries
>>>>>>  https://github.com/apache/arrow/pull/10172
>>>>>> 10. [Uwe] update conda recipes
>>>>>> 12. [in-pr|Ian] update homebrew packages
>>>>>>  https://github.com/Homebrew/homebrew-core/pull/76060
>>>>>> 
>>>>>> I updated versions on JIRA:
>>>>>> 
>>>>>> *
>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>>>>>> *
>>>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>>>>>> 
>>>>>> In 
>>>>>> "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr 2021
>>>>>> 10:37:04 -0500,
>>>>>> Paul Taylor  wrote:
>>>>>> 
>>>>>>> JS packages have been uploaded.
>>>>>>> 
>>>>>>> Paul
>>>>>>> 
>>>>>>> On 4/27/21 9:47 AM, Neal Richardson wrote:
>>>>>>>> R package has been accepted by CRAN.
>>>>>>>> 
>>>>>>>> Neal
>>>>>>>> 
>>>>>>>> On Tue, Apr 27, 2021 at 7:25 AM Krisztián Szűcs
>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I've just opened a PR with the updated documentation.
>>>>>>>>> 
>>>>>>>>> The remaining tasks:
>>>>>>>>> 
>>>>>>>>> 3.  [in-pr|Kou] upload binaries
>>>>>>>>> 6.  [Paul] upload js packages
>>>>>>>>> 10. [Uwe] update conda recipes
>>>>>>>>> 12. [todo] update homebrew packages
>>>>>>>>> 14. [Kou] update msys2
>>>>>>>>> 15. [Neal] update R packages
>>>>>>>>> 16. [in-pr|Krisztian] update docs

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-27 Thread Ying Zhou
They actually did.

Ying

> On Apr 27, 2021, at 11:11 PM, Micah Kornfield  wrote:
> 
> Did the ORC additions actually make it into 4.0?
> 
> On Tue, Apr 27, 2021 at 7:55 PM Ying Zhou  wrote:
> 
>> Sure. I just added some info about the ORC writer. I think we need to
>> update the documentation in both C++ and Python as well to include ORC. I
>> will do it.
>> 
>> Ying
>> 
>>> On Apr 27, 2021, at 5:28 PM, Neal Richardson <
>> neal.p.richard...@gmail.com> wrote:
>>> 
>>> 4.0 blog post is still pretty bare and could use some help filling in:
>>> https://github.com/apache/arrow-site/pull/104
>>> 
>>> Thanks,
>>> Neal
>>> 
>>> On Tue, Apr 27, 2021 at 1:55 PM Sutou Kouhei  wrote:
>>> 
>>>> The remaining tasks:
>>>> 
>>>> 3.  [in-pr|Kou] upload binaries
>>>>   https://github.com/apache/arrow/pull/10172
>>>> 10. [Uwe] update conda recipes
>>>> 12. [in-pr|Ian] update homebrew packages
>>>>   https://github.com/Homebrew/homebrew-core/pull/76060
>>>> 
>>>> I updated versions on JIRA:
>>>> 
>>>> *
>>>> 
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>>>> *
>>>> 
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>>>> 
>>>> In 
>>>> "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr 2021
>>>> 10:37:04 -0500,
>>>> Paul Taylor  wrote:
>>>> 
>>>>> JS packages have been uploaded.
>>>>> 
>>>>> Paul
>>>>> 
>>>>> On 4/27/21 9:47 AM, Neal Richardson wrote:
>>>>>> R package has been accepted by CRAN.
>>>>>> 
>>>>>> Neal
>>>>>> 
>>>>>> On Tue, Apr 27, 2021 at 7:25 AM Krisztián Szűcs
>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> I've just opened a PR with the updated documentation.
>>>>>>> 
>>>>>>> The remaining tasks:
>>>>>>> 
>>>>>>> 3.  [in-pr|Kou] upload binaries
>>>>>>> 6.  [Paul] upload js packages
>>>>>>> 10. [Uwe] update conda recipes
>>>>>>> 12. [todo] update homebrew packages
>>>>>>> 14. [Kou] update msys2
>>>>>>> 15. [Neal] update R packages
>>>>>>> 16. [in-pr|Krisztian] update docs
>>>>>>> 
>>>>>>> On Tue, Apr 27, 2021 at 2:42 PM Krisztián Szűcs
>>>>>>>  wrote:
>>>>>>>> On Tue, Apr 27, 2021 at 2:21 PM Paul Taylor <ptaylor.apa...@gmail.com> wrote:
>>>>>>>>> These look like the errors resolved in
>>>>>>>>> https://github.com/apache/arrow/pull/10156. Can we cherry-pick
>> that
>>>>>>>>> commit to the release branch?
>>>>>>>> Great, I'll cherry-pick that commit.
>>>>>>>> 
>>>>>>>> Could you please release the JS packages to npm? I think the
>>>>>>>> lerna.json needs to be updated before npm publish.
>>>>>>>> 
>>>>>>>> Thank Paul!
>>>>>>>>> 
>>>>>>>>> On 4/27/21 7:04 AM, Krisztián Szűcs wrote:
>>>>>>>>>> I'd need some help to both release the JS packages using the new
>>>>>>> lerna
>>>>>>>>>> configuration and to fix the JS documentation generation [1]. We
>>>>>>>>>> should backport these changes to the release-4.0.0 branch.
>>>>>>>>>> 
>>>>>>>>>> [1]:
>>>>>>> 
>>>> 
>> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4297=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181
>>>>>>>>>> On Tue, Apr 27, 2021 at 1:50 AM Sutou Kouhei 
>>>>>>> wrote:
>>>>>>>>>>> I'll also update MSYS2 packages:
>>>>>>>>>>> 
> 1.  [x] open a pull request to bump the version numbers in the source code

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-27 Thread Ying Zhou
Sure. I just added some info about the ORC writer. I think we need to update 
the documentation in both C++ and Python as well to include ORC. I will do it.

Ying

> On Apr 27, 2021, at 5:28 PM, Neal Richardson  
> wrote:
> 
> 4.0 blog post is still pretty bare and could use some help filling in:
> https://github.com/apache/arrow-site/pull/104
> 
> Thanks,
> Neal
> 
> On Tue, Apr 27, 2021 at 1:55 PM Sutou Kouhei  wrote:
> 
>> The remaining tasks:
>> 
>> 3.  [in-pr|Kou] upload binaries
>>https://github.com/apache/arrow/pull/10172
>> 10. [Uwe] update conda recipes
>> 12. [in-pr|Ian] update homebrew packages
>>https://github.com/Homebrew/homebrew-core/pull/76060
>> 
>> I updated versions on JIRA:
>> 
>>  *
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Markingthereleasedversionas%22RELEASED%22onJIRA
>>  *
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-StartingthenewversiononJIRA
>> 
>> In 
>>  "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr 2021
>> 10:37:04 -0500,
>>  Paul Taylor  wrote:
>> 
>>> JS packages have been uploaded.
>>> 
>>> Paul
>>> 
>>> On 4/27/21 9:47 AM, Neal Richardson wrote:
 R package has been accepted by CRAN.
 
 Neal
 
 On Tue, Apr 27, 2021 at 7:25 AM Krisztián Szűcs
 
 wrote:
 
> I've just opened a PR with the updated documentation.
> 
> The remaining tasks:
> 
> 3.  [in-pr|Kou] upload binaries
> 6.  [Paul] upload js packages
> 10. [Uwe] update conda recipes
> 12. [todo] update homebrew packages
> 14. [Kou] update msys2
> 15. [Neal] update R packages
> 16. [in-pr|Krisztian] update docs
> 
> On Tue, Apr 27, 2021 at 2:42 PM Krisztián Szűcs
>  wrote:
>> On Tue, Apr 27, 2021 at 2:21 PM Paul Taylor wrote:
>>> These look like the errors resolved in
>>> https://github.com/apache/arrow/pull/10156. Can we cherry-pick that
>>> commit to the release branch?
>> Great, I'll cherry-pick that commit.
>> 
>> Could you please release the JS packages to npm? I think the
>> lerna.json needs to be updated before npm publish.
>> 
>> Thank Paul!
>>> 
>>> On 4/27/21 7:04 AM, Krisztián Szűcs wrote:
 I'd need some help to both release the JS packages using the new
> lerna
 configuration and to fix the JS documentation generation [1]. We
 should backport these changes to the release-4.0.0 branch.
 
 [1]:
> 
>> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4297=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181
 On Tue, Apr 27, 2021 at 1:50 AM Sutou Kouhei 
> wrote:
> I'll also update MSYS2 packages:
> 
> 1.  [x] open a pull request to bump the version numbers in the
> source code
> 2.  [x] upload source
> 3.  [kou] upload binaries
> 4.  [x] update website
> 5.  [x] upload ruby gems
> 6.  [ ] upload js packages
> 8.  [x] upload C# packages
> 9.  [x] upload rust crates
> 10. [ ] update conda recipes
> 11. [x] upload wheels/sdist to pypi
> 12. [ ] update homebrew packages
> 13. [x] update maven artifacts
> 14. [kou] update msys2
> 15. [nealrichardson] update R packages
> 16. [ ] update docs
> 
> In  r...@mail.gmail.com>
>"Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr
> 2021 01:48:37 +0200,
>Krisztián Szűcs  wrote:
> 
>> On Tue, Apr 27, 2021 at 1:05 AM Andy Grove >> 
> wrote:
>>> The following Rust crates have been published: arrow,
> arrow-flight, parquet, parquet_derive, datafusion
>> Thanks Andy!
>> 
>> The current status is:
>> 1.  [x] open a pull request to bump the version numbers in the
> source code
>> 2.  [x] upload source
>> 3.  [kou] upload binaries
>> 4.  [x] update website
>> 5.  [x] upload ruby gems
>> 6.  [ ] upload js packages
>> 8.  [x] upload C# packages
>> 9.  [x] upload rust crates
>> 10. [ ] update conda recipes
>> 11. [x] upload wheels/sdist to pypi
>> 12. [ ] update homebrew packages
>> 13. [x] update maven artifacts
>> 14. [ ] update msys2
>> 15. [nealrichardson] update R packages
>> 16. [ ] update docs
>>> On Mon, Apr 26, 2021 at 4:34 PM Andy Grove <
>> andygrov...@gmail.com>
> wrote:
 Yes, I can handle the Rust release.
 
 On Mon, Apr 26, 2021, 4:17 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com> wrote:
> @Andy Grove could you please handle the rust release?
> 
> On Mon, Apr 26, 2021 at 11:51 PM Krisztián Szűcs
>  wrote:

[C++][Python] ORC in pyarrow wheels?

2021-04-19 Thread Ying Zhou
Hi,

First of all, I'd like to thank Antoine, Micah, Sutou and Uwe for reviewing and 
helping me improve the Arrow2ORC adapter, which Antoine merged into 
master earlier today! This community is really great!

Now that we have the Arrow2ORC adapter ready, I found that those who don't use 
Conda pretty much cannot benefit from either the ORC2Arrow or the Arrow2ORC 
adapter, because ORC is not available in PyArrow wheels due to statically linked 
protobuf: 
https://github.com/apache/arrow/commit/102acc47287c37a01ac11a5cb6bd1da3f1f0712d 

The main Jira ticket is https://issues.apache.org/jira/browse/ARROW-7811

Can anyone please tell me more about it? Thanks! I want to fix this.

Thanks,
Ying

Re: 4.0 release preparation

2021-04-14 Thread Ying Zhou
Thanks for your feedback last night, all of which has been addressed! I will 
keep addressing comments at least daily so that the ORC writer can be included 
without blocking the release!

Ying

> On Apr 13, 2021, at 9:07 PM, Micah Kornfield  wrote:
> 
> I'll try to take a look tonight but it might be tight if there is
> substantial feedback.
> 
> On Tue, Apr 13, 2021 at 5:29 PM Wes McKinney  wrote:
> 
>> I agree it would be good to get the ORC writer in this release.
>> 
>> On Tue, Apr 13, 2021 at 7:20 PM Ying Zhou  wrote:
>>> 
>>> What about this one? You know, the C++/Python ORC writer.
>>> https://github.com/apache/arrow/pull/8648
>>> 
>>> Ying
>>>> On Apr 13, 2021, at 12:52 PM, Neal Richardson <
>> neal.p.richard...@gmail.com> wrote:
>>>> 
>>>> I think we're getting close to closing out 4.0. Let's give until the
>> end of
>>>> Wednesday to get any outstanding pull requests reviewed and merged, and
>>>> Krisztián will plan to cut a release candidate on Thursday.
>>>> 
>>>> If there are any objections, please speak up!
>>>> 
>>>> Neal
>>>> 
>>>> On Sun, Apr 11, 2021 at 9:15 AM Antoine Pitrou 
>> wrote:
>>>> 
>>>>> 
>>>>> Le 10/04/2021 à 23:06, Weston Pace a écrit :
>>>>>> Nightly build triage (based on nightly builds from 4/9):
>>>>>> 
>>>>>> Failed Tasks:
>>>>>> - conda-linux-gcc-py36-aarch64:
>>>>>>  ARROW-12324 (conda builds timing out, conda slow)
>>>>>> - conda-linux-gcc-py37-aarch64:
>>>>>>  ARROW-12324 (conda builds timing out, conda slow)
>>>>>> - conda-osx-clang-py37-r40:
>>>>>>  Appears to have been an intermittent Azure availability error on
>>>>>> OSX.  Azure did not report a degradation at the time but they have
>>>>>> warned of longer than usual queue times for OSX.  Last several runs
>> of
>>>>>> this have passed.
>>>>>> - gandiva-jar-ubuntu:
>>>>>>  ARROW-12325 (Bug, PR submitted Friday)
>>>>>> - test-conda-cpp-valgrind:
>>>>>>  ARROW-12320 (Bug in CI configuration, PR submitted and merged.
>>>>>> However, I believe this may have been masking an actual error, should
>>>>>> check this next nightly run)
>>>>> 
>>>>> Please take a look at https://github.com/apache/arrow/pull/9839
>>>>> 
>>> 
>> 



Re: 4.0 release preparation

2021-04-13 Thread Ying Zhou
What about this one? You know, the C++/Python ORC writer. 
https://github.com/apache/arrow/pull/8648 


Ying
> On Apr 13, 2021, at 12:52 PM, Neal Richardson  
> wrote:
> 
> I think we're getting close to closing out 4.0. Let's give until the end of
> Wednesday to get any outstanding pull requests reviewed and merged, and
> Krisztián will plan to cut a release candidate on Thursday.
> 
> If there are any objections, please speak up!
> 
> Neal
> 
> On Sun, Apr 11, 2021 at 9:15 AM Antoine Pitrou  wrote:
> 
>> 
>> Le 10/04/2021 à 23:06, Weston Pace a écrit :
>>> Nightly build triage (based on nightly builds from 4/9):
>>> 
>>> Failed Tasks:
>>> - conda-linux-gcc-py36-aarch64:
>>>   ARROW-12324 (conda builds timing out, conda slow)
>>> - conda-linux-gcc-py37-aarch64:
>>>   ARROW-12324 (conda builds timing out, conda slow)
>>> - conda-osx-clang-py37-r40:
>>>   Appears to have been an intermittent Azure availability error on
>>> OSX.  Azure did not report a degradation at the time but they have
>>> warned of longer than usual queue times for OSX.  Last several runs of
>>> this have passed.
>>> - gandiva-jar-ubuntu:
>>>   ARROW-12325 (Bug, PR submitted Friday)
>>> - test-conda-cpp-valgrind:
>>>   ARROW-12320 (Bug in CI configuration, PR submitted and merged.
>>> However, I believe this may have been masking an actual error, should
>>> check this next nightly run)
>> 
>> Please take a look at https://github.com/apache/arrow/pull/9839
>> 



Re: [C++] Complex type traits

2021-04-12 Thread Ying Zhou
Thanks! I have picked option 2, since the trait is not really general.

Ying

> On Apr 12, 2021, at 12:00 AM, Micah Kornfield  wrote:
> 
> I think there are three options:
> 1.  Add it in with the existing type_traits code.
> 2.  Make the definition only in the code you are working on (I'm not sure
> if this would be generally applicable?)
> 3.  Don't use enable_if_* traits but instead use your own enable_if and
> have an "OR" expression using the underlying type_check checks (e.g.
> is_numeric_type) [1]
> 
> [1]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/type_traits.h#L491
> 
> On Sun, Apr 11, 2021 at 8:56 PM Ying Zhou  wrote:
> 
>> Hi,
>> 
>> I would like to have a variant of arrow::enable_if_number that is good
>> for numeric types, boolean and Date32, but no other type, so
>> that I don't have to repeat template specializations with essentially the
>> same code. What's the canonical way to achieve that?
>> 
>> Ying
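
A minimal sketch of option 3 above, assuming the predicate names in 
arrow/type_traits.h (is_number_type, is_boolean_type) and the enable_if_t alias:

#include <type_traits>
#include "arrow/type.h"
#include "arrow/type_traits.h"

// Local trait: true for numeric types, boolean and Date32, nothing else.
template <typename T>
struct is_number_boolean_or_date32
    : std::integral_constant<bool, arrow::is_number_type<T>::value ||
                                       arrow::is_boolean_type<T>::value ||
                                       std::is_same<T, arrow::Date32Type>::value> {};

template <typename T, typename R = void>
using enable_if_number_boolean_or_date32 =
    arrow::enable_if_t<is_number_boolean_or_date32<T>::value, R>;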



[C++] Complex type traits

2021-04-11 Thread Ying Zhou
Hi,

I would like to have a variant of arrow::enable_if_number that is good for 
numeric types, boolean and Date32, but no other type, so that I 
don't have to repeat template specializations with essentially the same code. 
What's the canonical way to achieve that?

Ying

[C++] Obtain shared_ptr of an Array from a reference

2021-03-22 Thread Ying Zhou
Hi,

I know this is a very silly question, but I still prefer to see it resolved 
rather than working on it for a day:

How shall I generate a std::shared_ptr<Array> from an Array&? Just taking the 
address and constructing a shared_ptr from the pointer doesn't work.
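
One hedged answer, assuming the MakeArray helper in arrow/array/util.h: an 
Array already holds its ArrayData by shared_ptr, so a fresh shared_ptr<Array> 
can be rebuilt from that without copying any buffers:

#include "arrow/api.h"

// Sketch: rewrap the ArrayData that the referenced Array already owns.
std::shared_ptr<arrow::Array> SharedFromRef(const arrow::Array& array) {
  return arrow::MakeArray(array.data());
}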

Ying

[C++] Fastest method to create an Array based on an existing Array with null_bitmap changed?

2021-03-20 Thread Ying Zhou
Hi,

I would like to generate an Array from an existing Array, with all data 
preserved except the null_bitmap. Shall I use Array::SetData and 
ArrayData::Make (with dictionary and child_data)?
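
A minimal sketch of one way to do it, assuming ArrayData::Copy and MakeArray; 
only buffer 0 (the validity bitmap) is replaced, and the value buffers stay 
shared:

#include "arrow/api.h"

std::shared_ptr<arrow::Array> WithNullBitmap(
    const arrow::Array& array, std::shared_ptr<arrow::Buffer> null_bitmap) {
  auto data = array.data()->Copy();             // shallow copy; buffers shared
  data->buffers[0] = std::move(null_bitmap);
  data->null_count = arrow::kUnknownNullCount;  // recomputed lazily on demand
  return arrow::MakeArray(data);
}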

Ying

[C++][Python] Do I have to make arrow::adapters::orc::WriterOptions and arrow::adapters::orc::ReaderOptions immutable?

2021-03-14 Thread Ying Zhou
Hi,

I have a question about https://github.com/apache/arrow/pull/9702 (and another 
future PR) concerning WriterOptions and ReaderOptions, which are basically code 
copied from the ORC project and then Arrow-ized so that the names are acceptable.

After going over Cython code for Parquet, I began to wonder whether I have to 
make the following changes:
1. Add 'Orc' to the beginning of all the enums and classes I adapted from the 
ORC project, e.g. arrow::adapters::orc::OrcCompressionKind instead of 
arrow::adapters::orc::CompressionKind.
2. Make arrow::adapters::orc::(Orc)ReaderOptions and 
arrow::adapters::orc::(Orc)WriterOptions immutable and instead perform all the 
mutations in their respective builder types (a minimal sketch of this pattern 
follows).
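
A minimal sketch of the immutable-options-plus-builder pattern; OrcWriterOptions 
and batch_size are hypothetical placeholder names, not the actual API:

// Hypothetical names throughout; this only illustrates the shape of item 2.
class OrcWriterOptions {
 public:
  int64_t batch_size() const { return batch_size_; }

 private:
  friend class OrcWriterOptionsBuilder;
  explicit OrcWriterOptions(int64_t batch_size) : batch_size_(batch_size) {}
  const int64_t batch_size_;
};

class OrcWriterOptionsBuilder {
 public:
  OrcWriterOptionsBuilder& batch_size(int64_t size) {
    batch_size_ = size;
    return *this;
  }
  OrcWriterOptions build() const { return OrcWriterOptions(batch_size_); }

 private:
  int64_t batch_size_ = 1024;
};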

Ying

[Python] Best practices when exposing options

2021-03-12 Thread Ying Zhou
Hi,

Currently I'm working on ARROW-11297 
(https://github.com/mathyingzhou/arrow/tree/ARROW-11297), which will be filed 
as soon as the current PR is merged. 

I managed to reimplement orc::WriterOptions in Arrow (with naming conventions 
Arrow-ized) as arrow::adapters::orc::WriterOptions (which is necessary since we 
do not allow third-party headers to be included in our public headers) and 
finished the C++ part of the work. Now I'm trying to expose WriterOptions in 
Python, and I wonder how this is supposed to be done in general. After reading 
the code in array.pxi, I think maybe this is the way I want to do it:

1. The end user will see individual ORC writer options (e.g. CompressionKind, 
that is, whether we use ZLIB, LZO, some other form of compression, or none at 
all) as keyword arguments.
2. These keyword arguments will be processed in _orc.pyx first as a dictionary 
and then, using an adapter, they will be converted into an 
arrow::adapters::orc::WriterOptions. 

Is this the right way?

Moreover, I do wonder how we should convert the enums. Shall I use a series of 
if/elif branches or a mapping dict, so that people are forced to use one of the 
correct strings or get a ValueError?

e.g.

compression_kind_mapping = {'snappy': CompressionKind._CompressionKind_SNAPPY,
                            'lzo': CompressionKind._CompressionKind_LZO}  # There are other options, this is just an example
if compression_kind not in compression_kind_mapping:
    raise ValueError("Unknown compression_kind")
c_compression_kind = compression_kind_mapping[compression_kind]

Ying

[C++][CMake] How to add new .cc & .h files?

2021-03-08 Thread Ying Zhou
Hi,

Right now I'm working on ARROW-11297, which adds WriterOptions to the ORCWriter 
(and will be a separate PR). After adding new files (adapter_options.h & 
adapter_options.cc) I found that 
src/arrow/CMakeFiles/arrow_objlib.dir/adapters/orc/adapter_options.cc.o doesn't 
exist, and as a result ld obviously cannot find the symbols that should be 
contained in it, even though I use ARROW_EXPORT to indicate that 
arrow::adapters::orc::WriterOptions does need to be exported. I wonder whether 
there is some setting in CMake I should change in order to fix that.

Here is the error message:

[ 89%] Linking CXX executable ../../../../debug/arrow-orc-adapter-test
Undefined symbols for architecture x86_64:
  "arrow::adapters::orc::WriterOptions::WriterOptions()", referenced from:
  arrow::AssertTableWriteReadEqual(std::__1::shared_ptr 
const&, std::__1::shared_ptr const&, long long) in 
adapter_test.cc.o
  "arrow::adapters::orc::WriterOptions::~WriterOptions()", referenced from:
  arrow::AssertTableWriteReadEqual(std::__1::shared_ptr 
const&, std::__1::shared_ptr const&, long long) in 
adapter_test.cc.o
  "arrow::adapters::orc::AdaptWriterOptions(arrow::adapters::orc::WriterOptions 
const&)", referenced from:
  arrow::adapters::orc::ORCFileWriter::Impl::Write(arrow::Table const&, 
arrow::adapters::orc::WriterOptions const&) in libarrow.a(adapter.cc.o)
ld: symbol(s) not found for architecture x86_64
clang-11: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[3]: *** [debug/arrow-orc-adapter-test] Error 1
make[2]: *** [src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/all] 
Error 2
make[1]: *** [CMakeFiles/unittest.dir/rule] Error 2
make: *** [unittest] Error 2


Ying

[GLib][Ruby] Testing issues

2021-03-06 Thread Ying Zhou
Hi,

As work in C++ inevitably affects C GLib and Ruby, it is necessary for me to be 
able to test them locally. I followed the instructions here for Macs, and Arrow 
GLib for developers was installed. However, I cannot run the GLib tests with 
'bundle exec test/run-test.sh'; it looks like there might be some path problem.

Here is the error message I got. Does anyone know what the problem is? (In case 
you wonder, '/usr/local/lib/libparquet.400.dylib' does exist.)

(NULL)-WARNING **: Failed to load shared library 
'/usr/local/lib/libparquet-glib.400.dylib' referenced by the typelib: 
dlopen(/usr/local/lib/libparquet-glib.400.dylib, 0x0009): dependent dylib 
'@rpath/libparquet.400.dylib' not found for 
'/usr/local/lib/libparquet-glib.400.dylib'. relative file paths not allowed 
'@rpath/libparquet.400.dylib'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/loader.rb:215:in
 `load_object_info'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/loader.rb:68:in
 `load_info'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/loader.rb:43:in
 `block in load'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/repository.rb:34:in
 `block (2 levels) in each'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/repository.rb:33:in
 `times'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/repository.rb:33:in
 `block in each'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/repository.rb:32:in
 `each'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/repository.rb:32:in
 `each'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection/loader.rb:42:in
 `load'
from 
/Library/Ruby/Gems/2.6.0/gems/gobject-introspection-3.4.3/lib/gobject-introspection.rb:44:in
 `load'
from 
/Users/karlkatzen/Documents/code/arrow-dev/arrow/c_glib/test/run-test.rb:60:in 
`<main>'
Traceback (most recent call last):
17: from 
/Users/karlkatzen/Documents/code/arrow-dev/arrow/c_glib/test/run-test.rb:80:in 
`<main>'
16: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/autorunner.rb:66:in 
`run'
15: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/autorunner.rb:434:in
 `run'
14: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/autorunner.rb:106:in
 `block in '
13: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:38:in
 `collect'
12: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:136:in
 `add_load_path'
11: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:43:in
 `block in collect'
10: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:43:in
 `each'
 9: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:46:in
 `block (2 levels) in collect'
 8: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:85:in
 `collect_recursive'
 7: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:85:in
 `each'
 6: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:87:in
 `block in collect_recursive'
 5: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:112:in
 `collect_file'
 4: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:136:in
 `add_load_path'
 3: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:114:in
 `block in collect_file'
 2: from 
/Library/Ruby/Gems/2.6.0/gems/test-unit-3.4.0/lib/test/unit/collector/load.rb:114:in
 `require'
 1: from 
/Users/karlkatzen/Documents/code/arrow-dev/arrow/c_glib/test/test-extension-data-type.rb:18:in
 `'
/Users/karlkatzen/Documents/code/arrow-dev/arrow/c_glib/test/test-extension-data-type.rb:19:in
 `': uninitialized constant Arrow::ExtensionArray 
(NameError)

Moreover, I cannot find any way to install Red Arrow for development. Please 
let me know how that can be done. Thanks again!

Ying

[C++] Generating random Date64 & Timestamp arrays

2021-03-03 Thread Ying Zhou
Hi,

I'd like to generate random Date64 and Timestamp arrays with artificial maximum 
and minimum values. RandomArrayGenerator::ArrayOf in arrow/testing/random.h does 
not help. Currently the approach I'd like to take is to use 
RandomArrayGenerator::Int64 to generate a random int64 array and then convert it 
to a Date64/Timestamp array through some form of reinterpretation at the 
ArrayData level. Does that work? If so, is it the best approach? Thanks!
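
A minimal sketch of the reinterpretation idea, assuming 
RandomArrayGenerator::Int64, ArrayData::Copy and MakeArray; the int64 buffers 
are rewrapped under a timestamp type without copying:

#include "arrow/api.h"
#include "arrow/testing/random.h"

std::shared_ptr<arrow::Array> RandomTimestamps(
    arrow::random::RandomArrayGenerator* rand, int64_t length,
    int64_t min, int64_t max, double null_probability) {
  auto ints = rand->Int64(length, min, max, null_probability);
  auto data = ints->data()->Copy();  // shallow copy; buffers stay shared
  data->type = arrow::timestamp(arrow::TimeUnit::MICRO);
  return arrow::MakeArray(data);
}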

Ying

Re: [VOTE] Allow source-only release vote for patch releases

2021-02-27 Thread Ying Zhou
+1 (non-binding)

> On Feb 27, 2021, at 11:19 AM, Neal Richardson  
> wrote:
> 
> We've had some discussion about ways to reduce the cost of releasing and
> ways to allow maintainers of subprojects to make more frequent maintenance
> releases. In [1] we proposed allowing maintenance/patch releases on which
> we vote only to approve the source package, unlike our quarterly major
> releases, where we vote on the source and on most binary packages as well.
> Maintainers of the various implementations and subprojects may choose to
> build and publish binary artifacts from these patch release sources after
> the release vote, if there are relevant bug fixes in the patch release.
> This procedure will allow us to make patch releases more easily, and we
> maintain our shared mapping between a GitHub tag/commit and a release
> number across all subprojects.
> 
> Please vote whether to adopt the patch release procedure. The vote will be
> open for at least 72 hours.
> 
> [ ] +1 Allow source-only patch release votes
> [ ] +0
> [ ] -1 Do not allow source-only patch release votes because...
> 
> Here is my vote: +1
> 
> Thanks,
> Neal
> 
> [1]:
> https://lists.apache.org/thread.html/r0ff484bcf8e410730ddcba447ff0610e7138f16d035c43a4015da187%40%3Cdev.arrow.apache.org%3E



[C++] Breakpoints and VSCode integration

2021-02-25 Thread Ying Zhou
Hi,

To facilitate faster debugging, I'd like to integrate 'make unittest' debugging 
into VSCode (on Mac), so that when I run a test that might expose some bugs, 
breakpoints can stop the execution and I can dig around a bit. Does anyone 
know how that can be done? I know it is a stupid question, but it does need to 
be addressed so that I can finish the ORC writer with visitors ASAP.

Thanks,
Ying

[C++] The best method to pass null from struct to its children & visitors

2021-02-18 Thread Ying Zhou
Hi,

Now I'm working on fixing the last concerns on my ORC writer 
(https://github.com/apache/arrow/pull/8648) and have two questions. 

I need to standardize an Arrow Array so that it is fit for cheaper conversion 
into ORC, by making sure that all the children (and grandchildren, etc.) of 
null struct entries are null. Is there an established method to achieve that? 
It would also be very helpful if there is some fast and canonical method 
to standardize an Array and ensure that null List/LargeList/FixedSizeList/Map 
entries have zero lengths in their value/key/item arrays.

I'm about to switch all my Write*Batch functions to use ArrayDataInlineVisitor 
(or maybe ArrayDataVisitor, since it is used more often?). I have a concern 
about the feasibility of using visitors for nested types: it doesn't seem like 
ArrayDataVisitor supports these types. Is that true? If so, shall I use 
visitors for non-nested types while using for loops for nested ones?
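
For the primitive case, a minimal sketch of the inline visitor, assuming 
VisitArrayDataInline from arrow/visitor_inline.h; nested types expose no scalar 
c_type to visit, which is why hand-written loops remain necessary for them:

#include "arrow/api.h"
#include "arrow/visitor_inline.h"

// Sum the valid slots of an int32 array; the first functor runs per valid
// slot, the second per null slot.
int64_t SumInt32(const arrow::ArrayData& data) {
  int64_t sum = 0;
  arrow::VisitArrayDataInline<arrow::Int32Type>(
      data,
      [&](int32_t value) { sum += value; },
      []() { /* null slot: nothing to add */ });
  return sum;
}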

Thanks,
Ying

Re: [C++] Why are these two tables unequal?

2021-02-10 Thread Ying Zhou
Yup, that doesn't change anything. I have just pushed this to 
https://github.com/apache/arrow/pull/8648. Please take a look. Thanks a lot!

TEST(TestAdapterWriteNested, writeList) {
  std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
  int64_t num_rows = 1;
  arrow::random::RandomArrayGenerator rand(kRandomSeed);
  auto value_array = rand.ArrayOf(int32(), 5 * num_rows, 0.6);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 0.8);
  std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
  std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});

  std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
      io::BufferOutputStream::Create(kDefaultSmallMemStreamSize * 15).ValueOrDie();
  std::unique_ptr<adapters::orc::ORCFileWriter> writer =
      adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
  ARROW_EXPECT_OK(writer->Write(*table));
  ARROW_EXPECT_OK(writer->Close());
  std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
  std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
  std::unique_ptr<adapters::orc::ORCFileReader> reader;
  ARROW_EXPECT_OK(
      adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
  std::shared_ptr<Table> actual_output_table;
  ARROW_EXPECT_OK(reader->Read(&actual_output_table));
  auto actual_array =
      std::static_pointer_cast<ListArray>(actual_output_table->column(0)->chunk(0));
  auto expected_array = std::static_pointer_cast<ListArray>(table->column(0)->chunk(0));
  AssertArraysEqual(*(actual_array->offsets()), *(expected_array->offsets()));
  AssertArraysEqual(*(actual_array->values()), *(expected_array->values()));
  AssertBufferEqual(*(actual_array->null_bitmap()), *(expected_array->null_bitmap()));
  ASSERT_TRUE(actual_array->type()->Equals(*(expected_array->type()), true));
  RecordProperty("output_type", actual_array->type()->ToString());
  RecordProperty("input_type", expected_array->type()->ToString());
  RecordProperty("array_equality", actual_array->Equals(*expected_array));
}








> On Feb 10, 2021, at 12:43 PM, Antoine Pitrou  wrote:
> 
> check_metadata = true



Re: [C++] Why are these two tables unequal?

2021-02-10 Thread Ying Zhou
Not really. So what’s really going on?!

TEST(TestAdapterWriteNested, writeList) {
  std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
  int64_t num_rows = 1;
  arrow::random::RandomArrayGenerator rand(kRandomSeed);
  auto value_array = rand.ArrayOf(int32(), 5 * num_rows, 0.6);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 0.8);
  std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
  std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});

  std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
      io::BufferOutputStream::Create(kDefaultSmallMemStreamSize * 15).ValueOrDie();
  std::unique_ptr<adapters::orc::ORCFileWriter> writer =
      adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
  ARROW_EXPECT_OK(writer->Write(*table));
  ARROW_EXPECT_OK(writer->Close());
  std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
  std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
  std::unique_ptr<adapters::orc::ORCFileReader> reader;
  ARROW_EXPECT_OK(
      adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
  std::shared_ptr<Table> actual_output_table;
  ARROW_EXPECT_OK(reader->Read(&actual_output_table));
  auto actual_array =
      std::static_pointer_cast<ListArray>(actual_output_table->column(0)->chunk(0));
  auto expected_array = std::static_pointer_cast<ListArray>(table->column(0)->chunk(0));
  AssertArraysEqual(*(actual_array->offsets()), *(expected_array->offsets()));
  AssertArraysEqual(*(actual_array->values()), *(expected_array->values()));
  AssertBufferEqual(*(actual_array->null_bitmap()), *(expected_array->null_bitmap()));
  ASSERT_TRUE(actual_array->type()->Equals(*(expected_array->type())));
  RecordProperty("output_type", actual_array->type()->ToString());
  RecordProperty("input_type", expected_array->type()->ToString());
  RecordProperty("array_equality", actual_array->Equals(*expected_array));
}








> On Feb 10, 2021, at 12:10 PM, Antoine Pitrou  wrote:
> 
> 
> Hmm, perhaps the types are unequal, then.  Can you print them out
> (including field metadata)?
> 
> 
> Le 10/02/2021 à 18:03, Ying Zhou a écrit :
>> Thanks! Now we have an even weirder phenomenon. Even the null bitmaps and 
>> offsets are equal. However the arrays aren’t! Does anyone know why?
>> 
>> TEST(TestAdapterWriteNested, writeList) {
>>   std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
>>   int64_t num_rows = 1;
>>   arrow::random::RandomArrayGenerator rand(kRandomSeed);
>>   auto value_array = rand.ArrayOf(int32(), 5 * num_rows, 0.6);
>>   std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 0.8);
>>   std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
>>   std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});
>> 
>>   std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
>>       io::BufferOutputStream::Create(kDefaultSmallMemStreamSize * 15).ValueOrDie();
>>   std::unique_ptr<adapters::orc::ORCFileWriter> writer =
>>       adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
>>   ARROW_EXPECT_OK(writer->Write(*table));
>>   ARROW_EXPECT_OK(writer->Close());
>>   std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
>>   std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
>>   std::unique_ptr<adapters::orc::ORCFileReader> reader;
>>   ARROW_EXPECT_OK(
>>       adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
>>   std::shared_ptr<Table> actual_output_table;
>>   ARROW_EXPECT_OK(reader->Read(&actual_output_table));
>>   auto actual_array =
>>       std::static_pointer_cast<ListArray>(actual_output_table->column(0)->chunk(0));
>>   auto expected_array = std::static_pointer_cast<ListArray>(table->column(0)->chunk(0));
>>   AssertArraysEqual(*(actual_array->offsets()), *(expected_array->offsets()));
>>   AssertArraysEqual(*(actual_array->values()), *(expected_array->values()));
>>   AssertBufferEqual(*(actual_array->null_bitmap()), *(expected_array->null_bitmap()));
>>   RecordProperty("array_equality", actual_array->Equals(*expected_array));
>> }
>> 
>> [JUnit XML test output elided by the archive; the recorded array_equality property was false.]
>> 
>>> On Feb 10, 2021, at 3:52 AM, Antoine Pitrou  wrote:
>>> 
>>> 
>>> Hi Ying,
>>> 
>>> Hmm, yes, this may be related to the null bitmaps, or the offsets.
>>> Can you try to inspect or pretty-print the offsets arrays for the two
>>> list arrays?
>>> 
>>> Regards
>>> 
>>> Antoine.
>>> 
>>> 

Re: [C++] Why are these two tables unequal?

2021-02-10 Thread Ying Zhou
Thanks! Now we have an even weirder phenomenon. Even the null bitmaps and 
offsets are equal. However the arrays aren’t! Does anyone know why?

TEST(TestAdapterWriteNested, writeList) {
  std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
  int64_t num_rows = 1;
  arrow::random::RandomArrayGenerator rand(kRandomSeed);
  auto value_array = rand.ArrayOf(int32(), 5 * num_rows, 0.6);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 0.8);
  std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
  std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});

  std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
      io::BufferOutputStream::Create(kDefaultSmallMemStreamSize * 15).ValueOrDie();
  std::unique_ptr<adapters::orc::ORCFileWriter> writer =
      adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
  ARROW_EXPECT_OK(writer->Write(*table));
  ARROW_EXPECT_OK(writer->Close());
  std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
  std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
  std::unique_ptr<adapters::orc::ORCFileReader> reader;
  ARROW_EXPECT_OK(
      adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
  std::shared_ptr<Table> actual_output_table;
  ARROW_EXPECT_OK(reader->Read(&actual_output_table));
  auto actual_array =
      std::static_pointer_cast<ListArray>(actual_output_table->column(0)->chunk(0));
  auto expected_array = std::static_pointer_cast<ListArray>(table->column(0)->chunk(0));
  AssertArraysEqual(*(actual_array->offsets()), *(expected_array->offsets()));
  AssertArraysEqual(*(actual_array->values()), *(expected_array->values()));
  AssertBufferEqual(*(actual_array->null_bitmap()), *(expected_array->null_bitmap()));
  RecordProperty("array_equality", actual_array->Equals(*expected_array));
}







> On Feb 10, 2021, at 3:52 AM, Antoine Pitrou  wrote:
> 
> 
> Hi Ying,
> 
> Hmm, yes, this may be related to the null bitmaps, or the offsets.
> Can you try to inspect or pretty-print the offsets arrays for the two
> list arrays?
> 
> Regards
> 
> Antoine.
> 
> 
> Le 10/02/2021 à 03:26, Ying Zhou a écrit :
>> Hi,
>> 
>> This is an extremely weird phenomenon. There are two 2x1 tables that are 
>> supposedly different, as I got a confusing error message like this:
>> 
>> [ RUN  ] TestAdapterWriteNested.writeList
>> /Users/karlkatzen/Documents/code/arrow-dev/arrow/cpp/src/arrow/testing/gtest_util.cc:459:
>>  Failure
>> Failed
>> Unequal at absolute position 2
>> Expected:
>>  [
>>[
>>  null,
>>  1074834796,
>>  null,
>>  null
>>],
>>null
>>  ]
>> Actual:
>>  [
>>[
>>  null,
>>  1074834796,
>>  null,
>>  null
>>],
>>null
>>  ]
>> [  FAILED  ] TestAdapterWriteNested.writeList (2 ms)
>> 
>> Here is the code that causes the issue:
>> 
>> TEST(TestAdapterWriteNested, writeList) {
>>   std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
>>   int64_t num_rows = 2;
>>   arrow::random::RandomArrayGenerator rand(kRandomSeed);
>>   auto value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.6);
>>   std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 1);
>>   std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
>>   std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});
>>   AssertTableWriteReadEqual(table, table, kDefaultSmallMemStreamSize * 5);
>> }
>> 
>> Here AssertTableWriteReadEqual is a function I use to test that 
>> from_orc(to_orc(table_in)) == expected_table_out. The function did not have 
>> issues before.
>> 
>> void AssertTableWriteReadEqual(const std::shared_ptr<Table>& input_table,
>>                                const std::shared_ptr<Table>& expected_output_table,
>>                                const int64_t max_size = kDefaultSmallMemStreamSize) {
>>   std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
>>       io::BufferOutputStream::Create(max_size).ValueOrDie();
>>   std::unique_ptr<adapters::orc::ORCFileWriter> writer =
>>       adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
>>   ARROW_EXPECT_OK(writer->Write(*input_table));
>>   ARROW_EXPECT_OK(writer->Close());
>>   std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
>>   std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
>>   std::unique_ptr<adapters::orc::ORCFileReader> reader;
>>   ARROW_EXPECT_OK(
>>       adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
>>   std::shared_ptr<Table> actual_output_table;
>>   ARROW_EXPECT_OK(reader->Read(&actual_output_table));
>>   AssertTablesEqual(*actual_output_table, *expected_output_table, false, false);
>> }
>> 
>> I strongly suspect that this is related to the null bitmaps. What do you 
>> guys think?
>> 
>> Ying
>> 



[C++] Why are these two tables unequal?

2021-02-09 Thread Ying Zhou
Hi,

This is an extremely weird phenomenon. There are two 2x1 tables that are 
supposedly different, as I got a confusing error message like this:

[ RUN  ] TestAdapterWriteNested.writeList
/Users/karlkatzen/Documents/code/arrow-dev/arrow/cpp/src/arrow/testing/gtest_util.cc:459:
 Failure
Failed
Unequal at absolute position 2
Expected:
  [
[
  null,
  1074834796,
  null,
  null
],
null
  ]
Actual:
  [
[
  null,
  1074834796,
  null,
  null
],
null
  ]
[  FAILED  ] TestAdapterWriteNested.writeList (2 ms)

Here is the code that causes the issue:

TEST(TestAdapterWriteNested, writeList) {
  std::shared_ptr<Schema> table_schema = schema({field("list", list(int32()))});
  int64_t num_rows = 2;
  arrow::random::RandomArrayGenerator rand(kRandomSeed);
  auto value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.6);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows + 1, 1);
  std::shared_ptr<ChunkedArray> chunked_array = std::make_shared<ChunkedArray>(array);
  std::shared_ptr<Table> table = Table::Make(table_schema, {chunked_array});
  AssertTableWriteReadEqual(table, table, kDefaultSmallMemStreamSize * 5);
}

Here AssertTableWriteReadEqual is a function I use to test that 
from_orc(to_orc(table_in)) == expected_table_out. The function did not have 
issues before.

void AssertTableWriteReadEqual(const std::shared_ptr<Table>& input_table,
                               const std::shared_ptr<Table>& expected_output_table,
                               const int64_t max_size = kDefaultSmallMemStreamSize) {
  std::shared_ptr<io::BufferOutputStream> buffer_output_stream =
      io::BufferOutputStream::Create(max_size).ValueOrDie();
  std::unique_ptr<adapters::orc::ORCFileWriter> writer =
      adapters::orc::ORCFileWriter::Open(*buffer_output_stream).ValueOrDie();
  ARROW_EXPECT_OK(writer->Write(*input_table));
  ARROW_EXPECT_OK(writer->Close());
  std::shared_ptr<Buffer> buffer = buffer_output_stream->Finish().ValueOrDie();
  std::shared_ptr<io::RandomAccessFile> in_stream(new io::BufferReader(buffer));
  std::unique_ptr<adapters::orc::ORCFileReader> reader;
  ARROW_EXPECT_OK(
      adapters::orc::ORCFileReader::Open(in_stream, default_memory_pool(), &reader));
  std::shared_ptr<Table> actual_output_table;
  ARROW_EXPECT_OK(reader->Read(&actual_output_table));
  AssertTablesEqual(*actual_output_table, *expected_output_table, false, false);
}

I strongly suspect that this is related to the null bitmaps. What do you guys 
think?

Ying

Re: [C++] RandomArrayGenerator::List bugs

2021-02-07 Thread Ying Zhou
A Jira ticket on this bug has been filed: 
https://issues.apache.org/jira/browse/ARROW-11548

> On Feb 7, 2021, at 3:29 PM, Ying Zhou  wrote:
> 
> Hi,
> 
> Recently I found a weird bug in RandomArrayGenerator.
> 
> RandomArrayGenerator::List consistently produces ListArrays whose length is
> one below what it should be according to the documentation. Moreover, the
> bitmaps we get are weird.
> 
> Here is some simple test:
> 
> TEST(TestAdapterWriteNested, ListTest) {
>   int64_t num_rows = 2;
>   static constexpr random::SeedType kRandomSeed2 = 0x0ff1ce;
>   arrow::random::RandomArrayGenerator rand(kRandomSeed2);
>   std::shared_ptr<Array> value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.2);
>   std::shared_ptr<Array> array = rand.List(*value_array, num_rows, 1);
>   RecordProperty("bitmap", *(array->null_bitmap_data()));
>   RecordProperty("length", array->length());
>   RecordProperty("array", array->ToString());
> }
> 
> Here are the results:
> 
> [testcase XML stripped by the archive; only timestamp="2021-02-07T15:23:16" 
> classname="TestAdapterWriteNested" survived, not the recorded "bitmap", 
> "length", and "array" property values]
> 
> Here is what RandomArrayGenerator::List should do:
> 
>   /// \brief Generate a random ListArray
>   ///
>   /// \param[in] values The underlying values array
>   /// \param[in] size The size of the generated list array
>   /// \param[in] null_probability the probability of a list value being null
>   /// \param[in] force_empty_nulls if true, null list entries must have 0 
> length
>   ///
>   /// \return a generated Array
>   std::shared_ptr<Array> List(const Array& values, int64_t size,
>                               double null_probability,
>                               bool force_empty_nulls = false);
> 
> Note that the generator failed in at least three aspects:
> 1. The length of the generated array is too low.
> 2. Even when null_probability is set to 1 there are still 1s in the bitmap. 
> 3. The size of the bitmap is larger than the size of the Array.
> 
> I’d like to know where we can find tests for arrow/testing/random. If they 
> are absent I need to write them.
> 
> Thanks,
> Ying
> 



[C++] RandomArrayGenerator::List bugs

2021-02-07 Thread Ying Zhou
Hi,

Recently I found a weird bug in RandomArrayGenerator.

RandomArrayGenerator::List consistently produces ListArrays whose length is one 
below what it should be according to the documentation. Moreover, the bitmaps 
we get are weird.

Here is some simple test:

TEST(TestAdapterWriteNested, ListTest) {
  int64_t num_rows = 2;
  static constexpr random::SeedType kRandomSeed2 = 0x0ff1ce;
  arrow::random::RandomArrayGenerator rand(kRandomSeed2);
  std::shared_ptr<Array> value_array = rand.ArrayOf(int32(), 2 * num_rows, 0.2);
  std::shared_ptr<Array> array = rand.List(*value_array, num_rows, 1);
  RecordProperty("bitmap", *(array->null_bitmap_data()));
  RecordProperty("length", array->length());
  RecordProperty("array", array->ToString());
}

Here are the results:

[testcase XML with the recorded "bitmap", "length", and "array" properties was 
stripped by the archive]

Here is what RandomArrayGenerator::List should do:

  /// \brief Generate a random ListArray
  ///
  /// \param[in] values The underlying values array
  /// \param[in] size The size of the generated list array
  /// \param[in] null_probability the probability of a list value being null
  /// \param[in] force_empty_nulls if true, null list entries must have 0 length
  ///
  /// \return a generated Array
  std::shared_ptr<Array> List(const Array& values, int64_t size,
                              double null_probability,
                              bool force_empty_nulls = false);

Note that the generator failed in at least three aspects:
1. The length of the generated array is too low.
2. Even when null_probability is set to 1 there are still 1s in the bitmap. 
3. The size of the bitmap is larger than the size of the Array.
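
If tests turn out to be missing, a regression test along these lines (a 
sketch; the test name is hypothetical) would pin the documented contract down:

#include <arrow/testing/random.h>
#include <arrow/type.h>

#include <gtest/gtest.h>

// The documented contract: the result has exactly `size` entries, and with
// null_probability = 1 every entry should be null.
TEST(TestRandomArrayGenerator, ListRespectsSizeAndNullProbability) {
  arrow::random::RandomArrayGenerator rand(0x0ff1ce);
  auto values = rand.ArrayOf(arrow::int32(), /*size=*/4, /*null_probability=*/0.2);
  auto list = rand.List(*values, /*size=*/2, /*null_probability=*/1.0);
  ASSERT_EQ(list->length(), 2);
  ASSERT_EQ(list->null_count(), 2);
}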

I’d like to know where we can find tests for arrow/testing/random. If they are 
absent I need to write them.

Thanks,
Ying



Re: Computational Kernels: the project overview

2021-02-05 Thread Ying Zhou
Hi,

Speaking of the computational kernels, I found that Cast needs significant 
improvement. Right now it cannot cast a FixedSizeBinary array to a Binary one, 
which makes my ORC tests unusually long. I plan to significantly expand it 
within 2 months to include nested types and make ORC (and maybe Parquet, 
Feather, CSV etc.) testing much simpler. (In case you wonder why this is 
needed: since Arrow generally has many more types than other formats, 
to_arrow(from_arrow(table)) and table are usually not equal, and casting is 
necessary.) Is this something we want to work on?
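
For concreteness, this is the kind of call that fails today (a minimal sketch; 
fsb_array stands for any FixedSizeBinary array):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Attempt FixedSizeBinary -> Binary. Today this comes back with a
// NotImplemented status instead of the casted array.
arrow::Status TryFixedSizeBinaryToBinary(
    const std::shared_ptr<arrow::Array>& fsb_array) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> casted,
                        arrow::compute::Cast(*fsb_array, arrow::binary()));
  (void)casted;  // a real caller would use the casted array
  return arrow::Status::OK();
}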

Ying

> On Nov 21, 2020, at 6:08 AM, Kirill Lykov  wrote:
> 
> Hi,
> 
> There are some computations kernels in arrow and it looks that this part is
> in active development right now. I wonder if there is a document / some
> emails describing what is the goal and uses cases for this part of the code
> base. Would be very interesting to know a bit more and I would like to
> contribute at some point.
> I'm interested because I develop a Proof-of-concept for a declarative
> language to perform statistical computations on top of gandiva.
> 
> -- 
> Best regards,
> Kirill Lykov



[C++] Enhancements to random Array/ChunkedArray/Table generator as a separate PR?

2021-01-31 Thread Ying Zhou
Hi,

As a part of the process of reducing test size in this pull request 
https://github.com/apache/arrow/pull/8648 which contains the ORC writer for 
C++ and Python, I wrote a random chunked array generator and a random table 
generator. To reduce test size to ideal levels it will be necessary to improve 
arrow::random::RandomArrayGenerator::ArrayOf to support nested types. I don't 
think such work really belongs in the ORC writer PR. Shall I first try to get 
this PR to pass and then file a separate one with improvements in 
arrow/testing/random, or shall I file them together as one PR? Thanks!

Ying

Re: [C++] Shall we modify the ORC reader?

2021-01-28 Thread Ying Zhou
Hi,

Many thanks, Deepak!

I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now, and 
it's not a serious hassle to do so. Is there anyone who needs the 
read-ORC-maps-as-lists-of-structs functionality? If not, I will do it, likely 
in my current PR.

Ying

> On Jan 19, 2021, at 8:45 PM, Deepak Majeti  wrote:
> 
> Hi Ying,
> 
> I can help review/merge any ORC C++ contributions.
> 
> 
> On Thu, Jan 14, 2021 at 6:57 PM Ying Zhou  wrote:
> 
>> Well, I haven’t found any. Thankfully ORC does work and I can figure out
>> how it works by testing using simple examples. However I have never managed
>> to contact the ORC community at all. They have never responded to any of my
>> emails to d...@orc.apache.org <mailto:d...@orc.apache.org> I do want to add
>> Snappy write support (which was actually already done 2 years ago by
>> someone else but due to lack of unit testing it was never merged into
>> master. I can write the tests.) and maybe Decimal256 to ORC C++ if they are
>> willing to review and merge them. If anyone has successfully contacted the
>> ORC community please let me know how.
>> 
>> Best,
>> Ying
>> 
>>> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou  wrote:
>>> 
>>> 
>>> Hi Ying,
>>> 
>>> Is there a semantic description of the ORC data types somewhere?
>>> I've read through https://orc.apache.org/docs/types.html and
>>> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
>>> to explain the intent and constraints of each of the data types.
>>> 
>>> Regards
>>> 
>>> Antoine.
>>> 
>>> 
>>> 
>>> 
>>> On Mon, 11 Jan 2021 21:15:05 -0500
>>> Ying Zhou  wrote:
>>>> Thanks! What about 3?
>>>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs
>> with fields of the structs named ‘key’ and ‘value’?
>>>> 
>>>> 
>>>> 
>>>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau 
>> wrote:
>>>>> 
>>>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
>>>>> reading 2gb strings or lists with 2B objects in them. Saying we just
>> don't
>>>>> support that pattern seems fine for now. I also believe the string and
>> list
>>>>> types have better cross-language support than the large variants.
>>>>> 
>>>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou  wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> While finishing the ORC writer in C++ I found that the ORC reader
>> treats
>>>>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>>>>>> https://issues.apache.org/jira/browse/ARROW-7 <
>>>>>> https://issues.apache.org/jira/browse/ARROW-7>
>>>>>> 
>>>>>> After starting to work on ORC tickets mostly filed by myself I began
>> to
>>>>>> worry that the type mappings in the ORC reader might already be used
>> by
>>>>>> users of Arrow. I wonder whether we should grandfather the issues or
>>>>>> gradually switch to a new type mapping.
>>>>>> 
>>>>>> Here are my proposed changes:
>>>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING
>> type
>>>>>> instead of STRING type since it is large.
>>>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>>>>>> instead of LIST type since it is large.
>>>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead
>> of
>>>>>> list of structs with hardcoded field names as long as
>>>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ying
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> -- 
> regards,
> Deepak Majeti



[C++] Random table generator and table converter

2021-01-27 Thread Ying Zhou
Hi,

For the C++ tests for the ORC writer there are two functions I need which can 
significantly shorten the tests, namely a generic table generator and a table 
converter. 

For the former, I know there is arrow/testing/random.h which can generate 
random arrays. Shall I generate random struct arrays using ArrayOf and then 
expand them into RecordBatches, or alternatively shall I generate each array 
separately using ArrayOf and then combine them? By the way, I haven't found 
any function that can directly generate an Arrow Table from a schema, size and 
null_probability. Is there any need for such functionality? If this is useful 
for purposes beyond ORC/Parquet/CSV/etc. IO, maybe we should write one.
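
To make the question concrete, here is roughly what I have in mind (a sketch 
only; RandomTable is a hypothetical name, and it assumes ArrayOf supports 
every field type in the schema):

#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/testing/random.h>

// Generate one random column per schema field, then assemble a Table.
std::shared_ptr<arrow::Table> RandomTable(
    arrow::random::RandomArrayGenerator* rand,
    const std::shared_ptr<arrow::Schema>& schema, int64_t size,
    double null_probability) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(schema->num_fields());
  for (const auto& field : schema->fields()) {
    columns.push_back(rand->ArrayOf(field->type(), size, null_probability));
  }
  return arrow::Table::Make(schema, columns, size);
}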

For the latter, what I need is a table converter that can recursively convert 
every instance of LargeBinary and FixedSizeBinary into Binary, every instance 
of LargeString into String, every instance of Date64 into Timestamp (unit = 
MILLI), every instance of LargeList and FixedSizeList into List, and maybe 
every instance of Map into List of Structs, so that I can independently 
produce the expected ORCReader(ORCWriter(Table)) and verify that the ORCWriter 
is working as intended. For this problem I have at least two possible 
approaches: either perform the conversion mainly at array level or mainly at 
scalar level. Which one is better?
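
Either way, the type mapping itself would be mechanical; here is a sketch of 
the kind of recursive mapping I mean (OrcRoundTripType is a hypothetical name, 
it covers only the substitutions listed above, and struct/map children are 
elided):

#include <memory>

#include <arrow/api.h>

// Map an Arrow type to the type the ORC write-then-read round trip returns.
std::shared_ptr<arrow::DataType> OrcRoundTripType(
    const std::shared_ptr<arrow::DataType>& type) {
  switch (type->id()) {
    case arrow::Type::LARGE_BINARY:
    case arrow::Type::FIXED_SIZE_BINARY:
      return arrow::binary();
    case arrow::Type::LARGE_STRING:
      return arrow::utf8();
    case arrow::Type::DATE64:
      return arrow::timestamp(arrow::TimeUnit::MILLI);
    case arrow::Type::LARGE_LIST:
    case arrow::Type::FIXED_SIZE_LIST: {
      auto child = type->field(0);
      return arrow::list(child->WithType(OrcRoundTripType(child->type())));
    }
    default:
      return type;  // struct/map children elided in this sketch
  }
}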

Thanks,
Ying

P.S. Thanks Antoine and Uwe for the very helpful reviews! The current codebase 
is already very different from the one when it was last reviewed. :)
P.P.S. The table converter is unavoidable because Arrow has a lot more types 
than ORC.

Plasma C++ error in Travis CI

2021-01-24 Thread Ying Zhou
Hi,

While refactoring my ORC writer so that Antoine and Uwe's suggestions are 
implemented, I found this weird Travis CI error caused by Plasma. Since Plasma 
is no longer maintained, do we really need to have it in our Travis CI tests? 
Thanks!

Ying
P.S. The job log is here: 
https://travis-ci.com/github/apache/arrow/jobs/474756034
The error message is contained below:

[33/267] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx.o
FAILED: src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx.o
/usr/bin/ccache /usr/bin/c++ -DARROW_EXPORTING -DARROW_HAVE_NEON -DARROW_JEMALLOC 
-DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_MIMALLOC -DARROW_NO_DEPRECATED_API 
-DARROW_WITH_RE2 -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DGTEST_LINKED_AS_SHARED_LIBRARY=1 -Isrc -I/arrow/cpp/src -I/arrow/cpp/src/generated 
-isystem /arrow/cpp/thirdparty/flatbuffers/include -isystem jemalloc_ep-prefix/src 
-isystem mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.6/include 
-isystem googletest_ep-prefix/include -isystem /arrow/cpp/thirdparty/hadoop/include 
-Wno-noexcept-type -Wno-subobject-linkage -fdiagnostics-color=always -ggdb -O0 
-Wall -Wno-conversion -Wno-deprecated-declarations -Wno-sign-conversion 
-Wno-unused-variable -Werror -fno-semantic-interposition -march=armv8-a -fPIC 
-g -fPIC -std=c++11 -MD -MT src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx.o 
-MF src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx.o.d 
-o src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx.o 
-c src/plasma/CMakeFiles/plasma_objlib.dir/Unity/unity_0_cxx.cxx
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

No output has been received in the last 10m0s, this potentially indicates a 
stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: 
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received



Re: [VOTE] Release Apache Arrow 3.0.0 - RC2

2021-01-19 Thread Ying Zhou
Hi,

My local version of Snappy is in /usr/local/lib as libsnappy.1.dylib. My 
DYLD_LIBRARY_PATH does include /usr/local/lib.

> On Jan 19, 2021, at 7:57 PM, Sutou Kouhei  wrote:
> 
> Hi,
> 
>> dyld: Library not loaded: @rpath/libsnappy.1.dylib
>>  Referenced from: 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/install/lib/libarrow.300.0.0.dylib
>>  Reason: image not found
> 
> How did you install your Snappy?
> 
> 
> Thanks,
> --
> kou
> 
> In 
>  "Re: [VOTE] Release Apache Arrow 3.0.0 - RC2" on Tue, 19 Jan 2021 19:26:12 
> -0500,
>  Ying Zhou  wrote:
> 
>> There are definitely dependency issues in at least GLib. I'm going to turn 
>> off GLib and see whether other issues exist.
>> 
>> + make -j4
>> /Library/Developer/CommandLineTools/usr/bin/make  all-recursive
>> Making all in arrow-glib
>>  GEN  stamp-enums.h
>>  GEN  stamp-enums.c
>> touch stamp-enums.c
>> touch stamp-enums.h
>> /Library/Developer/CommandLineTools/usr/bin/make  all-am
>>  CXX  libarrow_glib_la-array-builder.lo
>>  CXX  libarrow_glib_la-basic-array.lo
>>  CXX  libarrow_glib_la-basic-data-type.lo
>>  CXX  libarrow_glib_la-buffer.lo
>>  CXX  libarrow_glib_la-chunked-array.lo
>>  CXX  libarrow_glib_la-codec.lo
>>  CXX  libarrow_glib_la-composite-array.lo
>>  CXX  libarrow_glib_la-composite-data-type.lo
>>  CXX  libarrow_glib_la-datum.lo
>>  CXX  libarrow_glib_la-decimal.lo
>>  CXX  libarrow_glib_la-error.lo
>>  CXX  libarrow_glib_la-field.lo
>>  CXX  libarrow_glib_la-record-batch.lo
>>  CXX  libarrow_glib_la-schema.lo
>>  CXX  libarrow_glib_la-table.lo
>>  CXX  libarrow_glib_la-table-builder.lo
>>  CXX  libarrow_glib_la-tensor.lo
>>  CXX  libarrow_glib_la-type.lo
>>  CC   enums.lo
>>  CXX  libarrow_glib_la-file.lo
>>  CXX  libarrow_glib_la-file-mode.lo
>>  CXX  libarrow_glib_la-input-stream.lo
>>  CXX  libarrow_glib_la-output-stream.lo
>>  CXX  libarrow_glib_la-readable.lo
>>  CXX  libarrow_glib_la-writable.lo
>>  CXX  libarrow_glib_la-writable-file.lo
>>  CXX  libarrow_glib_la-ipc-options.lo
>>  CXX  libarrow_glib_la-metadata-version.lo
>>  CXX  libarrow_glib_la-reader.lo
>>  CXX  libarrow_glib_la-writer.lo
>>  CXX  libarrow_glib_la-compute.lo
>>  CXX  libarrow_glib_la-file-system.lo
>>  CXX  libarrow_glib_la-local-file-system.lo
>>  CXX  libarrow_glib_la-orc-file-reader.lo
>>  CXXLDlibarrow-glib.la
>>  GISCAN   Arrow-1.0.gir
>> dyld: Library not loaded: @rpath/libsnappy.1.dylib
>>  Referenced from: 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/install/lib/libarrow.300.0.0.dylib
>>  Reason: image not found
>> Command 
>> '['/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/Arrow-1.0',
>>  
>> '--introspect-dump=/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/functions.txt,/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/dump.xml']'
>>  died with .
>> make[3]: *** [Arrow-1.0.gir] Error 1
>> make[2]: *** [all] Error 2
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all] Error 2
>> + cleanup
>> + '[' no = yes ']'
>> + echo 'Failed to verify release candidate. See 
>> /var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD 
>> for details.'
>> Failed to verify release candidate. See 
>> /var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD 
>> for details.
>> 
>> 
>>> On Jan 18, 2021, at 10:49 PM, Krisztián Szűcs  wrote:
>>> 
>>> Hi,
>>> 
>>> I would like to propose the following release candidate (RC2) of Apache
>>> Arrow version 3.0.0. This is a release consisting of 678
>>> resolved JIRA issues[1].
>>> 
>>> This release candidate is based on commit:
>>> d613aa68789288d3503dfbd8376a41f2d28b6c9d [2]
>>> 
>>> The source release rc2 is hosted at [3].
>>> The binary artifacts are hosted at [4][5][6][7].
>>> The changelog is located at [8].
>>> 
>>> Please download, verify checksums and signatures, run the unit tests,
>>> and vote on the relea

Re: [VOTE] Release Apache Arrow 3.0.0 - RC2

2021-01-19 Thread Ying Zhou
There are definitely dependency issues in at least GLib. I'm going to turn off 
GLib and see whether other issues exist.

+ make -j4
/Library/Developer/CommandLineTools/usr/bin/make  all-recursive
Making all in arrow-glib
  GEN  stamp-enums.h
  GEN  stamp-enums.c
touch stamp-enums.c
touch stamp-enums.h
/Library/Developer/CommandLineTools/usr/bin/make  all-am
  CXX  libarrow_glib_la-array-builder.lo
  CXX  libarrow_glib_la-basic-array.lo
  CXX  libarrow_glib_la-basic-data-type.lo
  CXX  libarrow_glib_la-buffer.lo
  CXX  libarrow_glib_la-chunked-array.lo
  CXX  libarrow_glib_la-codec.lo
  CXX  libarrow_glib_la-composite-array.lo
  CXX  libarrow_glib_la-composite-data-type.lo
  CXX  libarrow_glib_la-datum.lo
  CXX  libarrow_glib_la-decimal.lo
  CXX  libarrow_glib_la-error.lo
  CXX  libarrow_glib_la-field.lo
  CXX  libarrow_glib_la-record-batch.lo
  CXX  libarrow_glib_la-schema.lo
  CXX  libarrow_glib_la-table.lo
  CXX  libarrow_glib_la-table-builder.lo
  CXX  libarrow_glib_la-tensor.lo
  CXX  libarrow_glib_la-type.lo
  CC   enums.lo
  CXX  libarrow_glib_la-file.lo
  CXX  libarrow_glib_la-file-mode.lo
  CXX  libarrow_glib_la-input-stream.lo
  CXX  libarrow_glib_la-output-stream.lo
  CXX  libarrow_glib_la-readable.lo
  CXX  libarrow_glib_la-writable.lo
  CXX  libarrow_glib_la-writable-file.lo
  CXX  libarrow_glib_la-ipc-options.lo
  CXX  libarrow_glib_la-metadata-version.lo
  CXX  libarrow_glib_la-reader.lo
  CXX  libarrow_glib_la-writer.lo
  CXX  libarrow_glib_la-compute.lo
  CXX  libarrow_glib_la-file-system.lo
  CXX  libarrow_glib_la-local-file-system.lo
  CXX  libarrow_glib_la-orc-file-reader.lo
  CXXLDlibarrow-glib.la
  GISCAN   Arrow-1.0.gir
dyld: Library not loaded: @rpath/libsnappy.1.dylib
  Referenced from: 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/install/lib/libarrow.300.0.0.dylib
  Reason: image not found
Command 
'['/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/Arrow-1.0',
 
'--introspect-dump=/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/functions.txt,/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD/apache-arrow-3.0.0/c_glib/arrow-glib/tmp-introspect4jf3d1ar/dump.xml']'
 died with .
make[3]: *** [Arrow-1.0.gir] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
+ cleanup
+ '[' no = yes ']'
+ echo 'Failed to verify release candidate. See 
/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD for 
details.'
Failed to verify release candidate. See 
/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.00Y3BiKD for 
details.


> On Jan 18, 2021, at 10:49 PM, Krisztián Szűcs  wrote:
> 
> Hi,
> 
> I would like to propose the following release candidate (RC2) of Apache
> Arrow version 3.0.0. This is a release consisting of 678
> resolved JIRA issues[1].
> 
> This release candidate is based on commit:
> d613aa68789288d3503dfbd8376a41f2d28b6c9d [2]
> 
> The source release rc2 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 3.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 3.0.0 because...
> 
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%203.0.0
> [2]: 
> https://github.com/apache/arrow/tree/d613aa68789288d3503dfbd8376a41f2d28b6c9d
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-3.0.0-rc2
> [4]: https://bintray.com/apache/arrow/centos-rc/3.0.0-rc2
> [5]: https://bintray.com/apache/arrow/debian-rc/3.0.0-rc2
> [6]: https://bintray.com/apache/arrow/python-rc/3.0.0-rc2
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/3.0.0-rc2
> [8]: 
> https://github.com/apache/arrow/blob/d613aa68789288d3503dfbd8376a41f2d28b6c9d/CHANGELOG.md
> [9]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-19 Thread Ying Zhou
Oh I see. Yup.

Thanks,
Ying

> On Jan 19, 2021, at 7:56 AM, Antoine Pitrou  wrote:
> 
> 
> Plasma is deprecated and unmaintained, I don't think we should hold the
> release for that.
> 
> Regards
> 
> Antoine.
> 
> 
> Le 19/01/2021 à 13:21, Ying Zhou a écrit :
>> Yup this can be modified. Now for RC2 we have a new error on macOS Catalina.
>> 
>> pyarrow/tests/test_plasma.py::TestPlasmaClient::test_use_full_memory Fatal 
>> Python error: Aborted
>> 
>> Here is what errored out:
>> 
>>for _ in range(100):
>>create_object(
>>self.plasma_client2,
>>np.random.randint(1, DEFAULT_PLASMA_STORE_MEMORY // 20), 0)
>> 
>> Looks like it has trouble creating some of the small objects.
>> 
>> Current thread 0x0001172f0dc0 (most recent call first):
>>  File 
>> "/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/apache-arrow-3.0.0/python/pyarrow/tests/test_plasma.py",
>>  line 70 in create_object_with_id
>>  File 
>> "/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/apache-arrow-3.0.0/python/pyarrow/tests/test_plasma.py",
>>  line 81 in create_object
>>  File 
>> "/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/apache-arrow-3.0.0/python/pyarrow/tests/test_plasma.py",
>>  line 826 in test_use_full_memory
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/python.py",
>>  line 183 in pytest_pyfunc_call
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/callers.py",
>>  line 187 in _multicall
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
>>  line 87 in 
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
>>  line 93 in _hookexec
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/hooks.py",
>>  line 286 in __call__
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/python.py",
>>  line 1641 in runtest
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/runner.py",
>>  line 162 in pytest_runtest_call
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/callers.py",
>>  line 187 in _multicall
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
>>  line 87 in 
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
>>  line 93 in _hookexec
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/hooks.py",
>>  line 286 in __call__
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/runner.py",
>>  line 255 in 
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/runner.py",
>>  line 311 in from_call
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/runner.py",
>>  line 255 in call_runtest_hook
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/runner.py",
>>  line 215 in call_and_report
>>  File 
>> "/var/folders/yb/dc13kd1552vc_x61qzpgmtj

Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-19 Thread Ying Zhou
 line 93 in _hookexec
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/hooks.py",
 line 286 in __call__
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/main.py",
 line 348 in pytest_runtestloop
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in 
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
 line 93 in _hookexec
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/hooks.py",
 line 286 in __call__
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/main.py",
 line 323 in _main
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/main.py",
 line 269 in wrap_session
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/main.py",
 line 316 in pytest_cmdline_main
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in 
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/manager.py",
 line 93 in _hookexec
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/pluggy/hooks.py",
 line 286 in __call__
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/config/__init__.py",
 line 163 in main
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/lib/python3.6/site-packages/_pytest/config/__init__.py",
 line 185 in console_main
  File 
"/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.91YELrHr/test-miniconda/envs/arrow-test/bin/pytest",
 line 8 in 
dev/release/verify-release-candidate.sh: line 358: 26910 Abort trap: 6  
 pytest pyarrow -v --pdb



> On Jan 18, 2021, at 9:58 PM, Sutou Kouhei  wrote:
> 
> Hi,
> 
> It's failed by the following error:
> 
>> E              OSError: [Errno 24] Too many open files
> 
> You need to increase the max number of files you can
> open in a process. But I don't know how to do it on macOS...
> (We can do it by /etc/security/limits.d/ on Linux.)
> 
> 
> Thanks,
> --
> kou
> 
> 
> In 
>  "Re: [VOTE] Release Apache Arrow 3.0.0 - RC0" on Mon, 18 Jan 2021 20:50:34 
> -0600,
>  Wes McKinney  wrote:
> 
>> The plasma executable is failing to start for some reason, but that
>> function should not fail in that way so please open a Jira. I don't
>> think this is a blocking bug; if you'd like to verify without Plasma
>> you can disable it in the verification script.
>> 
>> On Mon, Jan 18, 2021 at 8:46 PM Ying Zhou  wrote:
>>> 
>>> Hi,
>>> 
>>> Thanks for helping me with the verification and fixing errors!
>>> 
>>> Yes, export MACOSX_DEPLOYMENT_TARGET=10.15 does work. However one of the 
>>> Python tests failed:
>>> 
>>> pyarrow/tests/test_plasma.py::TestPlasmaClient::test_subscribe_socket ERROR
>>> 
>>> Here is the traceback:
>>> 
>>> 
>>> Thanks,
>>> Ying
>>> 
>>> On Jan 18, 2021, at 7:18 PM, Sutou Kouhei  wrote:
>>> 
>>> Hi,
>>> 
>>> I do need to mention that it is not in master yet.
>>> 
>>> 
>>> https://github.com/apache/arrow/pull/9254
>>> 
>>> we failed the python one
>>> 
>>> 
>>>

Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-18 Thread Ying Zhou
Thanks! I accidentally allowed RC_NUM to be 0 when this error occurred. If it 
persists when RC_NUM=1 I will file a ticket.

Ying

> On Jan 18, 2021, at 9:50 PM, Wes McKinney  wrote:
> 
> The plasma executable is failing to start for some reason, but that
> function should not fail in that way so please open a Jira. I don't
> think this is a blocking bug; if you'd like to verify without Plasma
> you can disable it in the verification script.
> 
> On Mon, Jan 18, 2021 at 8:46 PM Ying Zhou  wrote:
>> 
>> Hi,
>> 
>> Thanks for helping me with the verification and fixing errors!
>> 
>> Yes, export MACOSX_DEPLOYMENT_TARGET=10.15 does work. However one of the 
>> Python tests failed:
>> 
>> pyarrow/tests/test_plasma.py::TestPlasmaClient::test_subscribe_socket ERROR
>> 
>> Here is the traceback:
>> 
>> 
>> Thanks,
>> Ying
>> 
>> On Jan 18, 2021, at 7:18 PM, Sutou Kouhei  wrote:
>> 
>> Hi,
>> 
>> I do need to mention that it is not in master yet.
>> 
>> 
>> https://github.com/apache/arrow/pull/9254
>> 
>> we failed the python one
>> 
>> 
>> Could you try with "export MACOSX_DEPLOYMENT_TARGET=10.15"?
>> 
>> 
>> Thanks,
>> --
>> kou
>> 
>> In <28ecb21c-9b2a-4961-b2b1-9dcd37b3d...@gmail.com>
>> "Re: [VOTE] Release Apache Arrow 3.0.0 - RC0" on Mon, 18 Jan 2021 04:11:01 
>> -0500,
>> Ying Zhou  wrote:
>> 
>> Thanks! This works. I do need to mention that the fix is not in master yet.
>> Moreover, after the C# test succeeded, the Python one failed on line 376
>> (from master).
>> 
>> Ying
>> 
>> + python setup.py build_ext --inplace
>> running build_ext
>> creating 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build
>> creating 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6
>> -- Running cmake for pyarrow
>> cmake 
>> -DPYTHON_EXECUTABLE=/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/test-miniconda/envs/arrow-test/bin/python
>>  
>> -DPython3_EXECUTABLE=/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/test-miniconda/envs/arrow-test/bin/python
>>   -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on 
>> -DPYARROW_BUILD_GANDIVA=on -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on 
>> -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off 
>> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
>> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
>> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
>> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python
>> -- The C compiler identification is AppleClang 12.0.0.1232
>> -- The CXX compiler identification is AppleClang 12.0.0.1232
>> -- Detecting C compiler ABI info
>> -- Detecting C compiler ABI info - done
>> -- Check for working C compiler: 
>> /Library/Developer/CommandLineTools/usr/bin/cc - skipped
>> -- Detecting C compile features
>> -- Detecting C compile features - done
>> -- Detecting CXX compiler ABI info
>> -- Detecting CXX compiler ABI info - failed
>> -- Check for working CXX compiler: 
>> /Library/Developer/CommandLineTools/usr/bin/c++
>> -- Check for working CXX compiler: 
>> /Library/Developer/CommandLineTools/usr/bin/c++ - broken
>> CMake Error at 
>> /usr/local/Cellar/cmake/3.18.4/share/cmake/Modules/CMakeTestCXXCompiler.cmake:59
>>  (message):
>> The C++ compiler
>> 
>>   "/Library/Developer/CommandLineTools/usr/bin/c++"
>> 
>> is not able to compile a simple test program.
>> 
>> It fails with the following output:
>> 
>>   Change Dir: 
>> /private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6/CMakeFiles/CMakeTmp
>> 
>>   Run Build Command(s):/usr/bin/make cmTC_0cfaa/fast && 
>> /Library/Developer/CommandLineTools/usr/bin/make  -f 
>> CMakeFiles/cmTC_0cfaa.dir/build.make CMakeFiles/cmTC_0cfaa.dir/build
>>   Building CXX object CMakeFiles/cmTC_0cfaa.dir/testCXXCompiler.cxx.o
>>   /Library/Developer/CommandLineTools/usr/bin/c++   -isysroot 
>> /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk 
>> -mmacosx-version-min=

Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-18 Thread Ying Zhou
/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.vXcdjnYa/test-miniconda/envs/arrow-test/lib/python3.6/contextlib.py:81: in __enter__
    return next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

plasma_store_memory = 1, use_valgrind = False, use_profiler = False, plasma_directory = None, use_hugepages = False, external_store = None

    @contextlib.contextmanager
    def start_plasma_store(plasma_store_memory,
                           use_valgrind=False, use_profiler=False,
                           plasma_directory=None, use_hugepages=False,
                           external_store=None):
        """Start a plasma store process.
        Args:
            plasma_store_memory (int): Capacity of the plasma store in bytes.
            use_valgrind (bool): True if the plasma store should be started inside
                of valgrind. If this is True, use_profiler must be False.
            use_profiler (bool): True if the plasma store should be started inside
                a profiler. If this is True, use_valgrind must be False.
            plasma_directory (str): Directory where plasma memory mapped files
                will be stored.
            use_hugepages (bool): True if the plasma store should use huge pages.
            external_store (str): External store to use for evicted objects.
        Return:
            A tuple of the name of the plasma store socket and the process ID of
            the plasma store process.
        """
        if use_valgrind and use_profiler:
            raise Exception("Cannot use valgrind and profiler at the same time.")

        tmpdir = tempfile.mkdtemp(prefix='test_plasma-')
        try:
            plasma_store_name = os.path.join(tmpdir, 'plasma.sock')
            plasma_store_executable = os.path.join(
                pa.__path__[0], "plasma-store-server")
            if not os.path.exists(plasma_store_executable):
                # Fallback to sys.prefix/bin/ (conda)
                plasma_store_executable = os.path.join(
                    sys.prefix, "bin", "plasma-store-server")
            command = [plasma_store_executable,
                       "-s", plasma_store_name,
                       "-m", str(plasma_store_memory)]
            if plasma_directory:
                command += ["-d", plasma_directory]
            if use_hugepages:
                command += ["-h"]
            if external_store is not None:
                command += ["-e", external_store]
            stdout_file = None
            stderr_file = None
            if use_valgrind:
                command = ["valgrind",
                           "--track-origins=yes",
                           "--leak-check=full",
                           "--show-leak-kinds=all",
                           "--leak-check-heuristics=stdstring",
                           "--error-exitcode=1"] + command
                proc = subprocess.Popen(command, stdout=stdout_file,
                                        stderr=stderr_file)
                time.sleep(1.0)
            elif use_profiler:
                command = ["valgrind", "--tool=callgrind"] + command
                proc = subprocess.Popen(command, stdout=stdout_file,
                                        stderr=stderr_file)
                time.sleep(1.0)
            else:
                proc = subprocess.Popen(command, stdout=stdout_file,
                                        stderr=stderr_file)
                time.sleep(0.1)
            rc = proc.poll()
            if rc is not None:
                raise RuntimeError("plasma_store exited unexpectedly with "
                                   "code %d" % (rc,))

            yield plasma_store_name, proc
        finally:
>           if proc.poll() is None:
E           UnboundLocalError: local variable 'proc' referenced before assignment

pyarrow/plasma.py:150: UnboundLocalError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>

Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-18 Thread Ying Zhou
Thanks! This works. I do need to mention that the fix is not in master yet. 
Moreover, after the C# test succeeded, the Python one failed on line 376 (from 
master).

Ying

+ python setup.py build_ext --inplace
running build_ext
creating 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build
creating 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6
-- Running cmake for pyarrow
cmake 
-DPYTHON_EXECUTABLE=/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/test-miniconda/envs/arrow-test/bin/python
 
-DPython3_EXECUTABLE=/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/test-miniconda/envs/arrow-test/bin/python
  -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=on 
-DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on -DPYARROW_BUILD_PARQUET=on 
-DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python
-- The C compiler identification is AppleClang 12.0.0.1232
-- The CXX compiler identification is AppleClang 12.0.0.1232
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc 
- skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: 
/Library/Developer/CommandLineTools/usr/bin/c++
-- Check for working CXX compiler: 
/Library/Developer/CommandLineTools/usr/bin/c++ - broken
CMake Error at 
/usr/local/Cellar/cmake/3.18.4/share/cmake/Modules/CMakeTestCXXCompiler.cmake:59
 (message):
  The C++ compiler

"/Library/Developer/CommandLineTools/usr/bin/c++"

  is not able to compile a simple test program.

  It fails with the following output:

Change Dir: 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/make cmTC_0cfaa/fast && 
/Library/Developer/CommandLineTools/usr/bin/make  -f 
CMakeFiles/cmTC_0cfaa.dir/build.make CMakeFiles/cmTC_0cfaa.dir/build
Building CXX object CMakeFiles/cmTC_0cfaa.dir/testCXXCompiler.cxx.o
/Library/Developer/CommandLineTools/usr/bin/c++   -isysroot 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk 
-mmacosx-version-min=10.7 -o CMakeFiles/cmTC_0cfaa.dir/testCXXCompiler.cxx.o -c 
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
clang: warning: include path for libstdc++ headers not found; pass 
'-stdlib=libc++' on the command line to use the libc++ standard library instead 
[-Wstdlibcxx-not-found]
Linking CXX executable cmTC_0cfaa
/usr/local/Cellar/cmake/3.18.4/bin/cmake -E cmake_link_script 
CMakeFiles/cmTC_0cfaa.dir/link.txt --verbose=1
/Library/Developer/CommandLineTools/usr/bin/c++  -isysroot 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk 
-mmacosx-version-min=10.7 -Wl,-search_paths_first 
-Wl,-headerpad_max_install_names 
CMakeFiles/cmTC_0cfaa.dir/testCXXCompiler.cxx.o -o cmTC_0cfaa 
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum 
deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[1]: *** [cmTC_0cfaa] Error 1
make: *** [cmTC_0cfaa/fast] Error 2



  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:22 (project)


-- Configuring incomplete, errors occurred!
See also 
"/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6/CMakeFiles/CMakeOutput.log".
See also 
"/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g/apache-arrow-3.0.0/python/build/temp.macosx-10.7-x86_64-3.6/CMakeFiles/CMakeError.log".
error: command 'cmake' failed with exit status 1
+ cleanup
+ '[' no = yes ']'
+ echo 'Failed to verify release candidate. See 
/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g for 
details.'
Failed to verify release candidate. See 
/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.q4WEAR4g for 
details.
(pyarrow-dev) karlkatzen@x86_64-apple-darwin13 arrow % 


> On Jan 18, 2021, 

Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-15 Thread Ying Zhou
Hi,

This is what happens when I’m following the procedure on my macOS 10.15. Is it 
because of some environmental issue? Is it because the verification failed? 
Thanks!

100  161M  100  161M0 0  13.3M  0  0:00:12  0:00:12 --:--:--  9.8M
+ 
PATH=/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin:/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/test-miniconda/envs/arrow-test/bin:/Users/karlkatzen/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
+ dotnet test

Welcome to .NET Core!
-
Learn more about .NET Core: https://aka.ms/dotnet-docs
Use 'dotnet --help' to see available commands or visit: 
https://aka.ms/dotnet-cli-docs

Telemetry
-
The .NET Core tools collect usage data in order to help us improve your 
experience. The data is anonymous and doesn't include command-line arguments. 
The data is collected by Microsoft and shared with the community. You can 
opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment 
variable to '1' or 'true' using your favorite shell.

Read more about .NET Core CLI Tools telemetry: 
https://aka.ms/dotnet-cli-telemetry

Configuring...
--
A command is running to populate your local package cache to improve restore 
speed and enable offline access. This command takes up to one minute to 
complete and only runs once.
Decompressing 100% 11001 ms
Expanding 100% 61916 ms

ASP.NET Core

Successfully installed the ASP.NET Core HTTPS Development Certificate.
To trust the certificate run 'dotnet dev-certs https --trust' (Windows and 
macOS only). For establishing trust on other platforms refer to the platform 
specific documentation.
For more information on configuring HTTPS see 
https://go.microsoft.com/fwlink/?linkid=848054.
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(137,5):
 error NETSDK1045: The current .NET SDK does not support targeting .NET Core 
3.1.  Either target .NET Core 2.2 or lower, or use a version of the .NET SDK 
that supports .NET Core 3.1. 
[/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj]
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(150,5):
 error NETSDK1045: The current .NET SDK does not support targeting .NET 
Standard 2.1.  Either target .NET Standard 2.0 or lower, or use a version of 
the .NET SDK that supports .NET Standard 2.1. 
[/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj]
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(137,5):
 error NETSDK1045: The current .NET SDK does not support targeting .NET Core 
3.1.  Either target .NET Core 2.2 or lower, or use a version of the .NET SDK 
that supports .NET Core 3.1. 
[/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj]
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(137,5):
 error NETSDK1045: The current .NET SDK does not support targeting .NET Core 
3.1.  Either target .NET Core 2.2 or lower, or use a version of the .NET SDK 
that supports .NET Core 3.1. 
[/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj]
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(137,5):
 error NETSDK1045: The current .NET SDK does not support targeting .NET Core 
3.1.  Either target .NET Core 2.2 or lower, or use a version of the .NET SDK 
that supports .NET Core 3.1. 
[/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj]
/private/var/folders/yb/dc13kd1552vc_x61qzpgmtjhgn/T/arrow-3.0.0.X.e7iWLBfG/apache-arrow-3.0.0/csharp/bin/sdk/2.2.300/Sdks/Microsoft.NET.Sdk/targets/Microsoft.NET.TargetFrameworkInference.targets(137,5):
 error NETSDK1045: The current .NET SDK does not support targeting 

Re: [C++] Shall we modify the ORC reader?

2021-01-14 Thread Ying Zhou
Well, I haven't found any. Thankfully ORC does work and I can figure out how it 
works by testing with simple examples. However I have never managed to contact 
the ORC community at all. They have never responded to any of my emails to 
d...@orc.apache.org <mailto:d...@orc.apache.org> I do want to add Snappy write 
support (which was actually already done 2 years ago by someone else but due to 
lack of unit testing it was never merged into master. I can write the tests.) 
and maybe Decimal256 to ORC C++ if they are willing to review and merge them. 
If anyone has successfully contacted the ORC community please let me know how.

Best,
Ying

> On Jan 14, 2021, at 8:39 AM, Antoine Pitrou  wrote:
> 
> 
> Hi Ying,
> 
> Is there a semantic description of the ORC data types somewhere?
> I've read through https://orc.apache.org/docs/types.html and
> https://orc.apache.org/specification/ORCv1/ but those docs don't seem
> to explain the intent and constraints of each of the data types.
> 
> Regards
> 
> Antoine.
> 
> 
> 
> 
> On Mon, 11 Jan 2021 21:15:05 -0500
> Ying Zhou  wrote:
>> Thanks! What about 3? 
>> Shall we convert ORC maps to Arrow maps as opposed to lists of structs with 
>> fields of the structs named ‘key’ and ‘value’?
>> 
>> 
>> 
>>> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau  wrote:
>>> 
>>> I don't think 1 & 2 make sense. I don't think there are a lot of users
>>> reading 2gb strings or lists with 2B objects in them. Saying we just don't
>>> support that pattern seems fine for now. I also believe the string and list
>>> types have better cross-language support than the large variants.
>>> 
>>> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> While finishing the ORC writer in C++ I found that the ORC reader treats
>>>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>>>> https://issues.apache.org/jira/browse/ARROW-7 <
>>>> https://issues.apache.org/jira/browse/ARROW-7>
>>>> 
>>>> After starting to work on ORC tickets mostly filed by myself I began to
>>>> worry that the type mappings in the ORC reader might already be used by
>>>> users of Arrow. I wonder whether we should grandfather the issues or
>>>> gradually switch to a new type mapping.
>>>> 
>>>> Here are my proposed changes:
>>>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
>>>> instead of STRING type since it is large.
>>>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>>>> instead of LIST type since it is large.
>>>> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
>>>> list of structs with hardcoded field names as long as
>>>> the offsets fit into int32. Otherwise we shouldn't return OK.
>>>> 
>>>> Thanks,
>>>> Ying  
>> 
>> 
> 
> 
> 



When will my PR be available in a release?

2021-01-13 Thread Ying Zhou
Hi,

I have implemented the ORC writer in C++ and Python here: 
https://github.com/apache/arrow/pull/8648/ 


I'd like to know when it will be available in a release so that I can file a 
related PR in Pandas to use my ORC writer. Since it hasn't been reviewed yet it 
might be too late for Arrow 3.0.0. Is the next release 4.0.0 in April? By when 
should I reasonably have all the enhancements, e.g. the ORC-related Jira 
tickets I assigned to myself, ready before the release? Thanks!

Ying

Re: [C++] Shall we modify the ORC reader?

2021-01-11 Thread Ying Zhou
Thanks! What about 3? 
Shall we convert ORC maps to Arrow maps as opposed to lists of structs with 
fields of the structs named ‘key’ and ‘value’?



> On Jan 10, 2021, at 6:45 PM, Jacques Nadeau  wrote:
> 
> I don't think 1 & 2 make sense. I don't think there are a lot of users
> reading 2gb strings or lists with 2B objects in them. Saying we just don't
> support that pattern seems fine for now. I also believe the string and list
> types have better cross-language support than the large variants.
> 
> On Sun, Jan 10, 2021 at 8:49 AM Ying Zhou  wrote:
> 
>> Hi,
>> 
>> While finishing the ORC writer in C++ I found that the ORC reader treats
>> certain types in rather awkward ways. Hence I filed this Jira ticket:
>> https://issues.apache.org/jira/browse/ARROW-7 <
>> https://issues.apache.org/jira/browse/ARROW-7>
>> 
>> After starting to work on ORC tickets mostly filed by myself I began to
>> worry that the type mappings in the ORC reader might already be used by
>> users of Arrow. I wonder whether we should grandfather the issues or
>> gradually switch to a new type mapping.
>> 
>> Here are my proposed changes:
>> 1. The ORC STRING type should be converted to the Arrow LARGE_STRING type
>> instead of STRING type since it is large.
>> 2. The ORC LIST type should be converted to the Arrow LARGE_LIST type
>> instead of LIST type since it is large.
>> 3. The ORC MAP type should be converted to the Arrow MAP type instead of
>> list of structs with hardcoded field names as long as
>> the offsets fit into int32. Otherwise we shouldn't return OK.
>> 
>> Thanks,
>> Ying



[C++] Shall we modify the ORC reader?

2021-01-10 Thread Ying Zhou
Hi,

While finishing the ORC writer in C++ I found that the ORC reader treats 
certain types in rather awkward ways. Hence I filed this Jira ticket: 
https://issues.apache.org/jira/browse/ARROW-7 


After starting to work on ORC tickets mostly filed by myself I began to worry 
that the type mappings in the ORC reader might already be used by users of 
Arrow. I wonder whether we should grandfather the issues or gradually switch to 
a new type mapping.

Here are my proposed changes:
1. The ORC STRING type should be converted to the Arrow LARGE_STRING type 
instead of STRING type since it is large.
2. The ORC LIST type should be converted to the Arrow LARGE_LIST type instead 
of LIST type since it is large.
3. The ORC MAP type should be converted to the Arrow MAP type instead of list 
of structs with hardcoded field names as long as 
the offsets fit into int32. Otherwise we shouldn't return OK.
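
For 3, the guard I have in mind is a simple overflow check along these lines 
(a sketch; the function name is hypothetical):

#include <cstdint>
#include <limits>

#include <arrow/status.h>

// Refuse to build an Arrow MAP when the ORC map's element count no longer
// fits the int32 offsets that the Arrow MAP type uses.
arrow::Status CheckMapFitsInt32(int64_t num_elements) {
  if (num_elements > std::numeric_limits<int32_t>::max()) {
    return arrow::Status::Invalid(
        "ORC map has too many entries for the Arrow MAP type (int32 offsets)");
  }
  return arrow::Status::OK();
}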

Thanks,
Ying

[C++][CI] openssl not installed in AMD64 Windows 2019 C++

2021-01-07 Thread Ying Zhou
Hi,

Thanks Neal for fixing the numpy blocker (ARROW-11152)! It seems that there is 
another weird recent dependency error here: 
https://github.com/apache/arrow/pull/8648/checks?check_run_id=1660537561 
 
Do we know what’s going on? Thanks! Here is what went wrong:


Run choco install -y --no-progress openssl
Chocolatey v0.10.15
Installing the following packages:
openssl
By installing you accept licenses for the packages.
openssl not installed. An error occurred during installation:
 The remote server returned an error: (525) Origin SSL Handshake Error. Origin SSL Handshake Error
openssl package files install completed. Performing other installation steps.
The install of openssl was NOT successful.
openssl not installed. An error occurred during installation:
 The remote server returned an error: (525) Origin SSL Handshake Error. Origin SSL Handshake Error

Chocolatey installed 0/1 packages. 1 packages failed.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

Failures
 - openssl (exited 1) - openssl not installed. An error occurred during installation:
 The remote server returned an error: (525) Origin SSL Handshake Error. Origin SSL Handshake Error
Error: Process completed with exit code 1.

Re: Github Actions feedback time

2021-01-06 Thread Ying Zhou
Hi,

Sorry for not noticing this thread earlier. Looks like in addition to the 
unusually slow feedback times, which did not happen last Sunday or earlier, 
there are also weird installation errors such as 'can not install numpy'. Can 
these be due to some form of timeout?

Here is my C++ PR:
https://github.com/apache/arrow/pull/8648/checks

Ying

> On Jan 5, 2021, at 7:33 AM, Krisztián Szűcs  wrote:
> 
> Hi,
> 
> I'm concerned about the overall feedback time we have on pull requests.
> I have a simple PR to make the comment bot working again, but no
> builds are running even after 30 minutes.
> I can see 2-4 running builds, which will make our work much harder
> right before the release.
> 
> I wasn't following the build queue's state lately, but I think we
> should consolidate the build configurations.
> Possible candidates are the PR* workflows and good to have tests which
> we could trigger on master instead.
> 
> Opinions?
> 
> Regards, Krisztian



[C++] Weird Rust Linter error in CICD & Float/Double equality

2020-12-30 Thread Ying Zhou
Hi,

When finalizing my Arrow2ORC C++ pull request I found a weird Rust-related and 
IPC-related error in the linter that didn't happen just 2 days ago, despite my 
code having nothing to do with either Rust or IPC. Here is the check: 
https://github.com/apache/arrow/pull/8648/checks?check_run_id=1625423680
Here is the part of the output I think is relevant:

source_dir /arrow/cpp/src --quiet
[1/1] cd /tmp/arrow-lint-_9i678pu/cpp-build && /usr/local/bin/python /arrow/cpp/build-support/lint_cpp_cli.py /arrow/cpp/src
INFO:archery:Running Python formatter (autopep8)
INFO:archery:Running Python linter (flake8)
INFO:archery:Running cmake-format linters
WARNING:archery:run-cmake-format modifies files, regardless of --fix
INFO:archery:Running apache-rat linter
INFO:archery:Running R linter
INFO:archery:Running Rust linter
Diff in /arrow/rust/arrow/src/ipc/reader.rs at line 160:
         let null_count = struct_node.null_count() as usize;
         let struct_array = if null_count > 0 {
             // create struct array from fields, arrays and null data
-            StructArray::from((
-                struct_arrays,
-                null_buffer,
-            ))
+            StructArray::from((struct_arrays, null_buffer))
         } else {
             StructArray::from(struct_arrays)
         };
INFO:archery:Running Docker linter
Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-lint` exited with a non-zero exit code 1, see the process log above.

There is no C++ or even remotely C++-related error. Does anyone know why this 
error happens?

I would also like to ask about table equality when some columns are 
float/double. In this case do we have some built-in epsilon so that 
comparisons like 0.6 == 0.6 work? Right now I have separate tests for these 
types that look like the following, and they are pretty clumsy:

EXPECT_TRUE(outputTable->schema()->Equals(*(table->schema())));
EXPECT_TRUE(outputTable->column(0)
                ->chunk(0)
                ->Slice(0, numRows / 2)
                ->ApproxEquals(table->column(0)->chunk(1)));
EXPECT_TRUE(outputTable->column(0)
                ->chunk(0)
                ->Slice(numRows / 2, numRows / 2)
                ->ApproxEquals(table->column(0)->chunk(3)));
EXPECT_TRUE(outputTable->column(1)
                ->chunk(0)
                ->Slice(0, numRows / 2)
                ->ApproxEquals(table->column(1)->chunk(1)));
EXPECT_TRUE(outputTable->column(1)
                ->chunk(0)
                ->Slice(numRows / 2, numRows / 2)
                ->ApproxEquals(table->column(1)->chunk(3)));
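
For reference, something like the following helper would tighten these up (a 
sketch, assuming Array::ApproxEquals accepts an arrow::EqualOptions with a 
configurable atol, as declared in arrow/compare.h):

#include <memory>

#include "arrow/api.h"
#include "gtest/gtest.h"

// Hypothetical helper: compare one slice of the round-tripped column against
// one chunk of the input column, with an explicit absolute tolerance.
void ExpectSliceApproxEquals(const std::shared_ptr<arrow::Table>& out,
                             const std::shared_ptr<arrow::Table>& in, int col,
                             int64_t offset, int64_t length, int chunk) {
  auto options = arrow::EqualOptions::Defaults().atol(1e-6);
  EXPECT_TRUE(out->column(col)->chunk(0)->Slice(offset, length)->ApproxEquals(
      *in->column(col)->chunk(chunk), options));
}

Each check above would then collapse to a call like 
ExpectSliceApproxEquals(outputTable, table, 0, 0, numRows / 2, 1);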

Thanks,
Ying

[C++] Includes and failing checks in Python and C Glib & Ruby

2020-12-18 Thread Ying Zhou
Hi,

As I try to finalize this pull request 
(https://github.com/apache/arrow/pull/8648) I found that a single necessary 
ORC include (for liborc::WriterOptions) in arrow/adapters/orc/adapter.h broke 
one Python check and two C GLib & Ruby checks. Since there is nothing 
inherently wrong with exposing liborc::WriterOptions in 
arrow/adapters/orc/adapter.h (users who write Arrow Tables to ORC files need 
a way to specify ORC writer options), there seem to be three paths forward:

1. Include “orc/OrcFile.hh” in a way that does not trip the three checks, or 
avoid the include in the public header entirely (see the forward-declaration 
sketch after this list).
2. Make changes to Arrow C Glib & Python so that the checks recognize the 
inclusion.
3. A combination of 1 and 2.
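
For example, one way to pursue option 1 without the include at all is a 
forward declaration, so adapter.h itself pulls in no ORC header (a sketch 
with a hypothetical signature, not the PR's actual API):

// adapter.h (sketch): only a reference to liborc::WriterOptions appears in
// the declarations, so "orc/OrcFile.hh" is needed in adapter.cc alone.
#include "arrow/io/interfaces.h"
#include "arrow/status.h"
#include "arrow/table.h"

namespace orc {
class WriterOptions;  // forward declaration; full definition in adapter.cc
}

namespace arrow {
namespace adapters {
namespace orc {

// Hypothetical entry point for writing a table with ORC writer options.
Status WriteTable(const Table& table, const ::orc::WriterOptions& options,
                  io::OutputStream* output);

}  // namespace orc
}  // namespace adapters
}  // namespace arrow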

What’s the best approach here?

Thanks,
Ying

[C++] Are stream adapters necessary for the Arrow2ORC adapter?

2020-12-12 Thread Ying Zhou
Hi,

As the developer testing the APIs in the Arrow2ORC adapter, I have a question 
about whether I must take Arrow I/O interfaces as parameters. Are we not 
supposed to take the path of the output file directly and open it with an ORC 
function? If we do need to use the classes in arrow/io exclusively to open 
files, then, given how the Parquet integration and the ORC2Arrow adapter 
work, it seems I should wrap arrow::io::OutputStream in an implementation of 
orc::OutputStream. Is that one of the right ways to do it? Thanks!
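
For reference, here is a rough sketch of the wrapper I have in mind (assuming 
these are the virtual methods on orc::OutputStream in the ORC version we 
build against; the class name is mine):

#include <cstdint>
#include <memory>
#include <string>

#include "arrow/io/interfaces.h"
#include "orc/OrcFile.hh"

// Hypothetical adapter exposing an arrow::io::OutputStream as an
// orc::OutputStream so that liborc's writer can write into it.
class ArrowOutputStreamWrapper : public orc::OutputStream {
 public:
  explicit ArrowOutputStreamWrapper(
      std::shared_ptr<arrow::io::OutputStream> sink)
      : sink_(std::move(sink)) {}

  uint64_t getLength() const override {
    return static_cast<uint64_t>(sink_->Tell().ValueOr(0));
  }
  uint64_t getNaturalWriteSize() const override { return 128 * 1024; }
  void write(const void* buf, size_t length) override {
    // Ignoring the returned Status for brevity; real code should keep it
    // and surface it when the ORC writer closes.
    (void)sink_->Write(buf, static_cast<int64_t>(length));
  }
  const std::string& getName() const override { return name_; }
  void close() override { (void)sink_->Close(); }

 private:
  std::shared_ptr<arrow::io::OutputStream> sink_;
  std::string name_ = "arrow::io::OutputStream";
};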

Ying Zhou

[C++] Sparse Unions and CICD tests

2020-11-29 Thread Ying Zhou
Hi,

Many thanks for the help you have given me in the past! Tonight I would like 
to ask two questions.

First of all, it seems that in the C++ implementation of sparse unions it is 
possible to construct a union array of length 8 from two child arrays of 
length 4, with dense-union-like behavior. Is this intended? Can I throw an 
error in my Arrow2ORC API when a sparse union array has a child array of a 
different length?
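
For context, the check I have in mind is roughly the following (a sketch; I 
am assuming the field() accessor on UnionArray and num_fields() on DataType):

#include "arrow/api.h"

// Hypothetical validation: every child of a sparse union should have the
// same length as the union array itself.
arrow::Status CheckSparseUnionChildren(const arrow::UnionArray& array) {
  for (int i = 0; i < array.type()->num_fields(); i++) {
    if (array.field(i)->length() != array.length()) {
      return arrow::Status::Invalid("Sparse union child has mismatched length");
    }
  }
  return arrow::Status::OK();
}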

Secondly, I would like to know whether all CI/CD tests need to pass before a 
pull request can be merged. The C++ part of Arrow2ORC development 
(https://issues.apache.org/jira/browse/ARROW-3014) is almost over. I'm doing 
very thorough testing (by the way, I have fixed the long unit test problem I 
asked about some time ago) and will seek the merge fairly soon. Right now it 
seems that all the C++ and lint tests pass, but some Python and GLib tests 
have failed. All the Python and C GLib code I currently have is from master. 
Can I still ask for the PR merge if CI/CD tests unrelated to my C++ code fail?

Thanks,
Ying

Re: [C++] 0x00 in Binary type

2020-11-18 Thread Ying Zhou
Sure!

BinaryBuilder builder;
char d[] = "\x00\x01\xbf\x5b";
(void)(builder.Append(d));
std::shared_ptr<Array> array;
(void)(builder.Finish(&array));
int32_t dataLength = 0;
auto aarray = std::static_pointer_cast<BinaryArray>(array);
const uint8_t* data = aarray->GetValue(0, &dataLength);
data = aarray->GetValue(3, &dataLength);
RecordProperty("l3", dataLength);
RecordProperty("30", data[0]);
RecordProperty("31", data[1]);
RecordProperty("32", data[2]);
RecordProperty("33", data[3]);

We need Google Test to use RecordProperty. dataLength is 0 instead of 4 and 
data[i] are 255, 0, 0 and 0 respectively. 
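
For comparison, the explicit-length overload behaves as I expected, so my 
working assumption is that the const char* path measures the length with 
strlen, which stops at the first 0x00 (a sketch):

#include <memory>

#include "arrow/api.h"

arrow::Status AppendWithEmbeddedNul() {
  arrow::BinaryBuilder builder;
  const uint8_t d[] = {0x00, 0x01, 0xbf, 0x5b};
  // The length is passed explicitly, so the embedded 0x00 is preserved.
  ARROW_RETURN_NOT_OK(builder.Append(d, 4));
  std::shared_ptr<arrow::Array> array;
  return builder.Finish(&array);
}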

My JIRA ID is yingzhou474.


> On Nov 18, 2020, at 1:49 PM, Antoine Pitrou  wrote:
> 
> 
> Hello,
> 
> On 18/11/2020 at 19:06, Ying Zhou wrote:
>> 
>> According to the documentation BINARY is "Variable-length bytes (no 
>> guarantee of UTF8-ness)”. However in practice if I embed 0x00 in the middle 
>> of a char array and Append it to a BinaryBuilder the 0x00 is converted to 
>> 0xff, everything after it is not appended and the length is computed as if 
>> the 0x00 and everything after it don’t exist (i.e. standard STRING behavior).
> 
> Can you post some code showing how you build the array?
> 
>> P.S. Please allow me to assign Jira tickets to myself. Really thanks!
> 
> What is your JIRA id?
> 
> Regards
> 
> Antoine.



[C++] 0x00 in Binary type

2020-11-18 Thread Ying Zhou
Hello,

According to the documentation, BINARY is "Variable-length bytes (no 
guarantee of UTF8-ness)". However, in practice, if I embed 0x00 in the middle 
of a char array and Append it to a BinaryBuilder, the 0x00 is converted to 
0xff, everything after it is not appended, and the length is computed as if 
the 0x00 and everything after it don't exist (i.e. standard STRING behavior). 
I would like to know whether this is intended. If it is, then we should 
change the documentation to explicitly state that 0x00 is not allowed. 
Otherwise we need to change the implementation to allow it.

Thanks,
Ying

P.S. Please allow me to assign Jira tickets to myself. Many thanks!

Re: [ANNOUNCE] New Arrow committer: Andrew Lamb

2020-11-10 Thread Ying Zhou
Congratulations Andrew!

> On Nov 10, 2020, at 10:42 AM, Andy Grove  wrote:
> 
> On behalf of the Arrow PMC, I'm happy to announce that Andrew Lamb has
> accepted an invitation to become a committer on Apache Arrow.
> 
> Welcome, and thank you for your contributions!



[C++] Type_codes and child_ids for Unions & test time concerns

2020-11-08 Thread Ying Zhou
The work of converting Arrow Arrays, ChunkedArrays, RecordBatches and Tables to 
ORC files is about 50% done. Now I have two questions.

First of all I would like to ask why we use both type_codes and child_ids for 
Union types. It seems that we can already cover the logical types a union has 
using type_codes alone. What’s the point of using child_ids?
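
For context, here is the kind of case I am asking about (a sketch; the codes 
5 and 10 are made up, and I am assuming child_ids() is indexed by type code). 
Since type codes need not be the dense indices 0..n-1, some lookup from code 
to child index is needed somewhere, and I want to understand why that lookup 
is a separate child_ids vector:

#include "arrow/api.h"
#include "arrow/util/checked_cast.h"

void UnionCodesExample() {
  // A union whose type codes are the arbitrary int8 values 5 and 10,
  // while its children still live at indices 0 and 1.
  auto type = arrow::dense_union({arrow::field("ints", arrow::int64()),
                                  arrow::field("strs", arrow::utf8())},
                                 {5, 10});
  const auto& union_type =
      arrow::internal::checked_cast<const arrow::UnionType&>(*type);
  // child_ids() maps a type code back to a child index: 5 -> 0, 10 -> 1.
  int child_of_code_10 = union_type.child_ids()[10];
  (void)child_of_code_10;
}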

Secondly, I would like to ask about the maximum amount of time permitted for 
running unit tests. I will definitely profile and speed up my tests prior to 
the pull request, so I would like to know what the expectation is first.

Thanks,
Ying Zhou

[C++] Arrow debug with ORC & unittest can not be built

2020-10-24 Thread Ying Zhou
Hi,

I’m using the master version of Arrow. In order to test my Arrow2ORC feature 
I got a fresh copy of Arrow and tried to build it in debug mode. It turns out 
that one ORC dependency, libhdfspp_static.a, cannot be found, which makes 
linking arrow-orc-adapter-test impossible.

Here is my command:

cmake -DARROW_WITH_UTF8PROC=OFF -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON 
-DARROW_ORC=ON -DARROW_PYTHON=ON -DORC_ROOT=/usr/local 
-DOPENSSL_ROOT_DIR=/usr/local/opt/openssl ../..

Here is the error message I got:

[ 93%] Linking CXX executable ../../../../debug/arrow-orc-adapter-test
Undefined symbols for architecture x86_64:
  "hdfs::FileSystem::New(hdfs::IoService*&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, hdfs::Options const&)", referenced from:
      orc::HdfsFileInputStream::HdfsFileInputStream(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in liborc.a(OrcHdfsFile.cc.o)
  "hdfs::ConfigParser::LoadDefaultResources()", referenced from:
      orc::HdfsFileInputStream::HdfsFileInputStream(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in liborc.a(OrcHdfsFile.cc.o)
  "hdfs::ConfigParser::ConfigParser()", referenced from:
      orc::HdfsFileInputStream::HdfsFileInputStream(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in liborc.a(OrcHdfsFile.cc.o)
  "hdfs::ConfigParser::~ConfigParser()", referenced from:
      orc::HdfsFileInputStream::HdfsFileInputStream(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in liborc.a(OrcHdfsFile.cc.o)
  "hdfs::URI::parse_from_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)", referenced from:
…
…
  "hdfs::Status::ToString() const", referenced from:
      orc::HdfsFileInputStream::read(void*, unsigned long long, unsigned long long) in liborc.a(OrcHdfsFile.cc.o)
      orc::HdfsFileInputStream::HdfsFileInputStream(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in liborc.a(OrcHdfsFile.cc.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [debug/arrow-orc-adapter-test] Error 1
make[1]: *** [src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/all] Error 2
make: *** [all] Error 2

I would like to know whether it is currently impossible to make a debug build 
of Apache Arrow with ORC enabled, at least on macOS Catalina, without editing 
some CMakeLists files. If that is the case then we have a bug.
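
As a possible workaround, assuming the installed liborc was built with HDFS 
support and assuming ORC's BUILD_LIBHDFSPP CMake option (which I have not 
verified against this ORC version), rebuilding and reinstalling ORC without 
it might keep the hdfs:: symbols out of the link entirely:

cmake -DBUILD_LIBHDFSPP=OFF .. && make && make install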

Thanks,
Ying

Re: [ANNOUNCE] New Arrow PMC chair: Wes McKinney

2020-10-24 Thread Ying Zhou
Congratulations Wes! :)

Ying

> On Oct 23, 2020, at 7:35 PM, Jacques Nadeau  wrote:
> 
> I am pleased to announce that we have a new PMC chair and VP as per our
> newly started tradition of rotating the chair once a year. I have resigned
> and Wes was duly elected by the PMC and approved unanimously by the board.
> 
> Please join me in congratulating Wes!
> 
> Jacques



Re: [C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-20 Thread Ying Zhou
Many thanks!

After more experimentation with liborc::ColumnVectorBatch this morning I 
found that it is actually spaced, so there is no need to write another 
function to efficiently append “compressed” values. This also simplifies the 
Arrow2ORC adapter I’m working on.
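
Concretely, since the batch is already spaced, the direct call works (a 
minimal sketch with made-up buffers):

#include "arrow/api.h"

arrow::Status AppendSpaced() {
  arrow::Int64Builder builder;
  // Spaced data: slot 1 holds a placeholder value and is marked null
  // through valid_bytes, mirroring ORC's notNull buffer.
  const int64_t values[] = {1, 0, 3};
  const uint8_t valid_bytes[] = {1, 0, 1};
  return builder.AppendValues(values, /*length=*/3, valid_bytes);
}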

> On Oct 20, 2020, at 12:55 AM, Micah Kornfield  wrote:
> 
> For reference, that parquet uses to space out values is in rle_decoder.h
> [1].  This uses both BitBlockCounter and BitRunReader.  BitBlockCounter is
> faster than BitRunReader but on micro-benchmarks BitRunReader still
> provides some benefits assuming nulls are fairly infrequent.
> 
> It is worth noting that this code assumes preallocated arrays (i.e. it
> doesn't use builders).
> 
> [1]
> https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h
> 
> On Sun, Oct 18, 2020 at 10:35 AM Wes McKinney  wrote:
> 
>> hi Ying, the code in adapter_util.cc doesn't look right to me unless
>> the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
>> where there is a null). We have quite a bit of code in Parquet that
>> deals specifically with this issue -- I'm not sure if we have a
>> ready-made function that will efficiently append the "compressed"
>> value efficiently to a builder, but we certianly have all the tools
>> you need to do so (e.g. the BitRunReader is helpful here)
>> 
>> On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou  wrote:
>>> 
>>> Hi,
>>> 
>>> Unlike Arrow in ORC when an entry is null it is only recorded in the
>> PRESENT stream (equivalent to the validity bitmap in Arrow) but not in any
>> DATA stream for any type including numeric types. Hence the notNull (aka
>> PRESENT) and data buffers from ORC generally don’t have the same size.
>>> 
>>> However according to cpp/src/arrow/adaptes/orc/adapter_util.cc <
>> http://adapter_util.cc/> line 126 it is possible to directly use
>> AppendValues to call builder->AppendValues(source, length, valid_bytes)
>> with builder being an Int64Builder with source and valid_bytes having
>> different sizes which doesn’t seem to be reasonable. May I ask whether this
>> is actually valid usage of AppendValues? Thanks!
>>> 
>>> 
>>> Best,
>>> Ying Zhou
>> 



[C++] AppendValues for numeric types with invalid slots omitted from source

2020-10-18 Thread Ying Zhou
Hi,

Unlike in Arrow, in ORC a null entry is recorded only in the PRESENT stream 
(the equivalent of the validity bitmap in Arrow) and not in any DATA stream, 
for every type including numeric types. Hence the notNull (aka PRESENT) and 
data buffers from ORC generally don’t have the same size.

However, according to cpp/src/arrow/adapters/orc/adapter_util.cc line 126, it 
is possible to directly call builder->AppendValues(source, length, 
valid_bytes) with builder being an Int64Builder and with source and 
valid_bytes having different sizes, which doesn’t seem reasonable. May I ask 
whether this is actually valid usage of AppendValues? Thanks!


Best,
Ying Zhou

[C++] Arrow to ORC type conversion

2020-10-18 Thread Ying Zhou
Hi,

I’m developing the adapter that converts Arrow Arrays, ChunkedArrays, 
RecordBatches and Tables into ORC files, based on the ORC Specification and 
the Arrow Columnar Format.

Here is my current type mapping:

Type::type::NA -> nullptr
Type::type::BOOL -> liborc::TypeKind::BOOLEAN
Type::type::UINT8 -> liborc::TypeKind::BYTE
Type::type::INT8 -> liborc::TypeKind::BYTE
Type::type::UINT16 -> liborc::TypeKind::SHORT
Type::type::INT16 -> liborc::TypeKind::SHORT
Type::type::UINT32 -> liborc::TypeKind::INT
Type::type::INT32 -> liborc::TypeKind::INT
Type::type::INTERVAL_MONTH -> liborc::TypeKind::INT
Type::type::UINT64 -> liborc::TypeKind::LONG
Type::type::INT64 -> liborc::TypeKind::LONG
Type::type::INTERVAL_DAY_TIME -> liborc::TypeKind::LONG
Type::type::DURATION -> liborc::TypeKind::LONG
Type::type::HALF_FLOAT -> liborc::TypeKind::FLOAT
Type::type::FLOAT -> liborc::TypeKind::FLOAT
Type::type::DOUBLE -> liborc::TypeKind::DOUBLE
Type::type::STRING -> liborc::TypeKind::STRING
Type::type::LARGE_STRING -> liborc::TypeKind::STRING
Type::type::FIXED_SIZE_BINARY -> liborc::TypeKind::CHAR
Type::type::BINARY -> liborc::TypeKind::BINARY
Type::type::LARGE_BINARY -> liborc::TypeKind::BINARY
Type::type::DATE32 -> liborc::TypeKind::DATE
Type::type::TIMESTAMP -> liborc::TypeKind::TIMESTAMP
Type::type::TIME32 -> liborc::TypeKind::TIMESTAMP
Type::type::TIME64 -> liborc::TypeKind::TIMESTAMP
Type::type::DATE64 -> liborc::TypeKind::TIMESTAMP
Type::type::DECIMAL -> liborc::TypeKind::DECIMAL
Type::type::LIST -> liborc::TypeKind::LIST
Type::type::FIXED_SIZE_LIST -> liborc::TypeKind::LIST
Type::type::LARGE_LIST -> liborc::TypeKind::LIST
Type::type::STRUCT -> liborc::TypeKind::STRUCT
Type::type::MAP -> liborc::TypeKind::MAP
Type::type::DENSE_UNION -> liborc::TypeKind::UNION
Type::type::SPARSE_UNION -> liborc::TypeKind::UNION
Type::type::DICTIONARY -> the ORC version of its value type

There are some concerns, particularly around duration types, which don’t 
exist in Apache ORC and which I have to convert to integers. Is my current 
mapping reasonable? Thanks!
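
For concreteness, the core of this mapping would look roughly like the 
following (a sketch, not the final adapter code; only a few cases are shown):

#include <memory>

#include "arrow/type.h"
#include "orc/OrcFile.hh"

namespace liborc = orc;

// Hypothetical sketch of the scalar part of the mapping above.
std::unique_ptr<liborc::Type> ToOrcType(const arrow::DataType& type) {
  switch (type.id()) {
    case arrow::Type::BOOL:
      return liborc::createPrimitiveType(liborc::TypeKind::BOOLEAN);
    case arrow::Type::INT8:
    case arrow::Type::UINT8:
      return liborc::createPrimitiveType(liborc::TypeKind::BYTE);
    case arrow::Type::INT64:
    case arrow::Type::UINT64:
    case arrow::Type::DURATION:  // no ORC duration type, fall back to LONG
      return liborc::createPrimitiveType(liborc::TypeKind::LONG);
    case arrow::Type::STRING:
    case arrow::Type::LARGE_STRING:
      return liborc::createPrimitiveType(liborc::TypeKind::STRING);
    default:
      return nullptr;  // remaining cases elided in this sketch
  }
}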

Best,
Ying Zhou

ORC writer

2020-08-29 Thread Ying Zhou
Hi,

I’m interested in writing a binding so that we can write ORC files from 
Arrow. I should likely contribute mostly to 
https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc as 
well as edit the relevant Python/Cython files, right? Moreover, I would like 
to ask whether there is any existing branch with partly finished work on an 
ORC writer. Thanks!

Ying Zhou