[jira] [Commented] (ARROW-12122) [Python] Cannot install via pip. M1 mac

2021-03-30 Thread Bastien Boutonnet (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311257#comment-17311257
 ] 

Bastien Boutonnet commented on ARROW-12122:
---

[~kou] I see, that's a shame. Could I help in any way?

> [Python] Cannot install via pip. M1 mac
> ---
>
> Key: ARROW-12122
> URL: https://issues.apache.org/jira/browse/ARROW-12122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bastien Boutonnet
>Priority: Major
>
> When running {{pip install pyarrow --no-use-pep517}}, the build fails:
> {noformat}
> Collecting pyarrow
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
> Requirement already satisfied: numpy>=1.16.6 in 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/lib/python3.8/site-packages
>  (from pyarrow) (1.20.2)
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (setup.py) ... error
>  ERROR: Command errored out with exit status 1:
>  command: 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/bin/python
>  -u -c 'import sys, setuptools, tokenize; sys.argv[0] = 
> '"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';
>  
> __file__='"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';f=getattr(tokenize,
>  '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', 
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' 
> bdist_wheel -d 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-wheel-vpkwqzyi
>  cwd: 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/
>  Complete output (238 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.macosx-11.2-arm64-3.8
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cffi.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/dataset.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compute.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_ipc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_convert_builtin.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_misc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_gandiva.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/strategies.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_adhoc_memory_leak.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/arrow_7980.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/util.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_orc.py -> 
> build/lib

[jira] [Created] (ARROW-12148) [C++][FlightRPC] Add gRPC TLS benchmark

2021-03-30 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-12148:


 Summary: [C++][FlightRPC] Add gRPC TLS benchmark
 Key: ARROW-12148
 URL: https://issues.apache.org/jira/browse/ARROW-12148
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: Yibo Cai
Assignee: Yibo Cai


We have Flight RPC benchmarks for "grpc+tcp" and "grpc+unix" connections, but 
not for "grpc+tls".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

2021-03-30 Thread Yordan Pavlov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311291#comment-17311291
 ] 

Yordan Pavlov commented on ARROW-11897:
---

[~jorgecarleitao] I would be happy to have a chat in Slack, but it appears that 
an @apache.org email address is necessary to join and I don't have one.

Also, I noticed that in your parquet2 repo, a separate page iterator is created 
for each row group, very similar to how it works currently. I was planning to 
wrap multiple row group page iterators into a single iterator returning a 
sequence of pages from multiple row groups (see the code snippet in my previous 
comment).
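
A minimal sketch of that wrapping, using hypothetical `RowGroup` and `Page` types 
rather than the real parquet crate API (names and signatures here are assumptions, 
for illustration only):

{code:rust}
// Sketch only: `RowGroup` and `Page` stand in for the real parquet crate
// types, which have different names and signatures.
struct Page(Vec<u8>);

struct RowGroup {
    pages: Vec<Page>,
}

/// Chain the per-row-group page iterators into a single iterator that
/// yields pages from all row groups, one row group after another.
fn pages_across_row_groups(row_groups: Vec<RowGroup>) -> impl Iterator<Item = Page> {
    row_groups.into_iter().flat_map(|rg| rg.pages.into_iter())
}

fn main() {
    let row_groups = vec![
        RowGroup { pages: vec![Page(vec![1, 2]), Page(vec![3])] },
        RowGroup { pages: vec![Page(vec![4, 5, 6])] },
    ];
    // The consumer sees one continuous sequence of pages across row groups.
    assert_eq!(pages_across_row_groups(row_groups).count(), 3);
}
{code}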

> [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays
> --
>
> Key: ARROW-11897
> URL: https://issues.apache.org/jira/browse/ARROW-11897
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> The overall goal is to create an efficient pipeline from Parquet page data 
> into Arrow arrays, with as little intermediate conversion and memory 
> allocation as possible. It is assumed that, for best performance, we favor 
> doing fewer but larger copy operations (rather than many smaller ones). 
> Such a pipeline would need to be flexible in order to enable high performance 
> implementations in several different cases:
>  (1) In some cases, such as plain-encoded number array, it might even be 
> possible to copy / create the array from a single contiguous section from a 
> page buffer. 
>  (2) In other cases, such as a plain-encoded string array, values are 
> encoded in non-contiguous slices (where value bytes are separated by length 
> bytes) and a page buffer contains multiple values, so individual values will 
> have to be copied separately, and it's not obvious how this can be avoided.
>  (3) Finally, in the case of bit-packing encoding and smaller numeric values, 
> page buffer data has to be decoded / expanded before it is ready to copy into 
> an Arrow array, so a `Vec` will have to be returned instead of a slice 
> pointing to a page buffer.
> I propose that the implementation is split into three layers - (1) decoder, 
> (2) column reader and (3) array converter layers (not too dissimilar from the 
> current implementation, except it would be based on Iterators), as follows:
> *(1) Decoder layer:*
> A decoder output abstraction that enables all of the above cases and 
> minimizes intermediate memory allocation is `Iterator<Item = (usize, AsRef<[u8]>)>`.
>  Then in case (1) above, where a numeric array could be created from a single 
> contiguous byte slice, such an iterator could return a single item such as 
> `(1024, &[u8])`. 
>  In case (2) above, where each string value is encoded as an individual byte 
> slice, but it is still possible to copy directly from a page buffer, a 
> decoder iterator could return a sequence of items such as `(1, &[u8])`. 
>  And finally in case (3) above, where bit-packed values have to be 
> unpacked/expanded, and it's NOT possible to copy value bytes directly from a 
> page buffer, a decoder iterator could return items representing chunks of 
> values such as `(32, Vec)` where bit-packed values have been unpacked and 
>  the chunk size is configured for best performance.
> Another benefit of an `Iterator`-based abstraction is that it would prepare 
> the parquet crate for  migration to `async` `Stream`s (my understanding is 
> that a `Stream` is effectively an async `Iterator`).
> *(2) Column reader layer:*
> Then a higher level iterator could combine a value iterator and a (def) level 
> iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and 
> `NullSequence(count)` items from which an arrow array can be created 
> efficiently.
> In future, a higher level iterator (for the keys) could be combined with a 
> dictionary value iterator to create a dictionary array.
> *(3) Array converter layer:*
> Finally, Arrow arrays would be created from a (generic) higher-level 
> iterator, using a layer of array converters that know what the value bytes 
> and nulls mean for each type of array.
>  
> [~nevime] , [~Dandandan] , [~jorgecarleitao] let me know what you think
> Next steps:
>  * split work into smaller tasks that could be done over time
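
As an illustration of the decoder-layer shape proposed above (not the crate's 
actual API; the names, and the use of `Cow` to stand in for "either a slice into 
the page buffer or an owned `Vec`", are assumptions), a minimal Rust sketch:

{code:rust}
// Sketch only: names are hypothetical; `Cow<[u8]>` stands in for the
// "borrowed slice or owned Vec" output described in cases (1)-(3).
use std::borrow::Cow;

/// One decoder item: a value count plus the bytes backing those values.
type DecodedChunk<'a> = (usize, Cow<'a, [u8]>);

/// Case (1): a plain-encoded i32 column can come back as one large
/// borrowed chunk taken straight from the page buffer.
fn plain_i32_chunks(page: &[u8]) -> impl Iterator<Item = DecodedChunk<'_>> {
    std::iter::once((page.len() / 4, Cow::Borrowed(page)))
}

/// Case (3): bit-packed values are expanded into owned buffers, emitted
/// in chunks of a size chosen for performance.
fn unpacked_chunks(unpacked: Vec<u8>, chunk: usize) -> impl Iterator<Item = DecodedChunk<'static>> {
    unpacked
        .chunks(chunk)
        .map(|c| (c.len(), Cow::Owned(c.to_vec())))
        .collect::<Vec<_>>()
        .into_iter()
}

fn main() {
    let page = [0u8; 16];
    assert_eq!(plain_i32_chunks(&page).next().unwrap().0, 4); // 16 bytes / 4 per i32
    assert_eq!(unpacked_chunks(vec![0u8; 100], 32).count(), 4); // chunks of 32+32+32+4
}
{code}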



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12056) [C++] Create sequencing AsyncGenerator

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-12056.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9779
[https://github.com/apache/arrow/pull/9779]

> [C++] Create sequencing AsyncGenerator
> --
>
> Key: ARROW-12056
> URL: https://issues.apache.org/jira/browse/ARROW-12056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> ARROW-7001 needs a sequencing operator to reorder fragments & scan tasks that 
> arrive out of order.  This AsyncGenerator would poll the source and buffer 
> results until the "next" result arrives.  For example, given a source of 
> 6,2,1,3,4,5 the operator would return 1,2,3,4,5,6 and would need to buffer 2 
> items (6 & 2 at the beginning).
> The Next(T t) check will be configurable via a function.
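
For the reordering itself, a minimal sketch of the buffering logic (written here 
in Rust for illustration; the real operator is an asynchronous C++ AsyncGenerator 
and takes a configurable "is this the next item?" check rather than integer keys):

{code:rust}
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Reorder an out-of-order stream of consecutive sequence numbers,
/// buffering items until the expected "next" one has arrived.
fn sequence(items: impl IntoIterator<Item = u64>, mut next: u64) -> Vec<u64> {
    let mut buffered = BinaryHeap::new(); // min-heap via Reverse
    let mut out = Vec::new();
    for item in items {
        buffered.push(Reverse(item));
        // Emit everything that is now in order.
        while buffered.peek() == Some(&Reverse(next)) {
            out.push(buffered.pop().unwrap().0);
            next += 1;
        }
    }
    out
}

fn main() {
    // The example from the issue: 6,2,1,3,4,5 comes back as 1..6,
    // with 6 and 2 buffered at the start.
    assert_eq!(sequence([6, 2, 1, 3, 4, 5], 1), vec![1, 2, 3, 4, 5, 6]);
}
{code}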



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12056) [C++] Create sequencing AsyncGenerator

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-12056:
--

Assignee: Weston Pace

> [C++] Create sequencing AsyncGenerator
> --
>
> Key: ARROW-12056
> URL: https://issues.apache.org/jira/browse/ARROW-12056
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> ARROW-7001 needs a sequencing operator to reorder fragments & scan tasks that 
> arrive out of order.  This AsyncGenerator would poll the source and buffer 
> results until the "next" result arrives.  For example, given a source of 
> 6,2,1,3,4,5 the operator would return 1,2,3,4,5,6 and would need to buffer 2 
> items (6 & 2 at the beginning).
> The Next(T t) check will be configurable via a function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12088) [Python][C++] Warning about offsetof in pyarrow.dataset.RecordBatchIterator

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-12088.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9812
[https://github.com/apache/arrow/pull/9812]

> [Python][C++] Warning about offsetof in pyarrow.dataset.RecordBatchIterator
> ---
>
> Key: ARROW-12088
> URL: https://issues.apache.org/jira/browse/ARROW-12088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I just saw the following warning when compiling PyArrow:
> {code}
> /home/antoine/arrow/dev/python/build/temp.linux-x86_64-3.7/_dataset.cpp:46102:151:
>  warning: offset of on non-standard-layout type 'struct 
> __pyx_obj_7pyarrow_8_dataset_RecordBatchIterator' [-Winvalid-offsetof]
>   if (__pyx_type_7pyarrow_8_dataset_RecordBatchIterator.tp_weaklistoffset == 
> 0) __pyx_type_7pyarrow_8_dataset_RecordBatchIterator.tp_weaklistoffset = 
> offsetof(struct __pyx_obj_7pyarrow_8_dataset_RecordBatchIterator, 
> __pyx_base.__weakref__);
>   
> ^ 
> ~~
> /usr/lib/llvm-10/lib/clang/10.0.0/include/stddef.h:104:24: note: expanded 
> from macro 'offsetof'
> #define offsetof(t, d) __builtin_offsetof(t, d)
>^ ~
> 1 warning generated.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12149) [Dev] Archery benchmark test case is failing

2021-03-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12149:
---

 Summary: [Dev] Archery benchmark test case is failing
 Key: ARROW-12149
 URL: https://issues.apache.org/jira/browse/ARROW-12149
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 4.0.0


See build log 
https://github.com/apache/arrow/pull/9767/checks?check_run_id=2220192782#step:7:113

cc [~dianaclarke]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10364) [Dev][Archery] Test is failed with semver 2.13.0

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10364.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 8501
[https://github.com/apache/arrow/pull/8501]

> [Dev][Archery] Test is failed with semver 2.13.0
> 
>
> Key: ARROW-10364
> URL: https://issues.apache.org/jira/browse/ARROW-10364
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Archery, Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/runs/1276765550?check_suite_focus=true
> {noformat}
> === FAILURES 
> ===
> _ test_release_basics 
> __
> fake_jira = 
> def test_release_basics(fake_jira):
> >   r = Release.from_jira("1.0.0", jira=fake_jira)
> archery/tests/test_release.py:202: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> archery/release.py:281: in from_jira
> version = jira.project_version(version, project='ARROW')
> archery/release.py:93: in project_version
> return versions[versions.index(version_string)]
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:203:
>  in wrapper
> return operator(self, other)
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:573:
>  in __eq__
> return self.compare(other) == 0
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:493:
>  in compare
> other = cls.parse(other)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> cls = , version = '1.0.0'
> @classmethod
> def parse(cls, version):
> """
> Parse version string to a VersionInfo instance.
> 
> :param version: version string
> :return: a :class:`VersionInfo` instance
> :raises: :class:`ValueError`
> :rtype: :class:`VersionInfo`
> 
> .. versionchanged:: 2.11.0
>Changed method from static to classmethod to
>allow subclasses.
> 
> >>> semver.VersionInfo.parse('3.4.5-pre.2+build.4')
> VersionInfo(major=3, minor=4, patch=5, \
> prerelease='pre.2', build='build.4')
> """
> match = cls._REGEX.match(ensure_str(version))
> if match is None:
> raise ValueError("%s is not valid SemVer string" % version)
> 
> version_parts = match.groupdict()
> 
> version_parts["major"] = int(version_parts["major"])
> version_parts["minor"] = int(version_parts["minor"])
> version_parts["patch"] = int(version_parts["patch"])
> 
> >   return cls(**version_parts)
> E   TypeError: __init__() got an unexpected keyword argument 'major'
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:734:
>  TypeError
>  test_previous_and_next_release 
> 
> fake_jira = 
> def test_previous_and_next_release(fake_jira):
> >   r = Release.from_jira("3.0.0", jira=fake_jira)
> archery/tests/test_release.py:229: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> archery/release.py:281: in from_jira
> version = jira.project_version(version, project='ARROW')
> archery/release.py:93: in project_version
> return versions[versions.index(version_string)]
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:203:
>  in wrapper
> return operator(self, other)
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:573:
>  in __eq__
> return self.compare(other) == 0
> /opt/hostedtoolcache/Python/3.5.10/x64/lib/python3.5/site-packages/semver.py:493:
>  in compare
> other = cls.parse(other)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> cls = , version = '3.0.0'
> @classmethod
> def parse(cls, version):
> """
> Parse version string to a VersionInfo instance.
> 
> :param version: version string
> :return: a :class:`VersionInfo` instance
> :raises: :class:`ValueError`
> :rtype: :class:`VersionInfo`
> 
> .. versionchanged:: 2.11.0
>Changed method from static to classmethod to
>allow subclasses.
> 
> >>> semver.VersionInfo.parse('3.

[jira] [Resolved] (ARROW-12139) [Python][Packaging] Use vcpkg to build macOS wheels

2021-03-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12139.
-
Resolution: Fixed

Issue resolved by pull request 9767
[https://github.com/apache/arrow/pull/9767]

> [Python][Packaging] Use vcpkg to build macOS wheels
> ---
>
> Key: ARROW-12139
> URL: https://issues.apache.org/jira/browse/ARROW-12139
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Manylinux and Windows wheels already use vcpkg as the dependency source; port 
> the macOS wheel builds to align with that setup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12149) [Dev] Archery benchmark test case is failing

2021-03-30 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12149.
-
Resolution: Duplicate

There is already an issue about this error.

> [Dev] Archery benchmark test case is failing
> 
>
> Key: ARROW-12149
> URL: https://issues.apache.org/jira/browse/ARROW-12149
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 4.0.0
>
>
> See build log 
> https://github.com/apache/arrow/pull/9767/checks?check_run_id=2220192782#step:7:113
> cc [~dianaclarke]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

2021-03-30 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311356#comment-17311356
 ] 

Daniël Heres commented on ARROW-11897:
--

[~yordan-pavlov] you can join the apache slack here: 
https://s.apache.org/slack-invite

> [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays
> --
>
> Key: ARROW-11897
> URL: https://issues.apache.org/jira/browse/ARROW-11897
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> The overall goal is to create an efficient pipeline from Parquet page data 
> into Arrow arrays, with as little intermediate conversion and memory 
> allocation as possible. It is assumed that, for best performance, we favor 
> doing fewer but larger copy operations (rather than many smaller ones). 
> Such a pipeline would need to be flexible in order to enable high performance 
> implementations in several different cases:
>  (1) In some cases, such as plain-encoded number array, it might even be 
> possible to copy / create the array from a single contiguous section from a 
> page buffer. 
>  (2) In other cases, such as a plain-encoded string array, values are 
> encoded in non-contiguous slices (where value bytes are separated by length 
> bytes) and a page buffer contains multiple values, so individual values will 
> have to be copied separately, and it's not obvious how this can be avoided.
>  (3) Finally, in the case of bit-packing encoding and smaller numeric values, 
> page buffer data has to be decoded / expanded before it is ready to copy into 
> an Arrow array, so a `Vec` will have to be returned instead of a slice 
> pointing to a page buffer.
> I propose that the implementation is split into three layers - (1) decoder, 
> (2) column reader and (3) array converter layers (not too dissimilar from the 
> current implementation, except it would be based on Iterators), as follows:
> *(1) Decoder layer:*
> A decoder output abstraction that enables all of the above cases and 
> minimizes intermediate memory allocation is `Iterator<Item = (usize, AsRef<[u8]>)>`.
>  Then in case (1) above, where a numeric array could be created from a single 
> contiguous byte slice, such an iterator could return a single item such as 
> `(1024, &[u8])`. 
>  In case (2) above, where each string value is encoded as an individual byte 
> slice, but it is still possible to copy directly from a page buffer, a 
> decoder iterator could return a sequence of items such as `(1, &[u8])`. 
>  And finally in case (3) above, where bit-packed values have to be 
> unpacked/expanded, and it's NOT possible to copy value bytes directly from a 
> page buffer, a decoder iterator could return items representing chunks of 
> values such as `(32, Vec)` where bit-packed values have been unpacked and 
>  the chunk size is configured for best performance.
> Another benefit of an `Iterator`-based abstraction is that it would prepare 
> the parquet crate for  migration to `async` `Stream`s (my understanding is 
> that a `Stream` is effectively an async `Iterator`).
> *(2) Column reader layer:*
> Then a higher level iterator could combine a value iterator and a (def) level 
> iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and 
> `NullSequence(count)` items from which an arrow array can be created 
> efficiently.
> In future, a higher level iterator (for the keys) could be combined with a 
> dictionary value iterator to create a dictionary array.
> *(3) Array converter layer:*
> Finally, Arrow arrays would be created from a (generic) higher-level 
> iterator, using a layer of array converters that know what the value bytes 
> and nulls mean for each type of array.
>  
> [~nevime] , [~Dandandan] , [~jorgecarleitao] let me know what you think
> Next steps:
>  * split work into smaller tasks that could be done over time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12112) [CI] No space left on device - AMD64 Conda Integration test

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12112:
---
Priority: Blocker  (was: Major)

> [CI] No space left on device - AMD64 Conda Integration test
> ---
>
> Key: ARROW-12112
> URL: https://issues.apache.org/jira/browse/ARROW-12112
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI
>Reporter: Jonathan Keane
>Priority: Blocker
>
> One example:  
> https://github.com/apache/arrow/pull/9814/checks?check_run_id=2205470543#step:8:4716
> {code}
> + npm install
> npm WARN tar ENOSPC: no space left on device, write
> npm WARN tar ENOSPC: no space left on device, write
> npm ERR! code ENOSPC
> npm ERR! syscall write
> npm ERR! errno -28
> npm ERR! nospc ENOSPC: no space left on device, write
> npm ERR! nospc There appears to be insufficient space on your system to 
> finish.
> npm ERR! nospc Clear up some disk space and try again.
> npm ERR! A complete log of this run can be found in:
> npm ERR! /root/.npm/_logs/2021-03-26T22_10_59_913Z-debug.log
> 228
> Error: `docker-compose --file 
> /home/runner/work/arrow/arrow/docker-compose.yml run --rm conda-integration` 
> exited with a non-zero exit code 228, see the process log above.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12112) [CI] No space left on device - AMD64 Conda Integration test

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12112:
---
Fix Version/s: 4.0.0

> [CI] No space left on device - AMD64 Conda Integration test
> ---
>
> Key: ARROW-12112
> URL: https://issues.apache.org/jira/browse/ARROW-12112
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI
>Reporter: Jonathan Keane
>Priority: Blocker
> Fix For: 4.0.0
>
>
> One example:  
> https://github.com/apache/arrow/pull/9814/checks?check_run_id=2205470543#step:8:4716
> {code}
> + npm install
> npm WARN tar ENOSPC: no space left on device, write
> npm WARN tar ENOSPC: no space left on device, write
> npm ERR! code ENOSPC
> npm ERR! syscall write
> npm ERR! errno -28
> npm ERR! nospc ENOSPC: no space left on device, write
> npm ERR! nospc There appears to be insufficient space on your system to 
> finish.
> npm ERR! nospc Clear up some disk space and try again.
> npm ERR! A complete log of this run can be found in:
> npm ERR! /root/.npm/_logs/2021-03-26T22_10_59_913Z-debug.log
> 228
> Error: `docker-compose --file 
> /home/runner/work/arrow/arrow/docker-compose.yml run --rm conda-integration` 
> exited with a non-zero exit code 228, see the process log above.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12112) [CI] No space left on device - AMD64 Conda Integration test

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12112:
---
Component/s: Continuous Integration

> [CI] No space left on device - AMD64 Conda Integration test
> ---
>
> Key: ARROW-12112
> URL: https://issues.apache.org/jira/browse/ARROW-12112
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI, Continuous Integration, Integration
>Reporter: Jonathan Keane
>Priority: Blocker
> Fix For: 4.0.0
>
>
> One example:  
> https://github.com/apache/arrow/pull/9814/checks?check_run_id=2205470543#step:8:4716
> {code}
> + npm install
> npm WARN tar ENOSPC: no space left on device, write
> npm WARN tar ENOSPC: no space left on device, write
> npm ERR! code ENOSPC
> npm ERR! syscall write
> npm ERR! errno -28
> npm ERR! nospc ENOSPC: no space left on device, write
> npm ERR! nospc There appears to be insufficient space on your system to 
> finish.
> npm ERR! nospc Clear up some disk space and try again.
> npm ERR! A complete log of this run can be found in:
> npm ERR! /root/.npm/_logs/2021-03-26T22_10_59_913Z-debug.log
> 228
> Error: `docker-compose --file 
> /home/runner/work/arrow/arrow/docker-compose.yml run --rm conda-integration` 
> exited with a non-zero exit code 228, see the process log above.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12112) [CI] No space left on device - AMD64 Conda Integration test

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12112:
---
Component/s: Integration

> [CI] No space left on device - AMD64 Conda Integration test
> ---
>
> Key: ARROW-12112
> URL: https://issues.apache.org/jira/browse/ARROW-12112
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI, Integration
>Reporter: Jonathan Keane
>Priority: Blocker
> Fix For: 4.0.0
>
>
> One example:  
> https://github.com/apache/arrow/pull/9814/checks?check_run_id=2205470543#step:8:4716
> {code}
> + npm install
> npm WARN tar ENOSPC: no space left on device, write
> npm WARN tar ENOSPC: no space left on device, write
> npm ERR! code ENOSPC
> npm ERR! syscall write
> npm ERR! errno -28
> npm ERR! nospc ENOSPC: no space left on device, write
> npm ERR! nospc There appears to be insufficient space on your system to 
> finish.
> npm ERR! nospc Clear up some disk space and try again.
> npm ERR! A complete log of this run can be found in:
> npm ERR! /root/.npm/_logs/2021-03-26T22_10_59_913Z-debug.log
> 228
> Error: `docker-compose --file 
> /home/runner/work/arrow/arrow/docker-compose.yml run --rm conda-integration` 
> exited with a non-zero exit code 228, see the process log above.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

2021-03-30 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311401#comment-17311401
 ] 

Jorge Leitão commented on ARROW-11897:
--

I see. To understand: is there a reason why this should be in [Parquet] instead 
of in [DataFusion]? I.e. why should we push a specific parallelism strategy to 
the library?

Asking this because, the way I see it, the parquet crate can't tell which 
use case it is being used for, and so can't pick an optimal strategy (one record 
per page, per group, per file, or across files?). For example, s3 vs hdfs vs local 
file-system typically require different parallelism strategies.

My hypothesis (which may be wrong!) is that the parquet crate should offer 
"units of work" that can be divided/parallelized according to IO (e.g. s3 vs 
filesystem), memory and CPU constraints that each consumer has, and allow 
consumers of the library (e.g. DataFusion, Polars, Ballista, s3 vs hdfs vs 
file-system) to design strategies that fit their constraints the best, by 
assembling these units according to their compute model.

> [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays
> --
>
> Key: ARROW-11897
> URL: https://issues.apache.org/jira/browse/ARROW-11897
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> The overall goal is to create an efficient pipeline from Parquet page data 
> into Arrow arrays, with as little intermediate conversion and memory 
> allocation as possible. It is assumed that, for best performance, we favor 
> doing fewer but larger copy operations (rather than many smaller ones). 
> Such a pipeline would need to be flexible in order to enable high performance 
> implementations in several different cases:
>  (1) In some cases, such as plain-encoded number array, it might even be 
> possible to copy / create the array from a single contiguous section from a 
> page buffer. 
>  (2) In other cases, such as a plain-encoded string array, values are 
> encoded in non-contiguous slices (where value bytes are separated by length 
> bytes) and a page buffer contains multiple values, so individual values will 
> have to be copied separately, and it's not obvious how this can be avoided.
>  (3) Finally, in the case of bit-packing encoding and smaller numeric values, 
> page buffer data has to be decoded / expanded before it is ready to copy into 
> an Arrow array, so a `Vec` will have to be returned instead of a slice 
> pointing to a page buffer.
> I propose that the implementation is split into three layers - (1) decoder, 
> (2) column reader and (3) array converter layers (not too dissimilar from the 
> current implementation, except it would be based on Iterators), as follows:
> *(1) Decoder layer:*
> A decoder output abstraction that enables all of the above cases and 
> minimizes intermediate memory allocation is `Iterator<Item = (usize, AsRef<[u8]>)>`.
>  Then in case (1) above, where a numeric array could be created from a single 
> contiguous byte slice, such an iterator could return a single item such as 
> `(1024, &[u8])`. 
>  In case (2) above, where each string value is encoded as an individual byte 
> slice, but it is still possible to copy directly from a page buffer, a 
> decoder iterator could return a sequence of items such as `(1, &[u8])`. 
>  And finally in case (3) above, where bit-packed values have to be 
> unpacked/expanded, and it's NOT possible to copy value bytes directly from a 
> page buffer, a decoder iterator could return items representing chunks of 
> values such as `(32, Vec)` where bit-packed values have been unpacked and 
>  the chunk size is configured for best performance.
> Another benefit of an `Iterator`-based abstraction is that it would prepare 
> the parquet crate for  migration to `async` `Stream`s (my understanding is 
> that a `Stream` is effectively an async `Iterator`).
> *(2) Column reader layer:*
> Then a higher level iterator could combine a value iterator and a (def) level 
> iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and 
> `NullSequence(count)` items from which an arrow array can be created 
> efficiently.
> In future, a higher level iterator (for the keys) could be combined with a 
> dictionary value iterator to create a dictionary array.
> *(3) Array converter layer:*
> Finally, Arrow arrays would be created from a (generic) higher-level 
> iterator, using a layer of array converters that know what the value bytes 
> and nulls mean for each type of array.
>  
> [~nevime] , [~Dandandan] , [~jorgecarleitao] let me know what you think
> Next steps:
>  * split work into smaller tasks that could be done over time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-12143) [CI] R builds should timeout and fail after some threshold and dump the output.

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12143:
---
Component/s: Continuous Integration

> [CI] R builds should timeout and fail after some threshold and dump the 
> output.
> ---
>
> Key: ARROW-12143
> URL: https://issues.apache.org/jira/browse/ARROW-12143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration, R
>Reporter: Weston Pace
>Priority: Major
>
> Currently, if an R test hangs, then it is very difficult to determine what 
> the root cause is because it just outputs "checking tests".  It also slows 
> down the CI pipeline because it doesn't time out for 6 hours.
> I'm hoping we can instead kill the test after some unreasonable amount of 
> time has passed and dump whatever output has been generated so far.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11675) [CI] Resolve ctest failures in crossbow task test-build-vcpkg-win

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11675:
---
Component/s: Continuous Integration

> [CI] Resolve ctest failures in crossbow task test-build-vcpkg-win
> -
>
> Key: ARROW-11675
> URL: https://issues.apache.org/jira/browse/ARROW-11675
> Project: Apache Arrow
>  Issue Type: Task
>  Components: CI, Continuous Integration
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>
> The crossbow task *test-build-vcpkg-win* runs the script 
> {{dev/tasks/vcpkg-tests/cpp-build-vcpkg.bat}}. This runs {{ctest}} which 
> shows two failing tests:
>  * {{TestStatisticsSortOrder/0.MinMax}}
>  * {{TestStatistic.Int32Extremums}}
> Full logs from a recent run of this task: 
> [https://github.com/ursacomputing/crossbow/actions?query=branch:actions-99-github-test-build-vcpkg-win]
> Diagnose and resolve these failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12040) [R] [CI] [C++] test-r-rstudio-r-base-3.6-opensuse15 timing out during tests

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12040:
---
Component/s: Continuous Integration

> [R] [CI] [C++] test-r-rstudio-r-base-3.6-opensuse15 timing out during tests
> ---
>
> Key: ARROW-12040
> URL: https://issues.apache.org/jira/browse/ARROW-12040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, CI, Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The nightly job test-r-rstudio-r-base-3.6-opensuse15 has timed out on tests 
> for (at least) the last 3 nightly jobs now.
> Everything seems fine until the `checking tests ...` stage of R CMD CHECK 
> which hangs for long enough that Azure kills the job because the total job 
> has exceed 360 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10031) Support Java benchmark in Ursabot

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10031:
---
Component/s: Continuous Integration

> Support Java benchmark in Ursabot
> -
>
> Key: ARROW-10031
> URL: https://issues.apache.org/jira/browse/ARROW-10031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: CI, Continuous Integration, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Based on [the 
> suggestion|https://mail-archives.apache.org/mod_mbox/arrow-dev/202008.mbox/%3ccabnn7+q35j7qwshjbx8omdewkt+f1p_m7r1_f6szs4dqc+l...@mail.gmail.com%3e],
>  Ursabot will support Java benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11608) [CI] turbodbc integration tests are failing (build issue)

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11608:
---
Component/s: Continuous Integration

> [CI] turbodbc integration tests are failing (build issue)
> 
>
> Key: ARROW-11608
> URL: https://issues.apache.org/jira/browse/ARROW-11608
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Both turbodbc builds are failing, see eg 
> https://github.com/ursacomputing/crossbow/runs/1885201762
> It seems to be a failure to build turbodbc: 
> {code}
> /build/turbodbc /
> -- The CXX compiler identification is GNU 9.3.0
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /opt/conda/envs/arrow/bin/x86_64-conda-linux-gnu-c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Build type: Debug
> CMake Error at CMakeLists.txt:14 (add_subdirectory):
>   add_subdirectory given source "pybind11" which is not an existing
>   directory.
> -- Found GTest: /opt/conda/envs/arrow/lib/libgtest.so  
> -- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
> components: locale 
> -- Detecting unixODBC library
> --   Found header files at: /opt/conda/envs/arrow/include
> --   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
> -- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
> components: system date_time locale 
> -- Detecting unixODBC library
> --   Found header files at: /opt/conda/envs/arrow/include
> --   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
> -- Found Boost: /opt/conda/envs/arrow/include (found version "1.74.0") found 
> components: system 
> -- Detecting unixODBC library
> --   Found header files at: /opt/conda/envs/arrow/include
> --   Found library at: /opt/conda/envs/arrow/lib/libodbc.so
> CMake Error at cpp/turbodbc_python/Library/CMakeLists.txt:3 
> (pybind11_add_module):
>   Unknown CMake command "pybind11_add_module".
> -- Configuring incomplete, errors occurred!
> See also "/build/turbodbc/CMakeFiles/CMakeOutput.log".
> See also "/build/turbodbc/CMakeFiles/CMakeError.log".
> 1
> Error: `docker-compose --file 
> /home/runner/work/crossbow/crossbow/arrow/docker-compose.yml run --rm -e 
> SETUPTOOLS_SCM_PRETEND_VERSION=3.1.0.dev174 conda-python-turbodbc` exited 
> with a non-zero exit code 1, see the process log above.
> {code}
> cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9978) [Rust] Umbrella issue for clippy integration

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9978:
--
Component/s: Continuous Integration

> [Rust] Umbrella issue for clippy integration
> 
>
> Key: ARROW-9978
> URL: https://issues.apache.org/jira/browse/ARROW-9978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: CI, Continuous Integration, Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Priority: Major
>
> This is an umbrella issue to collate outstanding and new tasks to enable 
> clippy integration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9765) [C++][CI][Windows] link errors on windows when using testing::HasSubstr match

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9765:
--
Component/s: Continuous Integration

> [C++][CI][Windows] link errors on windows when using testing::HasSubstr match
> -
>
> Key: ARROW-9765
> URL: https://issues.apache.org/jira/browse/ARROW-9765
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI, Continuous Integration
>Reporter: Micah Kornfield
>Priority: Minor
>
> I tried using using testing::HasSubstr in a test in 
> cpp/src/parquet/arrow/arrow_schema_test.cc  and it resulted in appveyor CI 
> failing to link on windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10798) [CI] Do not run R, Go and Ruby on any change to any dockerfile.

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10798:
---
Component/s: Continuous Integration

> [CI] Do not run R, Go and Ruby on any change to any dockerfile.
> ---
>
> Key: ARROW-10798
> URL: https://issues.apache.org/jira/browse/ARROW-10798
> Project: Apache Arrow
>  Issue Type: Task
>  Components: CI, Continuous Integration
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11633) [CI] [Documentation] Maven default skin not found

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11633:
---
Component/s: Continuous Integration

> [CI] [Documentation] Maven default skin not found
> -
>
> Key: ARROW-11633
> URL: https://issues.apache.org/jira/browse/ARROW-11633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI, Continuous Integration, Documentation
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The docs nightly build has been failing for a few days with:
> {code}
> 2021-02-15T07:26:05.8699084Z [INFO] Arrow Algorithms 
> ... SUCCESS [ 15.498 s]
> 2021-02-15T07:26:05.8700080Z [INFO] Arrow Performance Benchmarks 
> 4.0.0-SNAPSHOT  FAILURE [15:45 min]
> 2021-02-15T07:26:05.8700992Z [INFO] 
> 
> 2021-02-15T07:26:05.8701554Z [INFO] BUILD FAILURE
> 2021-02-15T07:26:05.8702879Z [INFO] 
> 
> 2021-02-15T07:26:05.8703563Z [INFO] Total time: 29:46 min (Wall Clock)
> 2021-02-15T07:26:05.8704366Z [INFO] Finished at: 2021-02-15T07:26:05Z
> 2021-02-15T07:26:05.8705209Z [INFO] 
> 
> 2021-02-15T07:26:05.8707032Z [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-site-plugin:3.3:site (default-site) on project 
> arrow-performance: SiteToolException: ArtifactResolutionException: Unable to 
> find skin: Could not transfer artifact 
> org.apache.maven.skins:maven-default-skin:jar:1.0 from/to central 
> (https://repo.maven.apache.org/maven2): Connection timed out (Read failed)
> 2021-02-15T07:26:05.8708593Z [ERROR]   
> org.apache.maven.skins:maven-default-skin:jar:1.0
> 2021-02-15T07:26:05.8709136Z [ERROR] 
> 2021-02-15T07:26:05.8709618Z [ERROR] from the specified remote repositories:
> 2021-02-15T07:26:05.8710313Z [ERROR]   apache.snapshots 
> (https://repository.apache.org/snapshots, releases=false, snapshots=true),
> 2021-02-15T07:26:05.8711035Z [ERROR]   central 
> (https://repo.maven.apache.org/maven2, releases=true, snapshots=false)
> 2021-02-15T07:26:05.8711775Z [ERROR] -> [Help 1]
> 2021-02-15T07:26:05.8712234Z [ERROR] 
> 2021-02-15T07:26:05.8712989Z [ERROR] To see the full stack trace of the 
> errors, re-run Maven with the -e switch.
> 2021-02-15T07:26:05.8714323Z [ERROR] Re-run Maven using the -X switch to 
> enable full debug logging.
> 2021-02-15T07:26:05.8714873Z [ERROR] 
> 2021-02-15T07:26:05.8715478Z [ERROR] For more information about the errors 
> and possible solutions, please read the following articles:
> 2021-02-15T07:26:05.8716188Z [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> 2021-02-15T07:26:05.8716717Z [ERROR] 
> 2021-02-15T07:26:05.8717187Z [ERROR] After correcting the problems, you can 
> resume the build with the command
> 2021-02-15T07:26:05.8717946Z [ERROR]   mvn  -rf :arrow-performance
> 2021-02-15T07:26:07.0376588Z 1
> 2021-02-15T07:26:07.1031165Z Error: `docker-compose --file 
> /home/vsts/work/1/s/arrow/docker-compose.yml run --rm -e 
> SETUPTOOLS_SCM_PRETEND_VERSION=3.1.0.dev183 ubuntu-docs` exited with a 
> non-zero exit code 1, see the process log above.
> {code}
> And, indeed the 1.0 version of maven-default-skin is not at 
> https://repository.apache.org/content/groups/snapshots/org/apache/maven/skins/maven-default-skin/
>  (though it does appear to be 
> https://repo.maven.apache.org/maven2/org/apache/maven/skins/maven-default-skin/)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11140) [Rust] [CI] Try out buildkite

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-11140:
---
Component/s: Continuous Integration

> [Rust] [CI] Try out buildkite
> -
>
> Key: ARROW-11140
> URL: https://issues.apache.org/jira/browse/ARROW-11140
> Project: Apache Arrow
>  Issue Type: Task
>  Components: CI, Continuous Integration, Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Let's run a test to validate whether we can use buildkite for our own flows, 
> which would add a lot of options for architectures and environments that we can 
> test on.
> Goal: validate that we can use buildkite on the rust builds.
> Requirements:
>  # pipeline starts when a PR is made
>  # result is sent back to github and users can access its logs
>  # we can use caches (e.g. 
> [https://github.com/danthorpe/cache-buildkite-plugin] )
>  # we can actually run the builds
>  # we can limit the builds to only be triggered when certain parts of the 
> repo change (i.e. not run when only C++ code changed)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8697) [CI] Add descriptions to the docker-compose images

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8697:
--
Component/s: Continuous Integration

> [CI] Add descriptions to the docker-compose images
> --
>
> Key: ARROW-8697
> URL: https://issues.apache.org/jira/browse/ARROW-8697
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> Add docstring like descriptions to the docker-compose services.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9426) [CI] Maybe redundant 'entry' key in .pre-commit-config.yaml

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9426:
--
Component/s: Continuous Integration

> [CI] Maybe redundant 'entry' key in .pre-commit-config.yaml
> ---
>
> Key: ARROW-9426
> URL: https://issues.apache.org/jira/browse/ARROW-9426
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI, Continuous Integration
>Reporter: Chen
>Priority: Minor
>
> Hi, I happened to find a minor issue in the '.pre-commit-config.yaml' file:
> ```yaml
> - id: cmake-format
>   name: CMake Format
>   language: python
>   entry: bash -c "pip install cmake-format && python run-cmake-format.py 
> --check"
>   entry: echo
>   files: ^(.*/CMakeLists.txt|.*.cmake)$
> ```
> Maybe the item `entry: echo` is redundant. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7304) [C++] clang-tidy diagnostics not emitted for most headers

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-7304:
--
Component/s: Continuous Integration

> [C++] clang-tidy diagnostics not emitted for most headers
> -
>
> Key: ARROW-7304
> URL: https://issues.apache.org/jira/browse/ARROW-7304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI, Continuous Integration
>Affects Versions: 0.15.1
>Reporter: Elvis Stansvik
>Priority: Minor
>
> The {{HeaderFilterRegex}} in {{.clang-tidy}} is written
> {code}
> HeaderFilterRegex: 
> '^(.*codegen.*|.*_generated.*|.*windows_compatibility.h|.*pyarrow_api.h|.*pyarrow_lib.h|.*python/config.h|.*python/platform.h|.*thirdparty/ae/.*|.*vendored/.*|.*RcppExports.cpp.*|)$'
> {code}
> as if it was an exclusion filter, but {{HeaderFilterRegex}} is in fact an 
> inclusion mechanism. So clang-tidy diagnostics are not emitted for I guess 
> most of the headers in Arrow.
> See 
> [https://github.com/apache/arrow/commit/72b553147e4bd47e100fbfd58ed49041561b7bc4#r36225046]
>  which is where I came across this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8012) [C++][CI] Set CTEST_PARALLEL_LEVEL to $concurrency

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8012:
--
Component/s: Continuous Integration

> [C++][CI] Set CTEST_PARALLEL_LEVEL to $concurrency
> --
>
> Key: ARROW-8012
> URL: https://issues.apache.org/jira/browse/ARROW-8012
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, CI, Continuous Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>
> Currently the default {{test}} target runs serially while {{unittest}} 
> arbitrarily uses 4 threads. On many systems that's suboptimal. The 
> environment variable {{CTEST_PARALLEL_LEVEL}} can be set to run tests in 
> parallel, and we should probably have a good default for it (like the 
> hardware concurrency)
> https://cmake.org/cmake/help/latest/manual/ctest.1.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7131) [GLib][CI] Fail to execute lua examples in the MacOS build

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-7131:
--
Component/s: Continuous Integration

> [GLib][CI] Fail to execute lua examples in the MacOS build
> --
>
> Key: ARROW-7131
> URL: https://issues.apache.org/jira/browse/ARROW-7131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration, GLib
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
>
> Fails to load 'lgi.corelgilua51' even though lgi is installed in the macOS 
> build.
> References:
> - https://github.com/apache/arrow/blob/master/.github/workflows/ruby.yml#L77
> - https://github.com/apache/arrow/blob/master/ci/scripts/c_glib_test.sh#L35



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12100) [C#] Cannot round-trip record batch with PyArrow

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-12100:
--

Assignee: Antoine Pitrou

> [C#] Cannot round-trip record batch with PyArrow
> 
>
> Key: ARROW-12100
> URL: https://issues.apache.org/jira/browse/ARROW-12100
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, C++, Python
>Affects Versions: 3.0.0
>Reporter: Tanguy Fautre
>Assignee: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: ArrowSharedMemory_20210326.zip, 
> ArrowSharedMemory_20210326_2.zip, ArrowSharedMemory_20210329.zip
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Has anyone ever tried to round-trip a record batch between Arrow C# and 
> PyArrow? I can't get PyArrow to read the data correctly.
> For context, I'm trying to do Arrow data-frames inter-process communication 
> between C# and Python using shared memory (local TCP/IP is also an 
> alternative). Ideally, I wouldn't even have to serialise the data and could 
> just share the Arrow in-memory representation directly, but I'm not sure this 
> is even possible with Apache Arrow. Full source code as attachment.
> *C#*
> {code:c#}
> using (var stream = sharedMemory.CreateStream(0, 0, 
> MemoryMappedFileAccess.ReadWrite))
> {
> var recordBatch = /* ... */
> using (var writer = new ArrowFileWriter(stream, recordBatch.Schema, 
> leaveOpen: true))
> {
> writer.WriteRecordBatch(recordBatch);
> writer.WriteEnd();
> }
> }
> {code}
> *Python*
> {code:python}
> shmem = open_shared_memory(args)
> address = get_shared_memory_address(shmem)
> buf = pa.foreign_buffer(address, args.sharedMemorySize)
> stream = pa.input_stream(buf)
> reader = pa.ipc.open_stream(stream)
> {code}
> Unfortunately, it fails with the following error: {{pyarrow.lib.ArrowInvalid: 
> Expected to read 1330795073 metadata bytes, but only read 1230}}.
> I can see that the memory content starts with 
> {{ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00}}. It seems 
> that using the API calls above, PyArrow reads "ARRO" as the length of the 
> metadata.
> I assume I'm using the API incorrectly. Has anyone got a working example?
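
One plausible reading of that error, sketched below purely in Python and independent of the shared-memory setup: {{ArrowFileWriter}} on the C# side emits the IPC file format, which begins with the {{ARROW1}} magic, while {{pa.ipc.open_stream}} expects the stream format and therefore interprets the first four bytes ("ARRO") as a message length. This is only a hypothesis about the reported symptom, not a statement of the eventual fix.
{code:python}
import pyarrow as pa

# Self-contained sketch of the file-vs-stream hypothesis above; the data and
# in-memory sink are illustrative, not the reporter's shared-memory setup.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch.schema) as writer:  # IPC *file* format, like ArrowFileWriter
    writer.write_batch(batch)
buf = sink.getvalue()

try:
    pa.ipc.open_stream(buf).read_all()   # stream reader misreads "ARRO" as a length
except pa.ArrowInvalid as e:
    print("open_stream fails:", e)

print(pa.ipc.open_file(buf).read_all())  # file reader round-trips correctly
{code}
If that is indeed the mismatch, pairing the file-format writer with {{pa.ipc.open_file}} (or switching both sides to the stream format) should let the round trip succeed.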



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12100) [C#] Cannot round-trip record batch with PyArrow

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-12100.

Resolution: Fixed

Issue resolved by pull request 9837
[https://github.com/apache/arrow/pull/9837]

> [C#] Cannot round-trip record batch with PyArrow
> 
>
> Key: ARROW-12100
> URL: https://issues.apache.org/jira/browse/ARROW-12100
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, C++, Python
>Affects Versions: 3.0.0
>Reporter: Tanguy Fautre
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: ArrowSharedMemory_20210326.zip, 
> ArrowSharedMemory_20210326_2.zip, ArrowSharedMemory_20210329.zip
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Has anyone ever tried to round-trip a record batch between Arrow C# and 
> PyArrow? I can't get PyArrow to read the data correctly.
> For context, I'm trying to exchange Arrow data frames between C# and Python 
> processes using shared memory (local TCP/IP is also an alternative). Ideally, 
> I wouldn't even have to serialise the data and could just share the Arrow 
> in-memory representation directly, but I'm not sure this is even possible 
> with Apache Arrow. Full source code is attached.
> *C#*
> {code:c#}
> using (var stream = sharedMemory.CreateStream(0, 0, MemoryMappedFileAccess.ReadWrite))
> {
>     var recordBatch = /* ... */;
> 
>     using (var writer = new ArrowFileWriter(stream, recordBatch.Schema, leaveOpen: true))
>     {
>         writer.WriteRecordBatch(recordBatch);
>         writer.WriteEnd();
>     }
> }
> {code}
> *Python*
> {code:python}
> shmem = open_shared_memory(args)
> address = get_shared_memory_address(shmem)
> buf = pa.foreign_buffer(address, args.sharedMemorySize)
> stream = pa.input_stream(buf)
> reader = pa.ipc.open_stream(stream)
> {code}
> Unfortunately, it fails with the following error: {{pyarrow.lib.ArrowInvalid: 
> Expected to read 1330795073 metadata bytes, but only read 1230}}.
> I can see that the memory content starts with 
> {{ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00}}. It seems 
> that using the API calls above, PyArrow reads "ARRO" as the length of the 
> metadata.
> I assume I'm using the API incorrectly. Has anyone got a working example?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12136) [Rust][DataFusion] Reduce default batch_size to 8192

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-12136.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9834
[https://github.com/apache/arrow/pull/9834]

> [Rust][DataFusion] Reduce default batch_size to 8192
> 
>
> Key: ARROW-12136
> URL: https://issues.apache.org/jira/browse/ARROW-12136
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12106) [Rust][DataFusion] Support `SELECT * from information_schema.tables`

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-12106:

Component/s: Rust - DataFusion

> [Rust][DataFusion] Support `SELECT * from information_schema.tables`
> 
>
> Key: ARROW-12106
> URL: https://issues.apache.org/jira/browse/ARROW-12106
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12106) [Rust][DataFusion] Support `SELECT * from information_schema.tables`

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-12106.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9818
[https://github.com/apache/arrow/pull/9818]

> [Rust][DataFusion] Support `SELECT * from information_schema.tables`
> 
>
> Key: ARROW-12106
> URL: https://issues.apache.org/jira/browse/ARROW-12106
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11973) [Rust] Boolean AND/OR kernels should follow sql behaviour regarding null values

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11973.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9772
[https://github.com/apache/arrow/pull/9772]

> [Rust] Boolean AND/OR kernels should follow sql behaviour regarding null 
> values
> ---
>
> Key: ARROW-11973
> URL: https://issues.apache.org/jira/browse/ARROW-11973
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Affects Versions: 3.0.0
>Reporter: Jörn Horstmann
>Assignee: Christoph Schulze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The AND/OR boolean kernels currently have the same null handling as other 
> binary expressions: if either the left or right input is NULL then the result 
> will be NULL. The standard SQL behaviour is different:
> OR: If one input is TRUE then the result will be TRUE even if the other input 
> is NULL
> AND: If one input is FALSE then the result will be FALSE regardless of the 
> other input
> This behaviour makes sense if you think of NULL as meaning UNKNOWN.
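
For illustration, the proposed semantics can be sketched from Python with pyarrow's Kleene kernels (the issue itself concerns the Rust kernels; the sketch assumes the {{and_kleene}}/{{or_kleene}} compute functions are available in your pyarrow build, and the expected outputs are noted as comments):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# SQL/Kleene three-valued logic: TRUE OR NULL -> TRUE, FALSE AND NULL -> FALSE,
# every other combination involving NULL stays NULL.
a = pa.array([True, False, None])
b = pa.array([None, None, None], type=pa.bool_())

print(pc.or_kleene(a, b))   # [true, null, null]
print(pc.and_kleene(a, b))  # [null, false, null]
{code}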



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11973) [Rust] Boolean AND/OR kernels should follow sql behaviour regarding null values

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-11973:
---

Assignee: Christoph Schulze  (was: Jörn Horstmann)

> [Rust] Boolean AND/OR kernels should follow sql behaviour regarding null 
> values
> ---
>
> Key: ARROW-11973
> URL: https://issues.apache.org/jira/browse/ARROW-11973
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Affects Versions: 3.0.0
>Reporter: Jörn Horstmann
>Assignee: Christoph Schulze
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The AND/OR boolean kernels currently have the same null handling as other 
> binary expressions: if either the left or right input is NULL then the result 
> will be NULL. The standard SQL behaviour is different:
> OR: If one input is TRUE then the result will be TRUE even if the other input 
> is NULL
> AND: If one input is FALSE then the result will be FALSE regardless of the 
> other input
> This behaviour makes sense if you think of NULL as meaning UNKNOWN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12143) [CI] R builds should timeout and fail after some threshold and dump the output.

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12143:
---
Labels: pull-request-available  (was: )

> [CI] R builds should timeout and fail after some threshold and dump the 
> output.
> ---
>
> Key: ARROW-12143
> URL: https://issues.apache.org/jira/browse/ARROW-12143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Continuous Integration, R
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, if an R test hangs, it is very difficult to determine the root 
> cause because the build just outputs "checking tests". It also slows down the 
> CI pipeline because the job does not time out until 6 hours have passed.
> I'm hoping we can instead kill the test after some unreasonable amount of 
> time has passed and dump whatever output has been generated so far.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
 from decimal import Decimal
values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
 decs_from_values = [Decimal(v) for v in values_floats] # Decimal
 decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
 decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}
table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
 pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
 print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal 

data_dict = \{"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using 
from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)

print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting pyarrow.table that contains mixed-precision Decimals using  
> parquet.write_table creates a parquet that contains invalid data/values.
> In the example below the first value of data_decimal is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
>  from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
>  decs_from_values = [Decimal(v) for v in values_floats] # Decimal
>  decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
> using from_float
>  decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float, # python Decimal using from_float
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
>  pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
>  print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)
abdel alfahham created ARROW-12150:
--

 Summary: [Python] Invalid data when Decimal is exported to parquet 
 Key: ARROW-12150
 URL: https://issues.apache.org/jira/browse/ARROW-12150
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: - macOS Big Sur 11.2.1
- python 3.8.2
Reporter: abdel alfahham


Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal 

data_dict = \{"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using 
from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)

print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting pyarrow.table that contains mixed-precision Decimals using  
> parquet.write_table creates a parquet that contains invalid data/values.
> In the example below the first value of data_decimal is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
 from decimal import Decimal
values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
 decs_from_values = [Decimal(v) for v in values_floats] # Decimal
 decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
using from_float
 decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float, # python Decimal using from_float
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}
table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
 pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
 print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting pyarrow.table that contains mixed-precision Decimals using  
> parquet.write_table creates a parquet that contains invalid data/values.
> In the example below the first value of data_decimal is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats] # Decimal 
> using from_float
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float, # python Decimal using from_float
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
\{Decimal('579.119995117187954747350886464118957519531250')} in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting pyarrow.table that contains mixed-precision Decimals using  
> parquet.write_table creates a parquet that contains invalid data/values.
> In the example below the first value of data_decimal is turned from 
> \{Decimal('579.119995117187954747350886464118957519531250')} in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Docs Text:   (was: bu)

> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250')} in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
_parquet.write_table_ creates a parquet that contains invalid data/values.

In the example below the first value of _data_decimal_ is turned from 
Decimal('579.119995117187954747350886464118957519531250')} in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting pyarrow.table that contains mixed-precision Decimals using  
parquet.write_table creates a parquet that contains invalid data/values.

In the example below the first value of data_decimal is turned from 
\{Decimal('579.119995117187954747350886464118957519531250')} in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250')} in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
_parquet.write_table_ creates a parquet that contains invalid data/values.

In the example below the first value of _data_decimal_ is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
_parquet.write_table_ creates a parquet that contains invalid data/values.

In the example below the first value of _data_decimal_ is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table

to Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Invalid data when Decimal is exported to parquet

2021-03-30 Thread abdel alfahham (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

abdel alfahham updated ARROW-12150:
---
Description: 
Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
_parquet.write_table_ creates a parquet that contains invalid data/values.

In the example below the first value of _data_decimal_ is turned from 
Decimal('579.119995117187954747350886464118957519531250') in the 
pyarrow table

to Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 

  was:
Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
_parquet.write_table_ creates a parquet that contains invalid data/values.

In the example below the first value of _data_decimal_ is turned from 
Decimal('579.119995117187954747350886464118957519531250')} in the 
pyarrow table to 
Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
parquet.

 
{code:java}
import pyarrow
from decimal import Decimal

values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
decs_from_values = [Decimal(v) for v in values_floats] # Decimal
decs_from_float = [Decimal.from_float(v) for v in values_floats]
decs_str = [Decimal(str(v)) for v in values_floats] # Decimal

data_dict = {"data_decimal": decs_from_values, # python Decimal
 "data_decimal_from_float": decs_from_float,
 "data_float":values_floats, # python floats
 "data_dec_str": decs_str}

table = pyarrow.table(data=data_dict)
print(table.to_pydict()) # before saving
pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
after saving
{code}
 


> [Python] Invalid data when Decimal is exported to parquet 
> --
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table
> to Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Bad type inference of mixed-precision Decimals

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12150:
---
Summary: [Python] Bad type inference of mixed-precision Decimals  (was: 
[Python] Invalid data when Decimal is exported to parquet )

> [Python] Bad type inference of mixed-precision Decimals
> ---
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12150) [Python] Bad type inference of mixed-precision Decimals

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12150:
---
Fix Version/s: 4.0.0

> [Python] Bad type inference of mixed-precision Decimals
> ---
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
> Fix For: 4.0.0
>
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12150) [Python] Bad type inference of mixed-precision Decimals

2021-03-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311542#comment-17311542
 ] 

Antoine Pitrou commented on ARROW-12150:


[~jorisvandenbossche]

> [Python] Bad type inference of mixed-precision Decimals
> ---
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12150) [Python] Bad type inference of mixed-precision Decimals

2021-03-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311541#comment-17311541
 ] 

Antoine Pitrou commented on ARROW-12150:


Actually the problem is that converting from Python Decimals with mixed 
precision produces the wrong Arrow decimal type:
{code:python}
>>> arr = pa.array([Decimal('1.234'), Decimal('456.7')])
>>> arr
<pyarrow.lib.Decimal128Array object at 0x...>
[
  1.234,
  456.700
]
>>> arr.type
Decimal128Type(decimal128(4, 3))
# BUG: 456.7 doesn't fit in decimal128(4, 3)!
{code}

You can work around the issue by specifying the column types explicitly when 
creating your table, for example:
{code:python}
decs_from_values = pa.array([Decimal(v) for v in values_floats],
                            type=pa.decimal256(54, 51))
{code}
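
A minimal round-trip check of that workaround (the file name and the {{decimal128(7, 3)}} type below are illustrative choices, picked to fit both example values):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal

# Give the column an explicit decimal type instead of relying on inference
# from the first element; decimal128(7, 3) fits both example values.
arr = pa.array([Decimal('1.234'), Decimal('456.7')], type=pa.decimal128(7, 3))
table = pa.table({"col": arr})

pq.write_table(table, "decimals.parquet")
print(pq.read_table("decimals.parquet").to_pydict())
# {'col': [Decimal('1.234'), Decimal('456.700')]}
{code}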

As a side note, it is bad practice to instantiate decimals from floating-point 
numbers, because floating-point numbers can't exactly represent all decimal 
numbers, which can lead to excessive digits in the results, e.g.:
{code:python}
>>> Decimal(579.119995117188)
Decimal('579.11999511718795474735088646411895751953125')
>>> Decimal("579.119995117188")
Decimal('579.119995117188')
{code}


> [Python] Bad type inference of mixed-precision Decimals
> ---
>
> Key: ARROW-12150
> URL: https://issues.apache.org/jira/browse/ARROW-12150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: - macOS Big Sur 11.2.1
> - python 3.8.2
>Reporter: abdel alfahham
>Priority: Major
>
> Exporting _pyarrow.table_ that contains mixed-precision _Decimals_ using  
> _parquet.write_table_ creates a parquet that contains invalid data/values.
> In the example below the first value of _data_decimal_ is turned from 
> Decimal('579.119995117187954747350886464118957519531250') in the 
> pyarrow table to 
> Decimal('-378.68971792399258172661600550482428224218070136475136') in the 
> parquet.
>  
> {code:java}
> import pyarrow
> from decimal import Decimal
> values_floats = [579.119995117188, 6.4084741211, 2.0] # floats
> decs_from_values = [Decimal(v) for v in values_floats] # Decimal
> decs_from_float = [Decimal.from_float(v) for v in values_floats]
> decs_str = [Decimal(str(v)) for v in values_floats] # Decimal
> data_dict = {"data_decimal": decs_from_values, # python Decimal
>  "data_decimal_from_float": decs_from_float,
>  "data_float":values_floats, # python floats
>  "data_dec_str": decs_str}
> table = pyarrow.table(data=data_dict)
> print(table.to_pydict()) # before saving
> pyarrow.parquet.write_table(table, "./pyarrow_decimal.parquet") # saving
> print(pyarrow.parquet.read_table("./pyarrow_decimal.parquet").to_pydict()) # 
> after saving
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12089) [Doc] Fix warnings when building Sphinx docs

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-12089:
--

Assignee: Antoine Pitrou

> [Doc] Fix warnings when building Sphinx docs
> 
>
> Key: ARROW-12089
> URL: https://issues.apache.org/jira/browse/ARROW-12089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 4.0.0
>
>
> Some warnings are due to invalid markup or ambiguous cross-references.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12108) [Rust][DataFusion] Support `SHOW TABLES`

2021-03-30 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-12108:

Component/s: Rust - DataFusion

> [Rust][DataFusion] Support `SHOW TABLES`
> 
>
> Key: ARROW-12108
> URL: https://issues.apache.org/jira/browse/ARROW-12108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12108) [Rust][DataFusion] Support `SHOW TABLES`

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12108:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Support `SHOW TABLES`
> 
>
> Key: ARROW-12108
> URL: https://issues.apache.org/jira/browse/ARROW-12108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12089) [Doc] Fix warnings when building Sphinx docs

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12089:
---
Labels: pull-request-available  (was: )

> [Doc] Fix warnings when building Sphinx docs
> 
>
> Key: ARROW-12089
> URL: https://issues.apache.org/jira/browse/ARROW-12089
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Some warnings are due to invalid markup or ambiguous cross-references.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12120) [Rust] Generate random arrays and batches

2021-03-30 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-12120.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9824
[https://github.com/apache/arrow/pull/9824]

> [Rust] Generate random arrays and batches
> -
>
> Key: ARROW-12120
> URL: https://issues.apache.org/jira/browse/ARROW-12120
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I need a random data generator for the Parquet <> Arrow integration. It takes 
> me a while to craft a test case, so being able to create random data would 
> make it a bit easier to improve test coverage and catch edge-cases in the 
> code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12068) [Python] Stop using distutils

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12068:
---
Fix Version/s: (was: 5.0.0)
   4.0.0

> [Python] Stop using distutils
> -
>
> Key: ARROW-12068
> URL: https://issues.apache.org/jira/browse/ARROW-12068
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 4.0.0
>
>
> According to [PEP 632|https://www.python.org/dev/peps/pep-0632/], distutils 
> will be deprecated in Python 3.10 and removed in 3.12.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12068) [Python] Stop using distutils

2021-03-30 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-12068:
--

Assignee: Antoine Pitrou

> [Python] Stop using distutils
> -
>
> Key: ARROW-12068
> URL: https://issues.apache.org/jira/browse/ARROW-12068
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: 4.0.0
>
>
> According to [PEP 632|https://www.python.org/dev/peps/pep-0632/], distutils 
> will be deprecated in Python 3.10 and removed in 3.12.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12068) [Python] Stop using distutils

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12068:
---
Labels: pull-request-available  (was: )

> [Python] Stop using distutils
> -
>
> Key: ARROW-12068
> URL: https://issues.apache.org/jira/browse/ARROW-12068
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> According to [PEP 632|https://www.python.org/dev/peps/pep-0632/], distutils 
> will be deprecated in Python 3.10 and removed in 3.12.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12152) [Docs] Add Jira component + summary conventions to the docs

2021-03-30 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-12152:
--

 Summary: [Docs] Add Jira component + summary conventions to the 
docs
 Key: ARROW-12152
 URL: https://issues.apache.org/jira/browse/ARROW-12152
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Jonathan Keane
Assignee: Jonathan Keane


When we duplicate the component information in the summary, we follow a number of 
conventions. Let's add them to the documentation around 
https://arrow.apache.org/docs/developers/contributing.html#tips-for-using-jira 
to give people submitting issues a chance to get this right:

For the components:
* Continuous Integration — summary: [CI]
* Developer Tools — summary: [Dev]
* Documentation — summary: [Docs]

All others should be the same for components and summary (e.g. component: 
Python summary: [Python])





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12151) [Docs] Add Jira component + summary conventions to the docs

2021-03-30 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-12151:
--

 Summary: [Docs] Add Jira component + summary conventions to the 
docs
 Key: ARROW-12151
 URL: https://issues.apache.org/jira/browse/ARROW-12151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Jonathan Keane
Assignee: Jonathan Keane


When we duplicate the component information in the summary, we follow a number of 
conventions. Let's add them to the documentation around 
https://arrow.apache.org/docs/developers/contributing.html#tips-for-using-jira 
to give people submitting issues a chance to get this right:

For the components:
* Continuous Integration — summary: [CI]
* Developer Tools — summary: [Dev]
* Documentation — summary: [Docs]

All others should be the same for components and summary (e.g. component: 
Python summary: [Python])





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12152) [Docs] Add Jira component + summary conventions to the docs

2021-03-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12152.
--
Resolution: Fixed

> [Docs] Add Jira component + summary conventions to the docs
> ---
>
> Key: ARROW-12152
> URL: https://issues.apache.org/jira/browse/ARROW-12152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Trivial
>
> When we duplicate the component information in the summary, we follow a number 
> of conventions. Let's add them to the documentation around 
> https://arrow.apache.org/docs/developers/contributing.html#tips-for-using-jira
> to give people submitting issues a chance to get this right:
> For the components:
> * Continuous Integration — summary: [CI]
> * Developer Tools — summary: [Dev]
> * Documentation — summary: [Docs]
> All others should be the same for components and summary (e.g. component: 
> Python summary: [Python])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12153) [Rust] [Parquet] Return file metadata after writing Parquet file

2021-03-30 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12153:
--

 Summary: [Rust] [Parquet] Return file metadata after writing 
Parquet file
 Key: ARROW-12153
 URL: https://issues.apache.org/jira/browse/ARROW-12153
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale
Assignee: Neville Dipale


Parquet writers like delta-rs rely on the Parquet metadata to write file-level 
statistics for file pruning purposes.

We currently do not expose these stats, requiring the writer to read the file 
that has just been written, to get the stats. This is more problematic for 
in-memory sinks, as there is currently no way of getting the metadata from the 
sink before it's persisted.

Explore if we can expose these stats to the writer, to make the above easier.
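
For context, a minimal sketch of the workaround described above, assuming the Rust {{parquet}} crate: re-open the file that was just written and read its footer to recover the row-group statistics. The proposal is to make this re-read unnecessary by handing the metadata back from the writer itself.
{code}
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

// Current workaround: read back the footer of a file we just wrote in order
// to collect row-group statistics for pruning.
fn read_back_metadata(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let metadata = reader.metadata();
    for row_group in metadata.row_groups() {
        println!(
            "row group: {} rows, {} bytes",
            row_group.num_rows(),
            row_group.total_byte_size()
        );
    }
    Ok(())
}
{code}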



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12154) [C++][Gandiva] Fix gandiva crash in certain VM/CPU combinations

2021-03-30 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-12154:
--

 Summary: [C++][Gandiva] Fix gandiva crash in certain VM/CPU 
combinations
 Key: ARROW-12154
 URL: https://issues.apache.org/jira/browse/ARROW-12154
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


When running Gandiva in a VM that doesn't provide all the features of the host 
CPU, specifically vector instructions like AVX-512 that need VM support (either 
because the VM is an older version that doesn't support them, or because 
passthrough is disabled for these features), llvm::sys::getHostCPUName still 
detects the host processor with these features. Gandiva therefore generates 
JIT-compiled code containing these vector instructions, which the guest OS cannot 
execute, and the process faults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12153) [Rust] [Parquet] Return file metadata after writing Parquet file

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12153:
---
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Return file metadata after writing Parquet file
> 
>
> Key: ARROW-12153
> URL: https://issues.apache.org/jira/browse/ARROW-12153
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Parquet writers like delta-rs rely on the Parquet metadata to write 
> file-level statistics for file pruning purposes.
> We currently do not expose these stats, requiring the writer to read the file 
> that has just been written, to get the stats. This is more problematic for 
> in-memory sinks, as there is currently no way of getting the metadata from 
> the sink before it's persisted.
> Explore if we can expose these stats to the writer, to make the above easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12155) [R] Require Table columns to be same length

2021-03-30 Thread Ian Cook (Jira)
Ian Cook created ARROW-12155:


 Summary: [R] Require Table columns to be same length
 Key: ARROW-12155
 URL: https://issues.apache.org/jira/browse/ARROW-12155
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 3.0.0
Reporter: Ian Cook
Assignee: Ian Cook
 Fix For: 4.0.0


An error is thrown if the user attempts to create a RecordBatch with different 
length arrays:
{code:java}
> arrow::record_batch(a=1:5, b = 42)
Error: Invalid: All arrays must have the same length {code}
But no error is thrown if the user attempts to create a Table with different 
length columns. Instead we get garbage in the table:
{code:java}
Table$create(a=1:5, b = 42) %>% collect()
# A tibble: 5 x 2
  a b
   
1 1 4.20e+  1
2 2 6.94e-310
3 3 6.94e-310
4 4 6.94e-310
5 5 6.94e-310  {code}
Change the behavior for Table creation to match the current behavior of 
RecordBatch creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12155) [R] Require Table columns to be same length

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12155:
---
Labels: pull-request-available  (was: )

> [R] Require Table columns to be same length
> ---
>
> Key: ARROW-12155
> URL: https://issues.apache.org/jira/browse/ARROW-12155
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> An error is thrown if the user attempts to create a RecordBatch with 
> different length arrays:
> {code:java}
> > arrow::record_batch(a=1:5, b = 42)
> Error: Invalid: All arrays must have the same length {code}
> But no error is thrown if the user attempts to create a Table with different 
> length columns. Instead we get garbage in the table:
> {code:java}
> Table$create(a=1:5, b = 42) %>% collect()
> # A tibble: 5 x 2
>   a b
>
> 1 1 4.20e+  1
> 2 2 6.94e-310
> 3 3 6.94e-310
> 4 4 6.94e-310
> 5 5 6.94e-310  {code}
> Change the behavior for Table creation to match the current behavior of 
> RecordBatch creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12154) [C++][Gandiva] Fix gandiva crash in certain OS/CPU combinations

2021-03-30 Thread Projjal Chanda (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Projjal Chanda updated ARROW-12154:
---
Summary: [C++][Gandiva] Fix gandiva crash in certain OS/CPU combinations  
(was: [C++][Gandiva] Fix gandiva crash in certain VM/CPU combinations)

> [C++][Gandiva] Fix gandiva crash in certain OS/CPU combinations
> ---
>
> Key: ARROW-12154
> URL: https://issues.apache.org/jira/browse/ARROW-12154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>
> When running Gandiva in a VM that doesn't provide all the features of the host 
> CPU, specifically vector instructions like AVX-512 that need VM support (either 
> because the VM is an older version that doesn't support them, or because 
> passthrough is disabled for these features), llvm::sys::getHostCPUName still 
> detects the host processor with these features. Gandiva therefore generates 
> JIT-compiled code containing these vector instructions, which the guest OS 
> cannot execute, and the process faults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12154) [C++][Gandiva] Fix gandiva crash in certain OS/CPU combinations

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12154:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Fix gandiva crash in certain OS/CPU combinations
> ---
>
> Key: ARROW-12154
> URL: https://issues.apache.org/jira/browse/ARROW-12154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When running Gandiva in a VM that doesn't provide all the features of the host 
> CPU, specifically vector instructions like AVX-512 that need VM support (either 
> because the VM is an older version that doesn't support them, or because 
> passthrough is disabled for these features), llvm::sys::getHostCPUName still 
> detects the host processor with these features. Gandiva therefore generates 
> JIT-compiled code containing these vector instructions, which the guest OS 
> cannot execute, and the process faults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12153) [Rust] [Parquet] Return file metadata after writing Parquet file

2021-03-30 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-12153:
---
Component/s: Rust

> [Rust] [Parquet] Return file metadata after writing Parquet file
> 
>
> Key: ARROW-12153
> URL: https://issues.apache.org/jira/browse/ARROW-12153
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Parquet writers like delta-rs rely on the Parquet metadata to write 
> file-level statistics for file pruning purposes.
> We currently do not expose these stats, requiring the writer to read the file 
> that has just been written, to get the stats. This is more problematic for 
> in-memory sinks, as there is currently no way of getting the metadata from 
> the sink before it's persisted.
> Explore if we can expose these stats to the writer, to make the above easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12076) [Rust] Fix build

2021-03-30 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-12076:
---
Summary: [Rust] Fix build  (was: Fix build)

> [Rust] Fix build
> 
>
> Key: ARROW-12076
> URL: https://issues.apache.org/jira/browse/ARROW-12076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> There was a logical conflict between 
> https://github.com/apache/arrow/commit/eebf64b00e3a26f61c4bebec7241a0b24d27ec67
>  which removed the Arc in `ArrayData` and  
> https://github.com/apache/arrow/commit/8dd6abbb72b6b8958f3b2f35512bdadcaf43066f
>  which optimized the compute kernels.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12156) [Rust] Calculate the size of a RecordBatch

2021-03-30 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12156:
--

 Summary: [Rust] Calculate the size of a RecordBatch
 Key: ARROW-12156
 URL: https://issues.apache.org/jira/browse/ARROW-12156
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale


We can compute the size of an array, but there's no facility yet to compute the 
size of a recordbatch.

This is useful if we need to measure the size of data we're about to write.
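
A minimal sketch of the kind of helper this asks for, built from the per-column sizes the Rust {{Array}} trait already exposes (a proper implementation would still need to decide how to account for shared buffers):
{code}
use arrow::record_batch::RecordBatch;

// Approximate the in-memory size of a batch by summing the reported size of
// each of its columns. Buffers shared between columns (e.g. sliced arrays)
// may be counted more than once in this naive version.
fn record_batch_size(batch: &RecordBatch) -> usize {
    batch
        .columns()
        .iter()
        .map(|column| column.get_array_memory_size())
        .sum()
}
{code}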



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9878) [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory

2021-03-30 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-9878:

Component/s: Documentation

> [Python] table to_pandas self_destruct=True + split_blocks=True cannot 
> prevent doubling memory
> --
>
> Key: ARROW-9878
> URL: https://issues.apache.org/jira/browse/ARROW-9878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Affects Versions: 0.17.1, 1.0.0
>Reporter: Weichen Xu
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: t001.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>  
> Reproduce code:
> Generate about 800MB data first.
> {code:java}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10]* 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>   writer = pa.ipc.new_stream(f1, batch.schema)
>   for i in range(10):
>   writer.write_batch(batch)
>   writer.close()
> {code}
> Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>   reader = pa.ipc.open_stream(f1)
>   batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, 
> use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is psrecord profiling result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9878) [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory

2021-03-30 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-9878.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9730
[https://github.com/apache/arrow/pull/9730]

> [Python] table to_pandas self_destruct=True + split_blocks=True cannot 
> prevent doubling memory
> --
>
> Key: ARROW-9878
> URL: https://issues.apache.org/jira/browse/ARROW-9878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1, 1.0.0
>Reporter: Weichen Xu
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: t001.png
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>  
> Reproduce code:
> Generate about 800MB data first.
> {code:java}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10]* 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>   writer = pa.ipc.new_stream(f1, batch.schema)
>   for i in range(10):
>   writer.write_batch(batch)
>   writer.close()
> {code}
> Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>   reader = pa.ipc.open_stream(f1)
>   batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, 
> use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is psrecord profiling result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12157) Implement like function for regex expressions

2021-03-30 Thread Jira
João Victor Huguenin created ARROW-12157:


 Summary: Implement like function for regex expressions
 Key: ARROW-12157
 URL: https://issues.apache.org/jira/browse/ARROW-12157
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: João Victor Huguenin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12157) Implement like function for regex expressions

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12157:
---
Labels: pull-request-available  (was: )

> Implement like function for regex expressions
> -
>
> Key: ARROW-12157
> URL: https://issues.apache.org/jira/browse/ARROW-12157
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: João Victor Huguenin
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12158) [Rust][DataFusion]: Implement support for the `now()` sql function

2021-03-30 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-12158:
---

 Summary: [Rust][DataFusion]: Implement support for the `now()` sql 
function
 Key: ARROW-12158
 URL: https://issues.apache.org/jira/browse/ARROW-12158
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb
Assignee: Andrew Lamb


Use case: selecting the last 5 minutes of data

I would like to be able to run queries like this:
{code}
select * from cpu where time > now() - interval '3' minute;
{code}

Proposed implementation:
follow postgres functions:  
https://www.postgresql.org/docs/current/functions-datetime.html




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12159) [Rust][DataFusion] Support grouping on expressions

2021-03-30 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-12159:
---

 Summary: [Rust][DataFusion] Support grouping on expressions
 Key: ARROW-12159
 URL: https://issues.apache.org/jira/browse/ARROW-12159
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb


Use case:

I want to group based on time windows (as defined by the `date_trunc` 
function). 

For example, given the table:

{code}
| cpu | host | time | usage_guest | usage_guest_nice | usage_idle | usage_iowait | usage_irq | usage_nice | usage_softirq | usage_steal | usage_system | usage_user |
| cpu0 | MacBook-Pro.local | 16171301300 | 0 | 0 | 65.30408773649165 | 0 | 0 | 0 | 0 | 0 | 18.444666002000673 | 16.251246261217506 |
| cpu1 | MacBook-Pro.local | 16171301300 | 0 | 0 | 84.43113772402216 | 0 | 0 | 0 | 0 | 0 | 3.193612774446795 | 12.37524950097282 |
| cpu2 | MacBook-Pro.local | 16171301300 | 0 | 0 | 65.96806387199344 | 0 | 0 | 0 | 0 | 0 | 15.469061876247794 | 18.56287425146831 |
| cpu3 | MacBook-Pro.local | 16171301300 | 0 | 0 | 84.0478564307993 | 0 | 0 | 0 | 0 | 0 | 3.0907278165770684 | 12.861415752863932 |
| cpu4 | MacBook-Pro.local | 16171301300 | 0 | 0 | 63.21036889281897 | 0 | 0 | 0 | 0 | 0 | 13.758723828377473 | 23.030907278223218 |
| cpu5 | MacBook-Pro.local | 16171301300 | 0 | 0 | 83.94815553242313 | 0 | 0 | 0 | 0 | 0 | 2.991026919231221 | 13.0608175473346 |
| cpu6 | MacBook-Pro.local | 16171301300 | 0 | 0 | 70.85828343276965 | 0 | 0 | 0 | 0 | 0 | 12.87425149699077 | 16.26746506987651 |
| cpu7 | MacBook-Pro.local | 16171301300 | 0 | 0 | 83.9321357287122 | 0 | 0 | 0 | 0 | 0 | 3.093812375243205 | 12.974051896176206 |
| cpu8 | MacBook-Pro.local | 16171301300 | 0 | 0 | 74.80079681313936 | 0 | 0 | 0 | 0 | 0 | 10.756972111708253 | 14.442231075949556 |
| cpu9 | MacBook-Pro.local | 16171301300 | 0 | 0 | 83.84845463618315 | 0 | 0 | 0 | 0 | 0 | 3.0907278165434624 | 13.060817547316466 |
{code}

I want to be able to find the min and max usage time grouped by minute

{code}
select 
  date_trunc('minute', cast (time as timestamp)), 
  min(usage_user), 
  max(usage_user) 
from
  cpu 
group by 
  date_trunc('minute', cast (time as timestamp)), min(usage_user)
{code}

Or alternately

{code}
select 
  date_trunc('minute', cast (time as timestamp)), 
  min(usage_user), 
  max(usage_user) 
from
  cpu 
group by 
  1
{code}



Instead, as of now I get a planning error:
{code}
Error preparing query Error during planning: Projection references non-aggregate values
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12128) [CI][Crossbow] Remove (or fix) test-ubuntu-16.04-cpp job

2021-03-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-12128.
--
Resolution: Fixed

Issue resolved by pull request 9845
[https://github.com/apache/arrow/pull/9845]

> [CI][Crossbow] Remove (or fix) test-ubuntu-16.04-cpp job
> 
>
> Key: ARROW-12128
> URL: https://issues.apache.org/jira/browse/ARROW-12128
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Continuous Integration
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ARROW-8049 increased the minimum cmake version required for bundled thrift to 
> 3.10, which is not what 16.04 ships. We removed packaging jobs in ARROW-11910 
> because it is EOL in April 2021, but we still have a nightly job that is 
> failing and other related materials (Dockerfile etc.) for 16.04.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12160) [Rust] Add an `into_inner()` method to ipc::writer::StreamWriter

2021-03-30 Thread Eric Burden (Jira)
Eric Burden created ARROW-12160:
---

 Summary: [Rust] Add an `into_inner()` method to 
ipc::writer::StreamWriter
 Key: ARROW-12160
 URL: https://issues.apache.org/jira/browse/ARROW-12160
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 4.0.0
Reporter: Eric Burden
Assignee: Eric Burden


Add an `into_inner()` method to ipc::writer::StreamWriter, allowing users to 
recover the underlying writer, consuming the StreamWriter. Essentially exposes 
`into_inner()` from the BufWriter contained in the StreamWriter.
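
Illustrative usage of the proposed method; the {{into_inner()}} call and its exact signature are exactly what this issue proposes, so they are assumed here rather than part of the published API:
{code}
use arrow::error::Result;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

// Write a batch to an in-memory buffer, then recover the buffer once the
// stream is finished. `into_inner()` is the method proposed by this issue.
fn write_to_vec(batch: &RecordBatch) -> Result<Vec<u8>> {
    let mut writer = StreamWriter::try_new(Vec::new(), &batch.schema())?;
    writer.write(batch)?;
    writer.finish()?;
    writer.into_inner()
}
{code}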



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9731) [C++][Dataset] Port "head" method from R to C++ Dataset Scanner

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9731:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Port "head" method from R to C++ Dataset Scanner
> ---
>
> Key: ARROW-9731
> URL: https://issues.apache.org/jira/browse/ARROW-9731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-9665 (https://github.com/apache/arrow/pull/7913) added amongst other 
> things a {{head}} method for Dataset in R:
> https://github.com/apache/arrow/blob/586c060c8b1851f1077911fae6d02a10ed83e7fb/r/src/dataset.cpp#L266-L282
> It might be nice to move this to C++ and expose it on the python side as well 
> (and since it's written already in C++ on the R side, it should be relatively 
> straightforward to port I assume)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11336) [C++][Doc] Improve Developing on Windows docs

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11336:
---
Labels: pull-request-available  (was: )

> [C++][Doc] Improve Developing on Windows docs
> -
>
> Key: ARROW-11336
> URL: https://issues.apache.org/jira/browse/ARROW-11336
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Update and improve the "Developing on Windows" docs page:
>  * Add instructions for using Visual Studio 2019
>  * Add instructions for option to use vcpkg instead of conda for build 
> dependencies
>  ** Mention that when you use {{ARROW_DEPENDENCY_SOURCE=VCPKG}}, vcpkg will 
> (depending on its configuration) actually download, build, and install the 
> C++ library dependencies for you if it can't find them; this differs from 
> other dependency sources which require a prior installation
>  * Describe required Visual Studio configuration
>  * Improve some ambiguous instructions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11180) [Developer] cmake-format pre-commit hook doesn't run

2021-03-30 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11180.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9045
[https://github.com/apache/arrow/pull/9045]

> [Developer] cmake-format pre-commit hook doesn't run
> 
>
> Key: ARROW-11180
> URL: https://issues.apache.org/jira/browse/ARROW-11180
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Marco Gorelli
>Assignee: Marco Gorelli
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently, `entry echo` overwrites the actual command and so nothing gets run



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11858) [GLib] Gandiva Filter in GLib

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11858:
---
Labels: pull-request-available  (was: )

> [GLib] Gandiva Filter in GLib
> -
>
> Key: ARROW-11858
> URL: https://issues.apache.org/jira/browse/ARROW-11858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 3.0.0
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was trying to use Gandiva under Ruby.  I was able to get a Projection 
> working because that is annotated.  I was trying to do a Gandiva filter, but 
> this doesn't seem to be available with Gandiva: it is not listed in the GLib 
> documentation.  Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12145) [Developer][Archery] Flaky test: test_static_runner_from_json

2021-03-30 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-12145.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9843
[https://github.com/apache/arrow/pull/9843]

> [Developer][Archery] Flaky test: test_static_runner_from_json
> -
>
> Key: ARROW-12145
> URL: https://issues.apache.org/jira/browse/ARROW-12145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Diana Clarke
>Assignee: Diana Clarke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> This test assumes:
> {code}
>  artificial_reg, normal = RunnerComparator(contender, baseline).comparisons
> {code}
> But the return order could be:
> {code}
>  normal, artificial_reg = RunnerComparator(contender, baseline).comparisons
> {code}
> The return order of {{comparisons}} isn't deterministic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12161) [C++] Async streaming CSV reader deadlocking when being run synchronously from datasets

2021-03-30 Thread Weston Pace (Jira)
Weston Pace created ARROW-12161:
---

 Summary: [C++] Async streaming CSV reader deadlocking when being 
run synchronously from datasets
 Key: ARROW-12161
 URL: https://issues.apache.org/jira/browse/ARROW-12161
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


ARROW-11887 added async to the streaming CSV reader.  In order to keep 
backwards compatibility the old sync API simply calls the async API and waits 
for it to finish.  However, that wait cannot happen safely in a "nested" 
context (e.g. dataset reading).

For example, imagine two cores.  The dataset read launches two CSV scans.  Each 
scan occupies a core waiting for a future.  Those futures are being filled by 
I/O threads.  The I/O threads finish and go to transfer.  The transfer cannot 
happen because the CPU executor is filled.

This will be fixed as part of ARROW-7001, but that is still some ways away.  An 
easier change might be to take some of the 7001 changes and include them as 
part of the 11887 feature.
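
To make the failure mode concrete, a deliberately simplified, std-only Rust sketch of the same pattern (an analogy, not Arrow's actual executor): a single-worker "CPU pool" runs a task that blocks on a result which can only be produced by another task queued behind it on the same pool, so the program hangs.
{code}
use std::sync::mpsc;
use std::thread;

fn main() {
    let (task_tx, task_rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();

    // Single-worker "CPU executor": runs queued tasks one at a time.
    let pool = thread::spawn(move || {
        for task in task_rx {
            task();
        }
    });

    let (result_tx, result_rx) = mpsc::channel::<i32>();

    // "Scan" task: occupies the only worker while waiting for its future.
    task_tx
        .send(Box::new(move || {
            // Never returns: the producing task is stuck behind us in the queue.
            let value = result_rx.recv().unwrap();
            println!("scan finished with {}", value);
        }))
        .unwrap();

    // "Transfer" task that would complete the future: never gets to run.
    task_tx
        .send(Box::new(move || {
            result_tx.send(42).unwrap();
        }))
        .unwrap();

    drop(task_tx);
    pool.join().unwrap(); // deadlock: this join never completes
}
{code}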



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12161) [C++] Async streaming CSV reader deadlocking when being run synchronously from datasets

2021-03-30 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-12161:

Issue Type: Bug  (was: Improvement)

> [C++] Async streaming CSV reader deadlocking when being run synchronously 
> from datasets
> ---
>
> Key: ARROW-12161
> URL: https://issues.apache.org/jira/browse/ARROW-12161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> ARROW-11887 added async to the streaming CSV reader.  In order to keep 
> backwards compatibility the old sync API simply calls the async API and waits 
> for it to finish.  However, that wait cannot happen safely in a "nested" 
> context (e.g. dataset reading).
> For example, imagine two cores.  The dataset read launches two CSV scans.  
> Each scan occupies a core waiting for a future.  Those futures are being 
> filled by I/O threads.  The I/O threads finish and go to transfer.  The 
> transfer cannot happen because the CPU executor is filled.
> This will be fixed as part of ARROW-7001, but that is still some ways away.  An 
> easier change might be to take some of the 7001 changes and include them as 
> part of the 11887 feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

2021-03-30 Thread David Wales (Jira)
David Wales created ARROW-12162:
---

 Summary: [R] read_parquet returns Invalid UTF8 payload
 Key: ARROW-12162
 URL: https://issues.apache.org/jira/browse/ARROW-12162
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 3.0.0
 Environment: Windows 10
R 4.0.3
arrow 3.0.0
dbplyr 2.0.0
dplyr 1.0.2
Reporter: David Wales
 Attachments: bad_char.rds

h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file:

 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload":

 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following:

 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

2021-03-30 Thread David Wales (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wales updated ARROW-12162:

Description: 
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}

  was:
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file:

 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload":

 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following:

 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}


> [R] read_parquet returns Invalid UTF8 payload
> -
>
> Key: ARROW-12162
> URL: https://issues.apache.org/jira/browse/ARROW-12162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
> Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>Reporter: David Wales
>Priority: Major
> Attachments: bad_char.rds
>
>
> h2. Background
> I am using the R arrow library.
> I am reading from an SQL Server database with the `latin1` encoding using 
> `dbplyr` and saving the output as a parquet file: 
> {code:java}
> # Assume `con` is a previously established connection to the database created 
> with DBI::dbConnect
> tbl(con, in_schema("dbo", "latin1_table")) %>%
>   collect() %>%
>   write_parquet("output.parquet")
> {code}
>  
> However, when I try to read the file back, I get the error "Invalid UTF8 
> payload": 
> {code:java}
> > read_parquet("output.parquet")
> Error: Invalid: Invalid UTF8 payload
> {code}
> h2. Minimal Reproducible Example
> I have isolated this issue to a minimal reproducible example.
> If the database table contains the latin1 single quote character, then it 
> will trigger the error.
> I have attached a `.rds` file which contains an example tibble.
> To reproduce, run the following: 
> {code:java}
> readRDS(file.path(data_dir, "bad_char.rds")) %>% 
> write_parquet(file.path(data_dir, "bad_char.parquet"))
> read_parquet(file.path(data_dir, "bad_char.parquet"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

2021-03-30 Thread David Wales (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wales updated ARROW-12162:

Description: 
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}

Is it possibly related to this issue?
https://issues.apache.org/jira/browse/ARROW-12007

  was:
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}


> [R] read_parquet returns Invalid UTF8 payload
> -
>
> Key: ARROW-12162
> URL: https://issues.apache.org/jira/browse/ARROW-12162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
> Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>Reporter: David Wales
>Priority: Major
> Attachments: bad_char.rds
>
>
> h2. Background
> I am using the R arrow library.
> I am reading from an SQL Server database with the `latin1` encoding using 
> `dbplyr` and saving the output as a parquet file: 
> {code:java}
> # Assume `con` is a previously established connection to the database created 
> with DBI::dbConnect
> tbl(con, in_schema("dbo", "latin1_table")) %>%
>   collect() %>%
>   write_parquet("output.parquet")
> {code}
>  
> However, when I try to read the file back, I get the error "Invalid UTF8 
> payload": 
> {code:java}
> > read_parquet("output.parquet")
> Error: Invalid: Invalid UTF8 payload
> {code}
> h2. Minimal Reproducible Example
> I have isolated this issue to a minimal reproducible example.
> If the database table contains the latin1 single quote character, then it 
> will trigger the error.
> I have attached a `.rds` file which contains an example tibble.
> To reproduce, run the following: 
> {code:java}
> readRDS(file.path(data_dir, "bad_char.rds")) %>% 
> write_parquet(file.path(data_dir, "bad_char.parquet"))
> read_parquet(file.path(data_dir, "bad_char.parquet"))
> {code}
> Is it possibly related to this issue?
> https://issues.apache.org/jira/browse/ARROW-12007



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

2021-03-30 Thread David Wales (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wales updated ARROW-12162:

Description: 
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007

  was:
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}

Is it possibly related to this issue?
https://issues.apache.org/jira/browse/ARROW-12007


> [R] read_parquet returns Invalid UTF8 payload
> -
>
> Key: ARROW-12162
> URL: https://issues.apache.org/jira/browse/ARROW-12162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
> Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>Reporter: David Wales
>Priority: Major
> Attachments: bad_char.rds
>
>
> h2. Background
> I am using the R arrow library.
> I am reading from an SQL Server database with the `latin1` encoding using 
> `dbplyr` and saving the output as a parquet file: 
> {code:java}
> # Assume `con` is a previously established connection to the database created 
> with DBI::dbConnect
> tbl(con, in_schema("dbo", "latin1_table")) %>%
>   collect() %>%
>   write_parquet("output.parquet")
> {code}
>  
> However, when I try to read the file back, I get the error "Invalid UTF8 
> payload": 
> {code:java}
> > read_parquet("output.parquet")
> Error: Invalid: Invalid UTF8 payload
> {code}
> h2. Minimal Reproducible Example
> I have isolated this issue to a minimal reproducible example.
> If the database table contains the latin1 single quote character, then it 
> will trigger the error.
> I have attached a `.rds` file which contains an example tibble.
> To reproduce, run the following: 
> {code:java}
> readRDS(file.path(data_dir, "bad_char.rds")) %>% 
> write_parquet(file.path(data_dir, "bad_char.parquet"))
> read_parquet(file.path(data_dir, "bad_char.parquet"))
> {code}
> h2. Possibly related issues
> https://issues.apache.org/jira/browse/ARROW-12007



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12160) [Rust] Add an `into_inner()` method to ipc::writer::StreamWriter

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12160:
---
Labels: pull-request-available  (was: )

> [Rust] Add an `into_inner()` method to ipc::writer::StreamWriter
> 
>
> Key: ARROW-12160
> URL: https://issues.apache.org/jira/browse/ARROW-12160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 4.0.0
>Reporter: Eric Burden
>Assignee: Eric Burden
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 10m
>  Remaining Estimate: 3h 50m
>
> Add an `into_inner()` method to ipc::writer::StreamWriter, allowing users to 
> recover the underlying writer, consuming the StreamWriter. Essentially 
> exposes `into_inner()` from the BufWriter contained in the StreamWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

2021-03-30 Thread David Wales (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wales updated ARROW-12162:

Description: 
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
 

What I would really like is a way to tell arrow "This data is latin1 encoded. 
Please convert it to UTF-8 before you save it as a Parquet file".

Or alternatively "This Parquet file contains latin1 encoded data".
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007

  was:
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database created 
with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%

  collect() %>%

  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>% 
write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007


> [R] read_parquet returns Invalid UTF8 payload
> -
>
> Key: ARROW-12162
> URL: https://issues.apache.org/jira/browse/ARROW-12162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
> Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>Reporter: David Wales
>Priority: Major
> Attachments: bad_char.rds
>
>
> h2. Background
> I am using the R arrow library.
> I am reading from an SQL Server database with the `latin1` encoding using 
> `dbplyr` and saving the output as a parquet file: 
> {code:java}
> # Assume `con` is a previously established connection to the database created 
> with DBI::dbConnect
> tbl(con, in_schema("dbo", "latin1_table")) %>%
>   collect() %>%
>   write_parquet("output.parquet")
> {code}
>  
> However, when I try to read the file back, I get the error "Invalid UTF8 
> payload": 
> {code:java}
> > read_parquet("output.parquet")
> Error: Invalid: Invalid UTF8 payload
> {code}
>  
> What I would really like is a way to tell arrow "This data is latin1 encoded. 
> Please convert it to UTF-8 before you save it as a Parquet file".
> Or alternatively "This Parquet file contains latin1 encoded data".
> h2. Minimal Reproducible Example
> I have isolated this issue to a minimal reproducible example.
> If the database table contains the latin1 single quote character, then it 
> will trigger the error.
> I have attached a `.rds` file which contains an example tibble.
> To reproduce, run the following: 
> {code:java}
> readRDS(file.path(data_dir, "bad_char.rds")) %>% 
> write_parquet(file.path(data_dir, "bad_char.parquet"))
> read_parquet(file.path(data_dir, "bad_char.parquet"))
> {code}
> h2. Possibly related issues
> https://issues.apache.org/jira/browse/ARROW-12007



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12161) [C++] Async streaming CSV reader deadlocking when being run synchronously from datasets

2021-03-30 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312026#comment-17312026
 ] 

Weston Pace commented on ARROW-12161:
-

So there are a few choices here.  In the meantime, if it is going to take some 
time to fix, I'd recommend reverting ARROW-11887 while the fix is worked 
through.  I've created [https://github.com/apache/arrow/pull/9859] as a 
convenience.

Option #1: Leave 11887 out, include it as part of ARROW-7001

-Pros: No wasted work

-Cons: Makes ARROW-7001 an even larger change.

 

Option #2: Bring part of ARROW-7001 in and patch it in.  Basically add 
`supports_async()` and `ExecuteAsync()` to `ScanTask` and then modify 
`Scanner::ToTable` so that it will create a task group (for synchronous scan 
tasks) AND collect a set of futures (for async scan tasks).  It will then await 
both of those one after the other.  This should avoid the nested dataset issue. 
 I've prototyped this today and it should work but it'll take me a little bit 
of work to polish it which I could do tomorrow.

-Pros: Makes ARROW-7001 a smaller change

-Cons: Potentially delays ARROW-7001 review while this is worked through / Some 
wasted work.

> [C++] Async streaming CSV reader deadlocking when being run synchronously 
> from datasets
> ---
>
> Key: ARROW-12161
> URL: https://issues.apache.org/jira/browse/ARROW-12161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> ARROW-11887 added async to the streaming CSV reader.  In order to keep 
> backwards compatibility the old sync API simply calls the async API and waits 
> for it to finish.  However, that wait cannot happen safely in a "nested" 
> context (e.g. dataset reading).
> For example, imagine two cores.  The dataset read launches two CSV scans.  
> Each scan occupies a core waiting for a future.  Those futures are being 
> filled by I/O threads.  The I/O threads finish and go to transfer.  The 
> transfer cannot happen because the CPU executor is filled.
> This will be fixed as part of ARROW-7001, but that is still some ways away.  An 
> easier change might be to take some of the 7001 changes and include them as 
> part of the 11887 feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12163) [Java] Make compression levels configurable.

2021-03-30 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-12163:
---

 Summary: [Java] Make compression levels configurable.
 Key: ARROW-12163
 URL: https://issues.apache.org/jira/browse/ARROW-12163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Micah Kornfield


Today we use default compression levels in compressors; these should be 
configurable via the constructor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12138) [Go][IPC]

2021-03-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-12138.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9836
[https://github.com/apache/arrow/pull/9836]

> [Go][IPC]
> -
>
> Key: ARROW-12138
> URL: https://issues.apache.org/jira/browse/ARROW-12138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Integration
>Reporter: Matt Topol
>Assignee: Matt Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Update the generated flatbuffer files for the Golang Apache Arrow implementation 
> so that newer IPC features such as compression can be implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12121) [Rust] [Parquet] Arrow writer benchmarks

2021-03-30 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-12121.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9825
[https://github.com/apache/arrow/pull/9825]

> [Rust] [Parquet] Arrow writer benchmarks
> 
>
> Key: ARROW-12121
> URL: https://issues.apache.org/jira/browse/ARROW-12121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The common concern with Parquet's Arrow readers and writers is that they're 
> slow.
> My diagnosis is that we rely on a chain of processes, which introduces 
> overhead.
> For example, writing an Arrow RecordBatch involves the following:
> 1. Iterate through arrays to create def/rep levels
> 2. Extract Parquet primitive values from arrays using these levels
> 3. Write primitive values, validating them in the process (when they already 
> should be validated)
> 4. Split the already materialised values into small batches for Parquet 
> chunks (consider where we have 1e6 values in a batch)
> 5. Write these batches, computing the stats of each batch, and encoding values
> The above is a side-effect of convenience, as it would likely require a 
> lot more effort to bypass some of the steps.
> I have ideas around going from step 1 to 5 directly, but I won't know whether it's 
> better without performance benchmarks. I also struggle to see if I'm 
> making improvements while I clean up the writer code, especially removing the 
> allocations that I created to reduce the complexity of the level calculations.
> With ARROW-12120 (random array & batch generator), it becomes more convenient 
> to benchmark (and test many combinations of) the Arrow writer.
> I would thus like to start adding benchmarks for the Arrow writer.
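
As a rough illustration of the kind of benchmark being proposed above (assumed crate 
APIs: this presumes a parquet crate version whose ArrowWriter accepts any 
std::io::Write sink, and it is not necessarily what was merged in the linked PR), a 
criterion sketch for the Arrow writer might look like:

{code:rust}
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use criterion::{criterion_group, criterion_main, Criterion};
use parquet::arrow::ArrowWriter;

// Generate a single-column batch; ARROW-12120's random generators would make
// it easy to cover many more type/shape combinations.
fn sample_batch(rows: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let values = Int32Array::from((0..rows as i32).collect::<Vec<_>>());
    RecordBatch::try_new(schema, vec![Arc::new(values) as ArrayRef]).unwrap()
}

// Write into an in-memory buffer so the benchmark measures level computation,
// encoding and stats rather than disk I/O.
fn write_batch(batch: &RecordBatch) {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
}

fn bench_arrow_writer(c: &mut Criterion) {
    let batch = sample_batch(1_000_000);
    c.bench_function("arrow_writer_int32_1e6", |b| b.iter(|| write_batch(&batch)));
}

criterion_group!(benches, bench_arrow_writer);
criterion_main!(benches);
{code}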



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12121) [Rust] [Parquet] Arrow writer benchmarks

2021-03-30 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-12121:
--

Assignee: Neville Dipale

> [Rust] [Parquet] Arrow writer benchmarks
> 
>
> Key: ARROW-12121
> URL: https://issues.apache.org/jira/browse/ARROW-12121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The common concern with Parquet's Arrow readers and writers is that they're 
> slow.
> My diagnosis is that we rely on a chain of processes, which introduces 
> overhead.
> For example, writing an Arrow RecordBatch involves the following:
> 1. Iterate through arrays to create def/rep levels
> 2. Extract Parquet primitive values from arrays using these levels
> 3. Write primitive values, validating them in the process (when they already 
> should be validated)
> 4. Split the already materialised values into small batches for Parquet 
> chunks (consider where we have 1e6 values in a batch)
> 5. Write these batches, computing the stats of each batch, and encoding values
> The above is a side-effect of convenience, as it would likely require a 
> lot more effort to bypass some of the steps.
> I have ideas around going from step 1 to 5 directly, but I won't know whether it's 
> better without performance benchmarks. I also struggle to see if I'm 
> making improvements while I clean up the writer code, especially removing the 
> allocations that I created to reduce the complexity of the level calculations.
> With ARROW-12120 (random array & batch generator), it becomes more convenient 
> to benchmark (and test many combinations of) the Arrow writer.
> I would thus like to start adding benchmarks for the Arrow writer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12122) [Python] Cannot install via pip. M1 mac

2021-03-30 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312067#comment-17312067
 ] 

Kouhei Sutou commented on ARROW-12122:
--

Thanks. Do you have any knowledge of how to build an arm64 wheel for macOS on 
GitHub Actions, Azure Pipelines, Drone Cloud, and so on?

> [Python] Cannot install via pip. M1 mac
> ---
>
> Key: ARROW-12122
> URL: https://issues.apache.org/jira/browse/ARROW-12122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bastien Boutonnet
>Priority: Major
>
> when doing {{pip install pyarrow --no-use-pep517}}
> {noformat}
> Collecting pyarrow
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
> Requirement already satisfied: numpy>=1.16.6 in 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/lib/python3.8/site-packages
>  (from pyarrow) (1.20.2)
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (setup.py) ... error
>  ERROR: Command errored out with exit status 1:
>  command: 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/bin/python
>  -u -c 'import sys, setuptools, tokenize; sys.argv[0] = 
> '"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';
>  
> __file__='"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';f=getattr(tokenize,
>  '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', 
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' 
> bdist_wheel -d 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-wheel-vpkwqzyi
>  cwd: 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/
>  Complete output (238 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.macosx-11.2-arm64-3.8
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cffi.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/dataset.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compute.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_ipc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_convert_builtin.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_misc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_gandiva.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/strategies.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_adhoc_memory_leak.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/arrow_7980.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/util.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/t
