[jira] [Updated] (ARROW-17304) [C++][Compute] Improve error messages in aggregate test
[ https://issues.apache.org/jira/browse/ARROW-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17304:
-----------------------------------
    Priority: Minor  (was: Major)

> [C++][Compute] Improve error messages in aggregate test
>
>          Key: ARROW-17304
>          URL: https://issues.apache.org/jira/browse/ARROW-17304
>      Project: Apache Arrow
>   Issue Type: Improvement
>   Components: C++
>     Reporter: Yibo Cai
>     Assignee: Yibo Cai
>     Priority: Minor
>       Labels: pull-request-available
>      Fix For: 10.0.0
>
>   Time Spent: 50m
> Remaining Estimate: 0h
>
> Print actual values when a comparison fails, to help debugging.
> See https://github.com/apache/arrow/issues/12681

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Resolved] (ARROW-17304) [C++][Compute] Improve error messages in aggregate test
[ https://issues.apache.org/jira/browse/ARROW-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-17304.
------------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 13814
[https://github.com/apache/arrow/pull/13814]
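The improvement above is to Arrow's C++ aggregate tests, but the idea translates directly: report the actual and expected values in the failure message instead of a bare pass/fail. A minimal Python sketch; `assert_agg_equal` is an illustrative helper, not Arrow API:

```python
# Sketch of "print actual values when a comparison fails".
# The helper name is made up for illustration; the real change lives in
# Arrow's C++ aggregate test helpers.
def assert_agg_equal(actual, expected, kernel_name):
    if actual != expected:
        # Include both values so a CI failure is debuggable from the log alone
        raise AssertionError(
            f"{kernel_name}: expected {expected!r}, got {actual!r}"
        )

assert_agg_equal(6, 6, "sum")  # equal values: no error raised
```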
[jira] [Resolved] (ARROW-17341) musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-17341.
------------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 13819
[https://github.com/apache/arrow/pull/13819]

> musl does not define _SC_LEVEL1_DCACHE_SIZE etc
>
>              Key: ARROW-17341
>              URL: https://issues.apache.org/jira/browse/ARROW-17341
>          Project: Apache Arrow
>       Issue Type: Bug
>       Components: C++
> Affects Versions: 9.0.0
>         Reporter: Duncan Bellamy
>         Assignee: Yibo Cai
>         Priority: Blocker
>           Labels: pull-request-available
>          Fix For: 10.0.0
>
>       Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Arrow 9.0.0 has new code that uses _SC_LEVEL1_DCACHE_SIZE, introduced in commit
> https://github.com/apache/arrow/commit/cde5a0800624649cd6558f339ded2024146cfd71
>
> There is a fallback function
> https://github.com/apache/arrow/blob/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2/cpp/src/arrow/util/cpu_info.cc#L326
> but it fills the same struct, which relies on those defines as well.
[jira] [Updated] (ARROW-17341) [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17341:
-----------------------------------
    Summary: [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc  (was: musl does not define _SC_LEVEL1_DCACHE_SIZE etc)
[jira] [Updated] (ARROW-17341) [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17341:
-----------------------------------
    Fix Version/s: 9.0.1

> [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
> Fix For: 10.0.0, 9.0.1
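The portability problem can be illustrated with Python's own `os.sysconf`, which wraps the same libc interface: probe for the configuration name instead of assuming the glibc-only constant exists. This is only an analogy to the C++ fix (which guards the uses with preprocessor checks), not the actual patch:

```python
import os

def l1_dcache_size():
    """Return the L1 data-cache size in bytes, or None when the platform
    does not expose _SC_LEVEL1_DCACHE_SIZE (e.g. musl libc).

    Mirrors the shape of the fix: check availability first, then fall
    back gracefully instead of assuming the glibc-only name is defined."""
    name = "SC_LEVEL1_DCACHE_SIZE"
    if name not in getattr(os, "sysconf_names", {}):
        return None  # constant not defined by this libc/platform
    try:
        value = os.sysconf(name)
    except (OSError, ValueError):
        return None
    return value if value > 0 else None

size = l1_dcache_size()  # an int on glibc Linux, None elsewhere
```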
[jira] [Commented] (ARROW-14126) [C++] Add locale support for relevant string compute functions
[ https://issues.apache.org/jira/browse/ARROW-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577295#comment-17577295 ]

Eduardo Ponce commented on ARROW-14126:
---------------------------------------
I agree that a more general approach would be a better solution. Closing this issue as it is not relevant in its current form.

> [C++] Add locale support for relevant string compute functions
>
>        Key: ARROW-14126
>        URL: https://issues.apache.org/jira/browse/ARROW-14126
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Eduardo Ponce
>   Priority: Major
>     Labels: kernel
>
> String compute functions do not make use of locale settings for case-changing transformations, string comparisons, and number-to-string casting. Arrow does provide UTF-8 string kernels to handle localization standardization. It would be good to add locale support for the string kernels that are affected by it.
> The following are string functions that take a `locale` option as their second argument:
> * str_to_lower
> * str_to_upper
> * str_to_title
[jira] [Resolved] (ARROW-14126) [C++] Add locale support for relevant string compute functions
[ https://issues.apache.org/jira/browse/ARROW-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eduardo Ponce resolved ARROW-14126.
-----------------------------------
    Resolution: Won't Do
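For context on why locale-aware rather than ASCII-only case kernels were requested: full Unicode case mapping can change a string's length, and some mappings are locale-dependent. A quick stdlib-only illustration:

```python
# Unicode case mapping can change string length: the German sharp s
# expands to two characters, which an ASCII-only kernel would miss.
print("straße".upper())   # STRASSE
print("garçon".upper())   # GARÇON (non-ASCII letter still mapped)

# Some mappings are locale-dependent and cannot be expressed without a
# locale input: under Turkish rules 'i' upper-cases to 'İ' (dotted
# capital I), not 'I', which str.upper() has no way to know.
```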
[jira] [Resolved] (ARROW-13570) [C++][Compute] Additional scalar ASCII kernels can reuse original offsets buffer
[ https://issues.apache.org/jira/browse/ARROW-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eduardo Ponce resolved ARROW-13570.
-----------------------------------
    Resolution: Duplicate

> [C++][Compute] Additional scalar ASCII kernels can reuse original offsets buffer
>
>        Key: ARROW-13570
>        URL: https://issues.apache.org/jira/browse/ARROW-13570
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Eduardo Ponce
>   Priority: Major
>
> Some scalar ASCII string kernels are able to reuse the original offsets buffer, so offsets are not preallocated in the output (these kernels use *MemAllocation::NO_PREALLOCATE* during registration). Currently, only kernels that apply a transformation to each character independently via [StringDataTransform|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L590-L631] support the no-preallocation policy. But there are additional string kernels that do not modify the length (nor the offsets) of the input string, yet apply scalar transforms that depend on neighboring characters.
> This issue should extend/create *StringDataTransform* to take multiple input transforms in order to support the *MemAllocation::NO_PREALLOCATE* policy for additional scalar ASCII kernels (e.g., _ascii_title_).
[jira] [Created] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
Oliver Klein created ARROW-17352:
------------------------------------

             Summary: Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
                 Key: ARROW-17352
                 URL: https://issues.apache.org/jira/browse/ARROW-17352
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 9.0.0
         Environment: Windows10
            Reporter: Oliver Klein
         Attachments: arrow9error.PNG

Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0. It worked when stored with version 8 and earlier.

Windows Parquet Viewer: 2.3.5 and 2.3.6
pyarrow version: 9.0.0

Error: System.AggregateException: One or more errors occured. ---> Parquet.ParquetException: encoding RLE_DICTIONARY is not supported.
at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in DataColumnReader.cs: line 259
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577308#comment-17577308 ]

Joris Van den Bossche commented on ARROW-17335:
-----------------------------------------------
AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in https://github.com/apache/arrow/pull/6676 as well), so I think that basically means we need to use the stub file approach? (but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)

> [Python] Type checking support
>
>               Key: ARROW-17335
>               URL: https://issues.apache.org/jira/browse/ARROW-17335
>           Project: Apache Arrow
>        Issue Type: New Feature
>        Components: Python
>          Reporter: Jorrick Sleijster
>          Priority: Major
> Original Estimate: 10h
> Remaining Estimate: 10h
>
> h1. mypy and static type checking
> As of Python 3.6, it has been possible to introduce typing information in the code. This became immensely popular in a short period of time. Shortly after, the tool `mypy` arrived, and it has become the industry standard for static type checking inside Python. It can check for invalid types very quickly, which makes it possible to run as a pre-commit. It has raised many bugs that I did not see myself and has been a very valuable tool.
> h2. Now what does this mean for PyArrow?
> When we run mypy on code that uses PyArrow, we get error messages like the following:
> ```
> some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
> some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
> some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing "pyarrow.fs": module is installed, but missing library stubs or py.typed marker
> ```
> More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker
> h2. You can solve this in three ways:
> # Ignore the message. This, however, turns all types from PyArrow into `Any`, making it impossible to catch user errors against the PyArrow library.
> # Create a Python stub file. This used to be the standard, but it is no longer a popular option, because stubs are extra files next to the source code, while you can instead inline the type hints with the code, which brings me to our third option.
> # Add a `py.typed` marker file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), keeps all the type hints with the code (as the documentation does now), and provides not only your users but also the developers of the library themselves with type hints (and hinting of issues inside your IDE).
>
> My personal opinion already shines through: it is option 3, which has quickly become the industry standard since its introduction.
> h2. What should we do?
> I'd very much like to work on this; however, I don't feel like wasting time. Therefore, I am raising this ticket to see if this has been considered before or if we just didn't get to it yet.
> I'd like to open the discussion here:
> # Do you agree with option 3 for the type hints?
> # Should we remove the documentation annotations for the type hints, given they will be inside the functions? Or should we keep them and also specify them in the code, which would duplicate the information?
[jira] [Updated] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Klein updated ARROW-17352:
---------------------------------
    Description: updated with the findings below.

After further checking I found that the problem seems to relate to a change of the default Parquet format version. When I use pyarrow 9 and set the version to 1.0, it works again from the Windows tool; when it is 2.4, it does not work (or is not supported by the Windows tool).

df.to_parquet(r'C:\temp\test_10.parquet', version='1.0')
df.to_parquet(r'C:\temp\test_24.parquet', version='2.4')

The question might be whether such a default change is a bug or a feature.

> Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
> Priority: Critical
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577314#comment-17577314 ]

Joris Van den Bossche commented on ARROW-17335:
-----------------------------------------------
{quote}Well it's not really a duplicate of ARROW-8175. The difference lies in the fact that that ticket is focused on performing type checking on the PyArrow code base and ensuring all the types are valid inside the library. My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow, by using type annotations on functions specified inside the PyArrow codebase.{quote}

It's indeed not exactly the same. But _in practice_, I think both aspects are very much related and we could (should?) do them at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good that at the same time we also _check_ that the type annotations we are adding are correct. (Although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose, as it doesn't guarantee the type annotations are actually correct; it only might find some incorrect ones.)
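Option 3 from the ticket (inline hints plus a `py.typed` marker) can be sketched with a tiny example; `ChunkedValues` and `count_valid` are made-up illustrations, not pyarrow API:

```python
# Sketch of option 3: inline type hints shipped alongside an (empty)
# py.typed marker file in the package, so mypy reads the hints straight
# from the installed code. All names below are illustrative only.
from typing import List, Optional

ChunkedValues = List[Optional[int]]

def count_valid(values: ChunkedValues) -> int:
    """Count non-null entries; callers are checked against these hints."""
    return sum(1 for v in values if v is not None)

print(count_valid([1, None, 3]))  # 2
```

With this in place, mypy flags a call like `count_valid("abc")` at check time rather than letting it misbehave at runtime.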
[jira] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943 ]

Nicola Crane deleted comment on ARROW-15943:
--------------------------------------------
was (Author: thisisnic): There is more user interest in implementing this feature: https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r

> [C++] Filter which files to be read in as part of filesystem, filtered using a string
>
>        Key: ARROW-15943
>        URL: https://issues.apache.org/jira/browse/ARROW-15943
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Nicola Crane
>   Priority: Major
>
> There is a report from a user (see this Stack Overflow post [1]) who has used the {{basename_template}} parameter to write files to a dataset, some of which have the prefix {{"summary"}} and others of which have the prefix {{"prediction"}}. This data is saved in partitioned directories. They want to be able to read the data back in such that, as well as the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read back in.
> This isn't currently possible; if they try to open a dataset with a list of files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.
>
> [1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577321#comment-17577321 ]

Nicola Crane commented on ARROW-15943:
--------------------------------------
There is more user interest in implementing this feature: https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
[jira] [Updated] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Klein updated ARROW-17352:
---------------------------------
    Description: updated with the findings below.

Finally found:
* ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280)

So probably it's a feature, and we need to adapt our code.
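The workaround from the report can be sketched with the pyarrow API directly: pinning `version="1.0"` at write time keeps the file readable by tools without RLE_DICTIONARY support. A hedged sketch (guarded import, in case pyarrow is not installed where this runs; `write_compat` is an illustrative name):

```python
# Pin the Parquet format version at write time so older readers that lack
# RLE_DICTIONARY support can still open the file. Sketch only.
try:
    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_compat(table, path):
        # version="1.0" restores the pre-9.0.0 default format version,
        # avoiding encodings that e.g. older .NET readers cannot decode
        pq.write_table(table, path, version="1.0")
except ImportError:
    pa = None  # pyarrow unavailable here: nothing to demonstrate
```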
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577326#comment-17577326 ]

Nicola Crane commented on ARROW-15943:
--------------------------------------
There is more user interest in implementing this feature: https://stackoverflow.com/questions/73283669/r-arrowread-datasetmyfolder-only-read-some-folders-files
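Until something like this lands natively, the filtering can be done on the user side before handing a file list to the dataset API. A stdlib-only sketch (file names are illustrative, matching the summary-/prediction- layout from the report):

```python
# User-side sketch of the requested behaviour: narrow an explicit file
# list by a glob on the basename before opening it as a dataset.
import fnmatch
import posixpath

files = [
    "year=2021/summary-part-0.parquet",
    "year=2021/prediction-part-0.parquet",
    "year=2022/summary-part-0.parquet",
]

def select(paths, pattern):
    return [p for p in paths if fnmatch.fnmatch(posixpath.basename(p), pattern)]

print(select(files, "summary*"))
# ['year=2021/summary-part-0.parquet', 'year=2022/summary-part-0.parquet']
```

The downside remains exactly the gap the ticket describes: opening a dataset from an explicit file list loses partition inference.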
[jira] [Commented] (ARROW-17320) [Python] Refine pyarrow.parquet API exposure
[ https://issues.apache.org/jira/browse/ARROW-17320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577330#comment-17577330 ]

Joris Van den Bossche commented on ARROW-17320:
-----------------------------------------------
On second thought, I think it is maybe fine to just "break" this in a next release by explicitly defining the {{__all__}}. If someone runs into it, it is easy to fix in your code. But starting to add deprecation warnings for those sounds quite onerous for the value it would provide. (We only need to remember to add {{_filters_to_expression}} to the list, since that is used by other projects.)

> [Python] Refine pyarrow.parquet API exposure
>
>        Key: ARROW-17320
>        URL: https://issues.apache.org/jira/browse/ARROW-17320
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
>   Reporter: Miles Granger
>   Priority: Major
>
> Spawning from ARROW-17106: moving code from `pyarrow/parquet/__init__` to `pyarrow/parquet/core` and re-exporting it in `__init__` to maintain the same functionality.
> [pyarrow/__init__.py|https://github.com/apache/arrow/blob/master/python/pyarrow/__init__.py] is very careful about what is exposed through the public API by prefixing private symbols with underscores, even imports.
> What's exposed at the top level of {{pyarrow.parquet}}, however, is not so careful. API calls such as {{pq.FileSystem}}, {{pq.pa.Array}}, and {{pq.json}} are all valid, and these should probably be designated as private attributes in {{pyarrow.parquet}}.
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577332#comment-17577332 ] H. Vetinari commented on ARROW-17110: - Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. 
> Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577332#comment-17577332 ] H. Vetinari edited comment on ARROW-17110 at 8/9/22 10:10 AM: -- Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. was (Author: h-vetinari): Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. 
This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577335#comment-17577335 ] Antoine Pitrou commented on ARROW-17110: [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the _whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. 
> Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577335#comment-17577335 ] Antoine Pitrou edited comment on ARROW-17110 at 8/9/22 10:18 AM: - [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the _whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. was (Author: pitrou): [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the_ whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). 
For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577342#comment-17577342 ] H. Vetinari commented on ARROW-17110: - Sure, I was only commenting from the POV of abseil; I was not aware how deeply enmeshed (or not) this is with the rest of arrow. If you can move the parts depending on abseil to C++14 separately (and presumably not build them for various older runtimes), then there's less urgency. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17252) [R] Intermittent valgrind failure
[ https://issues.apache.org/jira/browse/ARROW-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-17252: --- Assignee: Dewey Dunnington > [R] Intermittent valgrind failure > - > > Key: ARROW-17252 > URL: https://issues.apache.org/jira/browse/ARROW-17252 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Time Spent: 12.5h > Remaining Estimate: 0h > > A number of recent nightly builds have intermittent failures with valgrind, > which fails because of possibly leaked memory around an exec plan. This seems > related to a change in XXX that separated {{ExecPlan_prepare()}} from > {{ExecPlan_run()}} and added a {{ExecPlan_read_table()}} that uses > {{RunWithCapturedR()}}. The reported leaks vary but include ExecPlans and > ExecNodes and fields of those objects. > A failed run: > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30310&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=24980 > Some example output: > {noformat} > ==5249== 14,112 (384 direct, 13,728 indirect) bytes in 1 blocks are > definitely lost in loss record 1,988 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10B2902B: > std::_Function_handler > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&), > arrow::compute::internal::RegisterAggregateNode(arrow::compute::ExecFactoryRegistry*)::{lambda(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)#1}>::_M_invoke(std::_Any_data const&, arrow::compute::ExecPlan*&&, > std::vector std::allocator >&&, > arrow::compute::ExecNodeOptions const&) (exec_plan.h:60) > ==5249==by 0xFA83A0C: > std::function > (arrow::compute::ExecPlan*, std::vector 
std::allocator >, arrow::compute::ExecNodeOptions > const&)>::operator()(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&) const (std_function.h:622) > ==5249== 14,528 (160 direct, 14,368 indirect) bytes in 1 blocks are > definitely lost in loss record 1,989 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10096CB7: arrow::FutureImpl::Make() (future.cc:187) > ==5249==by 0xFCB6F9A: arrow::Future::Make() > (future.h:420) > ==5249==by 0x101AE927: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AE927: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249== > ==5249== 36,322 (416 direct, 35,906 indirect) bytes in 1 blocks are > definitely lost in loss record 2,929 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10214F92: arrow::compute::TaskScheduler::Make() > (task_util.cc:421) > ==5249==by 0x101AEA6C: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AEA6C: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: 
R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > {noformat} > We also occasionally get leaked Schemas, and in one case a leaked InputType > that seemed completely unrelated to the other leaks (ARROW-17225). > I'm wondering if these have to do with references in lambdas that get passed > by reference? Or perhaps a cache issue? There were some instances in previous > leaks where the backtrace to the {{new}} allocator was different between > reported leaks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17252) [R] Intermittent valgrind failure
[ https://issues.apache.org/jira/browse/ARROW-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-17252. - Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13773 [https://github.com/apache/arrow/pull/13773] > [R] Intermittent valgrind failure > - > > Key: ARROW-17252 > URL: https://issues.apache.org/jira/browse/ARROW-17252 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 12h 40m > Remaining Estimate: 0h > > A number of recent nightly builds have intermittent failures with valgrind, > which fails because of possibly leaked memory around an exec plan. This seems > related to a change in XXX that separated {{ExecPlan_prepare()}} from > {{ExecPlan_run()}} and added a {{ExecPlan_read_table()}} that uses > {{RunWithCapturedR()}}. The reported leaks vary but include ExecPlans and > ExecNodes and fields of those objects. 
> A failed run: > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30310&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=24980 > Some example output: > {noformat} > ==5249== 14,112 (384 direct, 13,728 indirect) bytes in 1 blocks are > definitely lost in loss record 1,988 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10B2902B: > std::_Function_handler > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&), > arrow::compute::internal::RegisterAggregateNode(arrow::compute::ExecFactoryRegistry*)::{lambda(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)#1}>::_M_invoke(std::_Any_data const&, arrow::compute::ExecPlan*&&, > std::vector std::allocator >&&, > arrow::compute::ExecNodeOptions const&) (exec_plan.h:60) > ==5249==by 0xFA83A0C: > std::function > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)>::operator()(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&) const (std_function.h:622) > ==5249== 14,528 (160 direct, 14,368 indirect) bytes in 1 blocks are > definitely lost in loss record 1,989 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10096CB7: arrow::FutureImpl::Make() (future.cc:187) > ==5249==by 0xFCB6F9A: arrow::Future::Make() > (future.h:420) > ==5249==by 0x101AE927: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AE927: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 
0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249== > ==5249== 36,322 (416 direct, 35,906 indirect) bytes in 1 blocks are > definitely lost in loss record 2,929 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10214F92: arrow::compute::TaskScheduler::Make() > (task_util.cc:421) > ==5249==by 0x101AEA6C: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AEA6C: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > {noformat} > We also occasionally get leaked Schemas, and in one case a leaked InputType > that seemed completely unrelated to the other leaks (ARROW-17225). > I'm wondering if these have to do with references in lambdas that get passed > by reference? Or perhaps a cache issue? There were some instances in previous > leaks where the backtrace to the
{{new}} allocator was different between reported leaks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-13763) [Python] Files opened for read with pyarrow.parquet are not explicitly closed
[ https://issues.apache.org/jira/browse/ARROW-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Granger reassigned ARROW-13763: - Assignee: Miles Granger (was: Alessandro Molina) > [Python] Files opened for read with pyarrow.parquet are not explicitly closed > - > > Key: ARROW-13763 > URL: https://issues.apache.org/jira/browse/ARROW-13763 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 5.0.0 > Environment: fsspec 2021.4.0 >Reporter: Richard Kimoto >Assignee: Miles Granger >Priority: Major > Fix For: 10.0.0 > > Attachments: test.py > > > It appears that files opened for read using pyarrow.parquet.read_table (and > therefore pyarrow.parquet.ParquetDataset) are not explicitly closed. > This seems to be the case for both use_legacy_dataset=True and False. The > files don't remain open at the OS level (verified using lsof). They do, > however, seem to rely on the Python gc to close. > My use case is that I'd like to use a custom fsspec filesystem that > interfaces to an S3-like API. It handles the remote download of the parquet > file and passes to pyarrow a handle of a temporary file downloaded locally. > It is then looking for an explicit close() or __exit__() to then clean up the > temp file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
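The contract the reporter is asking for can be sketched with an ordinary context manager: the wrapper materializes a local temporary copy and deletes it on an explicit `close()`/`__exit__()` instead of waiting for the garbage collector. The `TempCopyFile` class below is a hypothetical stand-in for such a custom fsspec filesystem, not pyarrow's actual behavior:

```python
import os
import tempfile

class TempCopyFile:
    """Hypothetical file wrapper: materializes remote bytes into a local
    temp file and removes that file on an *explicit* close()/__exit__(),
    instead of relying on garbage collection."""

    def __init__(self, remote_bytes):
        # Simulate the "remote download" step: write the bytes locally.
        fd, self.local_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(remote_bytes)
        self._handle = open(self.local_path, "rb")

    def read(self):
        return self._handle.read()

    def close(self):
        # Deterministic cleanup: close the handle, then delete the copy.
        if self._handle is not None:
            self._handle.close()
            self._handle = None
            os.remove(self.local_path)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Used as `with TempCopyFile(data) as f: f.read()`, the temporary copy is guaranteed to be gone when the block exits — which is exactly the hook a reader would need to call for this use case to work.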
[jira] [Updated] (ARROW-17332) [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17332: - Summary: [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow (was: [R package] error parsing folder path with accent ('c:/Público') in read_csv_arrow) > [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow > -- > > Key: ARROW-17332 > URL: https://issues.apache.org/jira/browse/ARROW-17332 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > I am a user trying the R arrow package on a Windows machine. > To reproduce, create a folder name containing a character with Latin accents > ``` > library(arrow) > p <- 'c:/Público' > b <- read_csv_arrow(p) > Error: IOError: Failed to open local file 'c:/Público'. Detail: [Windows > error 5] Access is denied. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17332) [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17332: - Component/s: R > [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow > -- > > Key: ARROW-17332 > URL: https://issues.apache.org/jira/browse/ARROW-17332 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am a user trying the R arrow package on a Windows machine. > To reproduce, create a folder name containing a character with Latin accents > ``` > library(arrow) > p <- 'c:/Público' > b <- read_csv_arrow(p) > Error: IOError: Failed to open local file 'c:/Público'. Detail: [Windows > error 5] Access is denied. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17331) [R] path with accent (ex: c:/Público)
[ https://issues.apache.org/jira/browse/ARROW-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17331. Resolution: Duplicate > [R] path with accent (ex: c:/Público) > > > Key: ARROW-17331 > URL: https://issues.apache.org/jira/browse/ARROW-17331 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13763) [Python] Files opened for read with pyarrow.parquet are not explicitly closed
[ https://issues.apache.org/jira/browse/ARROW-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13763: --- Labels: pull-request-available (was: ) > [Python] Files opened for read with pyarrow.parquet are not explicitly closed > - > > Key: ARROW-13763 > URL: https://issues.apache.org/jira/browse/ARROW-13763 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 5.0.0 > Environment: fsspec 2021.4.0 >Reporter: Richard Kimoto >Assignee: Miles Granger >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Attachments: test.py > > Time Spent: 10m > Remaining Estimate: 0h > > It appears that files opened for read using pyarrow.parquet.read_table (and > therefore pyarrow.parquet.ParquetDataset) are not explicitly closed. > This seems to be the case for both use_legacy_dataset=True and False. The > files don't remain open at the OS level (verified using lsof). They do, > however, seem to rely on the Python gc to close. > My use case is that I'd like to use a custom fsspec filesystem that > interfaces to an S3-like API. It handles the remote download of the parquet > file and passes to pyarrow a handle of a temporary file downloaded locally. > It is then looking for an explicit close() or __exit__() to then clean up the > temp file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17312) [R] R session aborts when using dplyr::filter after setting as_data_frame = FALSE in arrow::read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17312. Resolution: Not A Bug > [R] R session aborts when using dplyr::filter after setting as_data_frame = > FALSE in arrow::read_csv_arrow > -- > > Key: ARROW-17312 > URL: https://issues.apache.org/jira/browse/ARROW-17312 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 >Reporter: Neil Currie >Priority: Major > > R session aborts / encounters fatal error when using dplyr::filter after > setting as_data_frame = FALSE in arrow::read_csv_arrow. > > Version info: > platform [1] "x86_64-apple-darwin17.0" > R version 4.2.0 (2022-04-22) > Running on MacBook Air, macOS Monterey v12.4, Apple M1 chip > > Reproducible example: > > {{if (!require(pacman)) install.packages("pacman")}} > {{library(pacman)}} > {{p_load(arrow, dplyr, readr)}} > > {{write_csv(starwars, "starwars.csv")}} > > {{read_csv_arrow("starwars.csv", as_data_frame = FALSE) |> filter(height >= 150)}} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17279) [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...):
[ https://issues.apache.org/jira/browse/ARROW-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17279: - Summary: [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...): (was: Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...):) > [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, > as.logical(local), as.logical(now), ...): > > > Key: ARROW-17279 > URL: https://issues.apache.org/jira/browse/ARROW-17279 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.1 > Environment: R version 4.1.0 (2021-05-18) >Reporter: Cristian Adir Cardona Merchan >Priority: Major > Labels: R, arrow, install, library > Fix For: 8.0.1 > > Attachments: rstudio_FYbWALFFSG.jpg > > > I am running R version 4.1.0 (2021-05-18). After installing the arrow > package, I get the error below when I load it; I have also installed > Rtools 4.0 matching the R version. > Error message: > > {{> library(arrow)}} > Error: package or namespace load failed for ‘arrow’ in inDL(x, > as.logical(local), as.logical(now), ...): unable to load shared object > 'C:/Users/el_ki/Documents/R/win-library/4.1/arrow/libs/x64/arrow.dll': > LoadLibrary failure: Error en una rutina de inicialización de biblioteca de > vínculos dinámicos (DLL). [English: A dynamic link library (DLL) initialization routine failed.] 
> {quote}sessionInfo() R version 4.1.0 (2021-05-18) Platform: > x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044) > {quote} > Matrix products: default > locale: [1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 > LC_MONETARY=Spanish_Colombia.1252 [4] LC_NUMERIC=C > LC_TIME=Spanish_Colombia.1252 > attached base packages: [1] stats graphics grDevices utils datasets methods > base > loaded via a namespace (and not attached): [1] tidyselect_1.1.2 bit_4.0.4 > compiler_4.1.0 magrittr_2.0.3 assertthat_0.2.1 R6_2.5.1 > [7] cli_3.3.0 tools_4.1.0 glue_1.6.2 rstudioapi_0.13 bit64_4.0.5 vctrs_0.4.1 > [13] data.table_1.14.2 rlang_1.0.4 purrr_0.3.4 > Thank you very much for your time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17208) [R] Removing files after reading them in R
[ https://issues.apache.org/jira/browse/ARROW-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17208: - Summary: [R] Removing files after reading them in R (was: Removing files after reading them in R) > [R] Removing files after reading them in R > -- > > Key: ARROW-17208 > URL: https://issues.apache.org/jira/browse/ARROW-17208 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.1 >Reporter: Wytze Gelderloos >Priority: Minor > > In R it's not possible to delete the files even though the dataframe is > cleared from the R environment. > > write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) > df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", > delimiter = ",")) > df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() > ## Do some checks on df. > rm(df) > file.remove("mtcars.csv") > The `file.remove` leads to a Permission Denied error even though the dataframe > is cleared from the R environment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17208) [R] Removing files after reading them in R
[ https://issues.apache.org/jira/browse/ARROW-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17208: - Description: In R it's not possible to delete the files eventhough the dataframe is cleared from the R environment. {code:r} write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", delimiter = ",")) df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() ## Do some checks on df. rm(df) file.remove("mtcars.csv") {code} The `file.remove` leads to a Permission Denied error eventhough the dataframe is cleared from the R environment. was: In R it's not possible to delete the files eventhough the dataframe is cleared from the R environment. write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", delimiter = ",")) df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() ## Do some checks on df. rm(df) file.remove("mtcars.csv") The `file.remove` leads to a Permission Denied error eventhough the dataframe is cleared from the R environment. > [R] Removing files after reading them in R > -- > > Key: ARROW-17208 > URL: https://issues.apache.org/jira/browse/ARROW-17208 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.1 >Reporter: Wytze Gelderloos >Priority: Minor > > In R it's not possible to delete the files eventhough the dataframe is > cleared from the R environment. > {code:r} > write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) > df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", > delimiter = ",")) > df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() > ## Do some checks on df. 
> rm(df) > file.remove("mtcars.csv") > {code} > The `file.remove` leads to a Permission Denied error even though the dataframe > is cleared from the R environment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
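The Permission Denied error is consistent with the dataset still holding an open handle on the CSV file when `file.remove()` runs. A minimal stdlib-only Python sketch of the same mechanism (the filename is illustrative; on Windows, deleting a file that another object still has open fails with a permission error, while POSIX systems allow it — releasing the handle first works everywhere):

```python
import os
import tempfile

# Create a file and keep a handle open, mimicking a dataset object that
# has not yet released the underlying file.
path = os.path.join(tempfile.mkdtemp(), "mtcars.csv")
with open(path, "w") as f:
    f.write("mpg,disp,drat,wt\n21,160,3.9,2.62\n")

handle = open(path)          # a reader still holds the file open
# On Windows, os.remove(path) at this point would raise PermissionError.
# Dropping the reader first releases the handle on every platform:
handle.close()
os.remove(path)
print(os.path.exists(path))  # False once the handle is closed
```

In the R report, `rm(df)` removes the binding but does not necessarily free the underlying dataset/duckdb objects immediately, so the handle can outlive the variable.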
[jira] [Created] (ARROW-17353) [Release] R libarrow binaries have the wrong version number
Jacob Wujciak-Jens created ARROW-17353: -- Summary: [Release] R libarrow binaries have the wrong version number Key: ARROW-17353 URL: https://issues.apache.org/jira/browse/ARROW-17353 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Affects Versions: 9.0.0 Reporter: Jacob Wujciak-Jens Fix For: 10.0.0 The libarrow binaries that are uploaded during the release process have the wrong version number. This is an issue with the submit binaries script/r-binary-packages job. The arrow version should be picked up by the job even if not passed explicitly as a custom param. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17113) [Java] All static initializers should catch and report exceptions
[ https://issues.apache.org/jira/browse/ARROW-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17113. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13678 [https://github.com/apache/arrow/pull/13678] > [Java] All static initializers should catch and report exceptions > - > > Key: ARROW-17113 > URL: https://issues.apache.org/jira/browse/ARROW-17113 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: good-first-issue, good-second-issue, > pull-request-available > Fix For: 10.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > As reported on the mailing list: > https://lists.apache.org/thread/gysn25gsm4v1fvvx9l0sjyr627xy7q65 > All static initializers should catch and report exceptions, or else they will > get swallowed by the JVM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17354) [C++] ARROW_ONLY_LINT should not require xsimd
Antoine Pitrou created ARROW-17354: -- Summary: [C++] ARROW_ONLY_LINT should not require xsimd Key: ARROW-17354 URL: https://issues.apache.org/jira/browse/ARROW-17354 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou When I try to configure Arrow with {{-DARROW_ONLY_LINT=on}} passed to CMake, I still get the following error: {code:java} CMake Error at cmake_modules/ThirdpartyToolchain.cmake:267 (find_package): By not providing "Findxsimd.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "xsimd", but CMake did not find one. Could not find a package configuration file provided by "xsimd" (requested version 8.1.0) with any of the following names: xsimdConfig.cmake xsimd-config.cmake Add the installation prefix of "xsimd" to CMAKE_PREFIX_PATH or set "xsimd_DIR" to a directory containing one of the above files. If "xsimd" provides a separate development package or SDK, be sure it has been installed. Call Stack (most recent call first): cmake_modules/ThirdpartyToolchain.cmake:2245 (resolve_dependency) CMakeLists.txt:575 (include) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
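The error shows {{resolve_dependency}} being reached even though only lint targets were requested. One possible shape of the fix, sketched here purely as an assumption (the {{ARROW_ONLY_LINT}} option and the {{resolve_dependency}} helper appear in the error output above, but this guard is hypothetical, not the actual patch):

```cmake
# Sketch for cmake_modules/ThirdpartyToolchain.cmake: skip resolving xsimd
# (and other build-only dependencies) when only lint targets are configured.
# Wiring ARROW_ONLY_LINT and resolve_dependency() together like this is a
# hypothetical illustration; real arguments to resolve_dependency are elided.
if(NOT ARROW_ONLY_LINT)
  resolve_dependency(xsimd)
endif()
```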
[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests
[ https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577386#comment-17577386 ] Ben Kietzman commented on ARROW-17093: -- ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well > [C++][CI] Enable libSegFault for C++ tests > -- > > Key: ARROW-17093 > URL: https://issues.apache.org/jira/browse/ARROW-17093 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: David Li >Priority: Major > > Adding libSegFault.so could make it easier to diagnose CI failures. It will > print a backtrace on segfault. 
> {noformat} > env SEGFAULT_SIGNALS=all \ > LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so > {noformat} > This will give a backtrace like this on segfault: > {noformat} > Backtrace: > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b] > /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859] > /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e] > /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc] > /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83] > /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67] > {noformat} > Caveats: > * The path is OS-specific > * We could integrate it into the build tooling instead of doing it via env > var > * Are there easily accessible equivalents for MacOS and Windows we could use? -- This message was sent by Atlassian Jira (v8.20.10#820010)
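Each frame printed by libSegFault follows the pattern `object(symbol+0xoffset)[0xaddress]`, with the symbol omitted when none is available. A hedged sketch of parsing these lines for CI post-processing (for example, to feed each mangled symbol to `c++filt`); the regex is an assumption derived from the sample output above, not part of glibc:

```python
import re

# libSegFault frames look like: /path/to/obj(symbol+0xoff)[0xaddr]
# The "symbol+0xoff" part may be just "+0xoff" for unexported functions.
FRAME_RE = re.compile(
    r"^(?P<obj>[^(]+)"                 # object file path
    r"\((?P<sym>[^+)]*)"               # mangled symbol, possibly empty
    r"(?:\+(?P<off>0x[0-9a-f]+))?\)"   # offset into the symbol
    r"\[(?P<addr>0x[0-9a-f]+)\]$"      # absolute address
)

def parse_frame(line):
    """Return {'obj', 'sym', 'off', 'addr'} for one backtrace line, or None."""
    m = FRAME_RE.match(line.strip())
    return m.groupdict() if m else None

frame = parse_frame(
    "/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]")
print(frame["sym"], frame["off"])  # gsignal 0xcb
```

From there, the `sym` field of each frame could be demangled with `c++filt` or resolved to a source line with `addr2line`, both standard binutils tools.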
[jira] [Comment Edited] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests
[ https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577386#comment-17577386 ] Ben Kietzman edited comment on ARROW-17093 at 8/9/22 12:26 PM: --- ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well was (Author: bkietz): ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. 
However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get a less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well > [C++][CI] Enable libSegFault for C++ tests > -- > > Key: ARROW-17093 > URL: https://issues.apache.org/jira/browse/ARROW-17093 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: David Li >Priority: Major > > Adding libSegFault.so could make it easier to diagnose CI failures. It will > print a backtrace on segfault. > {noformat} > env SEGFAULT_SIGNALS=all \ > LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so > {noformat} > This will give a backtrace like this on segfault: > {noformat} > Backtrace: > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b] > /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859] > /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e] > /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc] > /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83] > /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde] > 
/tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67] > {noformat} > Caveats: > * The path is OS-specific > * We could integrate it into the build tooling instead of doing it via env > var > * Are there easily accessible equivalents for MacOS and Windows we could use? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster commented on ARROW-17335: --- Agreeing with you Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to treat them separately for now and start off with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? 
> When we run mypy on code that uses PyArrow, you will get error messages as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it is no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options: it is 3, as this has > quickly become the industry standard since its introduction. > h2. What should we do? > I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 for type hints? 
> # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
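For option 2 above, a stub is an ordinary `.pyi` file that mirrors the public API with annotations but no bodies. A minimal sketch of what such a stub looks like (every module name and signature below is a hypothetical placeholder, not the real pyarrow surface); since stub files use regular Python syntax, the standard `ast` parser accepts them:

```python
import ast

# Hypothetical contents of a hand-written stub file, e.g. mylib/__init__.pyi.
# All names and signatures here are illustrative only.
STUB_SOURCE = """\
from typing import Any

def open_dataset(source: str, format: str = ...) -> Table: ...

class Table:
    num_rows: int
    def to_pydict(self) -> dict[str, list[Any]]: ...
"""

# Stub files are plain Python syntax with `...` bodies, so they parse cleanly.
tree = ast.parse(STUB_SOURCE)
print(len(tree.body))  # 3 top-level statements: import, function, class
```

Option 3 (inline hints plus a `py.typed` marker file) needs no stub at all, which is why the comment above weighs it against what Cython can currently emit.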
[jira] [Comment Edited] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster edited comment on ARROW-17335 at 8/9/22 12:32 PM: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for now and start of with stub generation, which can then later be replaced by a better implementation once available. was (Author: JIRAUSER294085): I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. 
It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? 
> I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster edited comment on ARROW-17335 at 8/9/22 12:32 PM: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. was (Author: JIRAUSER294085): Agreeing with you Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. 
It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? 
> I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience
Nicola Crane created ARROW-17355: Summary: [R] Refactor the handle_* utility functions for a better dev experience Key: ARROW-17355 URL: https://issues.apache.org/jira/browse/ARROW-17355 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In ARROW-15260, the utility functions for handling different kinds of reading errors (handle_parquet_io_error, handle_csv_read_error, and handle_augmented_field_misuse) were refactored so that multiple ones could be chained together. An issue with this is that other errors may be swallowed if they're used without any errors that they don't capture being raised manually afterwards. We should update the code to prevent this from being possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17356) [R] Update binding for add_filename() NSE function to error if used on Table
Nicola Crane created ARROW-17356: Summary: [R] Update binding for add_filename() NSE function to error if used on Table Key: ARROW-17356 URL: https://issues.apache.org/jira/browse/ARROW-17356 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-15260 adds a function which allows the user to add the filename as an output field. This function only makes sense to use with datasets and not tables. Currently, the error generated from using it with a table is handled by {{handle_augmented_field_misuse()}}. Instead, we should follow [one of the suggestions from the PR|https://github.com/apache/arrow/pull/12826#issuecomment-1192007298] to detect this when the function is called. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17349) [C++] Support casting field names of list and map when nested
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17349: - Labels: good-first-issue kernel (was: ) > [C++] Support casting field names of list and map when nested > - > > Key: ARROW-17349 > URL: https://issues.apache.org/jira/browse/ARROW-17349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Priority: Major > Labels: good-first-issue, kernel > Fix For: 10.0.0 > > > Different parquet implementations use different field names for internal > fields of ListType and MapType, which can sometimes cause silly conflicts. > For example, we use {{item}} as the field name for list, but Spark uses > {{element}}. Fortunately, we can automatically cast between List and Map > Types with different field names. Unfortunately, it only works at the top > level. We should get it to work at arbitrary levels of nesting. > This was discovered in delta-rs: > https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285 > Here's a reproduction in Python: > {code:Python} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > def roundtrip_scanner(in_arr, out_type): > table = pa.table({"arr": in_arr}) > pq.write_table(table, "test.parquet") > schema = pa.schema({"arr": out_type}) > ds.dataset("test.parquet", schema=schema).to_table() > # MapType > ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32()) > ty = pa.map_(pa.int32(), pa.int32()) > arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # ListType > ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False)) > ty = pa.list_(pa.int32()) > arr_named = pa.array([[1, 2, 4]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Combination MapType and ListType > ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", > pa.int32(), nullable=True)), nullable=False)) > ty = 
pa.map_(pa.string(), pa.list_(pa.int32())) > arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Traceback (most recent call last): > # File "", line 1, in > # File "", line 5, in roundtrip_scanner > # File "pyarrow/_dataset.pyx", line 331, in > pyarrow._dataset.Dataset.to_table > # File "pyarrow/_dataset.pyx", line 2577, in > pyarrow._dataset.Scanner.to_table > # File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > # File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status > # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577414#comment-17577414 ] Kouhei Sutou commented on ARROW-17110: -- I think that we can switch to C\+\+14 or C\+\+17, because it seems that we can mix a binary built with the default g\+\+ and a binary built with the devtoolset's g\+\+ in the same process on CentOS 7. I am thinking of the following 2 cases: 1. Build Arrow with the devtoolset's g\+\+ and use the built Arrow as a library for a C++ program that is built with the default g\+\+. 2. {{dlopen()}} Arrow built with the devtoolset's g\+\+ and a library built with the default g\+\+ in the same process. 1. is meaningless for us as Antoine said. Sorry. 2. may happen with Ruby. For example, {{ruby -r red-arrow -r unf_ext -e 'nil'}}. ({{unf_ext}} is one of the Ruby libraries that use C++.) It seems that 2. works too. So I think that we can switch to C\+\+14 or C\+\+17. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > {_}eventually{_}, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). 
For now, things seems to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C\+\+11 while conda-forge moved to C\+\+17 - at least on unix, but windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577430#comment-17577430 ] Kouhei Sutou commented on ARROW-17329: -- Sorry... I gave the wrong CMake variable name... Could you use {{-DCMAKE_FIND_DEBUG_MODE=ON}} instead of {{-DCMAKE_FIND_PACKAGE_DEBUG=ON}}? > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? 
> * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17357) [CI][Conan] Enable JSON
Kouhei Sutou created ARROW-17357: Summary: [CI][Conan] Enable JSON Key: ARROW-17357 URL: https://issues.apache.org/jira/browse/ARROW-17357 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17357) [CI][Conan] Enable JSON
[ https://issues.apache.org/jira/browse/ARROW-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17357: --- Labels: pull-request-available (was: ) > [CI][Conan] Enable JSON > --- > > Key: ARROW-17357 > URL: https://issues.apache.org/jira/browse/ARROW-17357 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou commented on ARROW-17329: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} {noformat} > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? 
> * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou edited comment on ARROW-17329 at 8/9/22 2:38 PM: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. was (Author: kou): It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} {noformat} > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? 
> ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou edited comment on ARROW-17329 at 8/9/22 2:38 PM: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. was (Author: kou): It seems that Alpin Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? 
> ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17299) [C++] [Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters
[ https://issues.apache.org/jira/browse/ARROW-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17299: --- Labels: pull-request-available (was: ) > [C++] [Python] Expose the Scanner kDefaultBatchReadahead and > kDefaultFragmentReadahead parameters > - > > Key: ARROW-17299 > URL: https://issues.apache.org/jira/browse/ARROW-17299 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Ziheng Wang >Assignee: Ziheng Wang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In the Scanner there are parameters kDefaultFragmentReadahead and > kDefaultBatchReadahead that are currently set to fixed numbers that cannot be > changed. > This is not great because tuning these numbers is the key to tradeoff RAM > usage and network IO utilization during reading. For example on an i3.2xlarge > instance on AWS you can get peak throughput only by quadrupling > kDefaultFragmentReadahead from the default. > The current settings are very conservative and assume a < 1Gbps network. > Exposing them allow people to tune the Scanner behavior to their own > hardware. -- This message was sent by Atlassian Jira (v8.20.10#820010)
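As a rough sketch of what exposing these knobs could look like (the names and default values below are illustrative stand-ins, not Arrow's actual C++ or Python API), tuning would then reduce to overriding a default instead of recompiling:

```python
from dataclasses import dataclass

@dataclass
class ScanOptions:
    # Hypothetical stand-ins for kDefaultBatchReadahead / kDefaultFragmentReadahead;
    # the real constants live in Arrow's C++ Scanner and may differ.
    batch_readahead: int = 32
    fragment_readahead: int = 4

# On a high-bandwidth instance, quadrupling the fragment readahead (as the
# reporter found necessary on i3.2xlarge) becomes a one-line change:
opts = ScanOptions(fragment_readahead=4 * ScanOptions.fragment_readahead)
print(opts.fragment_readahead)  # 16
```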
[jira] [Created] (ARROW-17358) [CI][C++] Add a job for Alpine Linux
Kouhei Sutou created ARROW-17358: Summary: [CI][C++] Add a job for Alpine Linux Key: ARROW-17358 URL: https://issues.apache.org/jira/browse/ARROW-17358 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17295) [C++] Build separate bundled_depenencies.so
[ https://issues.apache.org/jira/browse/ARROW-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones closed ARROW-17295. -- Resolution: Won't Fix > [C++] Build separate bundled_depenencies.so > --- > > Key: ARROW-17295 > URL: https://issues.apache.org/jira/browse/ARROW-17295 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.0, 8.0.1 >Reporter: Will Jones >Priority: Major > > When building arrow _static_ libraries with bundled dependencies, we produce > {{{}arrow_bundled_dependencies.a{}}}. But when building dynamic libraries, > the bundled dependencies are statically linked directly into the arrow > libraries (libarrow, libarrow_flight, etc.). This means that users can access > the symbols of bundled dependencies in the static case, but not in the > dynamic library case. > One use case of this is being able to pass in gRPC configuration to a Flight > server, which requires access to gRPC symbols. > Could we change the dynamic library building to build an > {{arrow_bundled_dependencies.so}} so that the symbols are accessible? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15943: Labels: dataset (was: ) > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > "{{{}prediction"{}}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17358) [CI][C++] Add a job for Alpine Linux
[ https://issues.apache.org/jira/browse/ARROW-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17358: --- Labels: pull-request-available (was: ) > [CI][C++] Add a job for Alpine Linux > > > Key: ARROW-17358 > URL: https://issues.apache.org/jira/browse/ARROW-17358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577474#comment-17577474 ] Weston Pace commented on ARROW-15943: - I could see a few different ways this could be implemented: * We could add support for exclusion / inclusion filters for dataset discovery. These could be regular expressions that are applied to the filenames to determine whether we should or should not include them. * We could do more to support custom partitioning functions. The user could then create their own partitioning which includes this part of the filename as a partitioning column. * We could (not sure if we support this today or not) make sure we support filtering based on the filename column. However, this approach has the downside of loading all the unwanted data into memory. Do any of those approaches seem more appealing than the others? > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > "{{{}prediction"{}}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. 
> A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
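The first approach listed in the comment above (inclusion/exclusion filters at discovery time) can already be approximated outside the library by pre-filtering the file list before constructing the dataset. A minimal plain-Python sketch; the actual ds.dataset() call is left as a comment because, per the issue, passing an explicit file list currently loses directory-based partition information:

```python
import os
import re

all_files = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

# Keep only the "summary" subset, mimicking an inclusion filter that would
# run during dataset discovery.
pattern = re.compile(r"^summary-")
summary_files = [f for f in all_files if pattern.match(os.path.basename(f))]
print(summary_files)

# ds.dataset(summary_files, ...)  # caveat: an explicit file list is not
#                                 # read back as partitioned data today
```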
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577480#comment-17577480 ] Duncan Bellamy commented on ARROW-17329: Thanks for finding that; strange how it wasn't causing an error before, unless it was changed with an update. I will report that and try building after it's fixed. > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. 
Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17347) [C++][Docs] Describe limitations and alternatives for handling dependencies via package managers
[ https://issues.apache.org/jira/browse/ARROW-17347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577482#comment-17577482 ] Will Jones commented on ARROW-17347: We may also wish to mention the bundled dependencies, and explain why they are generally a last resort. See also discussion: https://lists.apache.org/thread/hgory6jkqg2vcqsw36635gsqcvkgk45z > [C++][Docs] Describe limitations and alternatives for handling dependencies > via package managers > > > Key: ARROW-17347 > URL: https://issues.apache.org/jira/browse/ARROW-17347 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Kae Suarez >Priority: Major > > In this page: [https://arrow.apache.org/docs/developers/cpp/building.html] it > is described how package managers can be used to get dependencies for Arrow. > My specific experience is with Apple Silicon on macOS, so I can describe my > experience there. A Brewfile is provided to assist getting necessary packages > to build Arrow, but it does not include all relevant packages to build the > most Arrow features possible on Mac (i.e., everything but CUDA). Diving > further, I learned that Homebrew cannot provide all necessary dependencies > for features such as GCS support, and had to turn to conda-forge. Upon doing > so, I started having overlapping dependencies elsewhere, and eventually had > to turn fully to conda. > It would be helpful to have the limitations laid out for what features can be > built with what package managers, as well as adding conda as an alternative > to Homebrew for macOS users, since that is necessary to build the fullest > Arrow possible without building other libraries from source as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-17338: --- Assignee: Todd Farmer > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Todd Farmer >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > We got an IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is that the following code in `BaseVariableWidthVector.handleSafe` > can fail to reallocate because of int overflow, which then leads to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 
0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
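The overflow described in the issue can be reproduced in miniature: with Java's 32-bit int arithmetic, startOffset + dataLength wraps to a negative value, so the valueBuffer.capacity() comparison passes and no reallocation happens. A sketch simulating the wraparound (the values are illustrative, chosen near the index in the stack trace; the usual fix is to widen the sum to 64 bits before comparing, e.g. (long) startOffset + dataLength in Java):

```python
import ctypes

def java_int_add(a, b):
    """Add two values with Java's 32-bit int wraparound semantics."""
    return ctypes.c_int32(a + b).value

start_offset = 2_147_312_542  # near Integer.MAX_VALUE, as in the report
data_length = 13 + 200_000    # illustrative: enough to push the sum past 2**31 - 1

wrapped = java_int_add(start_offset, data_length)
print(wrapped < 0)  # True: a negative sum is always < capacity(), so no realloc

# Widening before the addition keeps the comparison honest:
print(start_offset + data_length > 2**31 - 1)  # True
```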
[jira] [Updated] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17327: --- Labels: pull-request-available (was: ) > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17351) [C++][Compute] Support to initialize expression with a string
[ https://issues.apache.org/jira/browse/ARROW-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577503#comment-17577503 ] Weston Pace commented on ARROW-17351: - I think the first (more verbose) option is preferred because it will be more generic. However, once the first option is working, the second option can always be added later as an optional shortcut (and then both are supported). > [C++][Compute] Support to initialize expression with a string > - > > Key: ARROW-17351 > URL: https://issues.apache.org/jira/browse/ARROW-17351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.1 >Reporter: LinGeLin >Priority: Major > > I want to implement this functionality, and would first like to ask which > approach is better. > > For example, I want to initialize an expression whose content is > a210 - (a210 / 203) * 203 = 0 > This means that column A210 modulo 203 is equal to 0 > > How do these two approaches compare? > > "(subtract(a210, multiply(divide(a210, 203), 203)) == 0)" to Expression > or > "a210-(a210/203)*203==0" to Expression -- This message was sent by Atlassian Jira (v8.20.10#820010)
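As a sketch of what the second (infix) option involves, Python's stdlib ast module parses the same string into exactly the tree that the verbose form spells out by hand. This only illustrates the parsing problem; it is not Arrow's Expression API, and the to_calls helper below is purely hypothetical:

```python
import ast

src = "a210 - (a210 / 203) * 203 == 0"
tree = ast.parse(src, mode="eval").body

# The infix string parses to the same shape as the verbose form
# "(subtract(a210, multiply(divide(a210, 203), 203)) == 0)":
print(type(tree).__name__)          # Compare
print(type(tree.left).__name__)     # BinOp  (the subtract)
print(type(tree.left.op).__name__)  # Sub

def to_calls(node):
    """Render the AST back in the verbose call syntax (illustrative only)."""
    ops = {ast.Sub: "subtract", ast.Mult: "multiply", ast.Div: "divide"}
    if isinstance(node, ast.BinOp):
        return "%s(%s, %s)" % (ops[type(node.op)],
                               to_calls(node.left), to_calls(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise NotImplementedError(type(node).__name__)

print(to_calls(tree.left))  # subtract(a210, multiply(divide(a210, 203), 203))
```

This is why the function-call form is the more generic starting point: the infix form still has to be lowered to the same call tree, plus a grammar for precedence and parentheses.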
[jira] [Commented] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577524#comment-17577524 ] Joris Van den Bossche commented on ARROW-10739: --- [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome. I think option (1) is a good path forward. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577524#comment-17577524 ] Joris Van den Bossche edited comment on ARROW-10739 at 8/9/22 4:40 PM: --- [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome! I think option (1) is a good path forward. was (Author: jorisvandenbossche): [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome. I think option (1) is a good path forward. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
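The behaviour in the report can be mimicked without Arrow at all: a zero-copy slice that keeps a reference to its parent buffer pickles the whole buffer, while a compacted copy pickles only the window. This is a toy stand-in for Arrow's Array.slice, not pyarrow code; the compact() method here plays the role of the "copy the array to trim the buffers" workaround the reporter asks about:

```python
import pickle

class SlicedArray:
    """Toy analogue of an Arrow array: a buffer plus an (offset, length) view."""
    def __init__(self, buf, offset, length):
        self.buf, self.offset, self.length = buf, offset, length

    def slice(self, offset, length):
        # Zero-copy: the new view still references the full parent buffer.
        return SlicedArray(self.buf, self.offset + offset, length)

    def compact(self):
        # Copy and trim: drop the offset and keep only the referenced bytes.
        return SlicedArray(self.buf[self.offset:self.offset + self.length],
                           0, self.length)

arr = SlicedArray(b"foo" * 100_000, 0, 300_000)
view = arr.slice(30, 3)
print(len(pickle.dumps(view)) > 100_000)        # True: whole buffer serialized
print(len(pickle.dumps(view.compact())) < 500)  # True: only the 3-byte window
```

Option (1) mentioned in the comment would effectively make pickling do the compaction implicitly.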
[jira] [Commented] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577525#comment-17577525 ] Joris Van den Bossche commented on ARROW-17327: --- Out of curiosity, do you have an idea why this started to fail with pyarrow 8.0? (I am not aware of something we have changed regarding this, and pyarrow has been built against a parquet library for a long time) > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577531#comment-17577531 ] Joris Van den Bossche commented on ARROW-17327: --- Ah, actually, pyarrow itself has always been built against libparquet (and included this in the wheels), but for the arrow_python library itself this dependency was indeed introduced in 8.0 with ARROW-9947. > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-17352. Resolution: Not A Problem > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0 > - > > Key: ARROW-17352 > URL: https://issues.apache.org/jira/browse/ARROW-17352 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 9.0.0 > Environment: Windows10 >Reporter: Oliver Klein >Priority: Critical > Attachments: arrow9error.PNG > > > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0. It worked when stored with version 8 and earlier. > Windows Parquet Viewer: 2.3.5 and 2.3.6 > pyarrow version: 9.0.0 > Error: System.AggregateException: One or more errors occured. ---> > Parquet.ParquetException: encoding RLE_DICTIONARY is not supported. > at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in > DataColumnReader.cs: line 259 > > After further checking I found that it seems the problem seems to relate to a > default parquet version change. > When I use pyarrow 9 and configure version to 1.0 it works again from the > windows tool - when its 2.4 its not working (or supported in the windows > tool). > df.to_parquet(r'C:\temp\test_10.parquet', version='1.0') > df.to_parquet(r'C:\temp\test_24.parquet', version='2.4') > Question might be if such a default change is a bug or a feature. > Finally found: > * ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280) > So probably its a feature - and we need to adapt our code > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577537#comment-17577537 ] Will Jones commented on ARROW-17352: Yes, it's intentional that we increased the default version of Parquet. Hopefully Windows Parquet Viewer will add support for the more recent version soon. > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0 > - > > Key: ARROW-17352 > URL: https://issues.apache.org/jira/browse/ARROW-17352 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 9.0.0 > Environment: Windows10 >Reporter: Oliver Klein >Priority: Critical > Attachments: arrow9error.PNG > > > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0. It worked when stored with version 8 and earlier. > Windows Parquet Viewer: 2.3.5 and 2.3.6 > pyarrow version: 9.0.0 > Error: System.AggregateException: One or more errors occured. ---> > Parquet.ParquetException: encoding RLE_DICTIONARY is not supported. > at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in > DataColumnReader.cs: line 259 > > After further checking I found that it seems the problem seems to relate to a > default parquet version change. > When I use pyarrow 9 and configure version to 1.0 it works again from the > windows tool - when its 2.4 its not working (or supported in the windows > tool). > df.to_parquet(r'C:\temp\test_10.parquet', version='1.0') > df.to_parquet(r'C:\temp\test_24.parquet', version='2.4') > Question might be if such a default change is a bug or a feature. > Finally found: > * ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280) > So probably its a feature - and we need to adapt our code > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-17338: --- Assignee: Xianyang Liu (was: Todd Farmer) > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > We got a IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is the following code of `BaseVariableWidthVector.handleSafe` > could fail to relocated because of int overflow and then led to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 
0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
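The int overflow in `handleSafe` can be reproduced with a stdlib-only Python sketch that emulates Java's 32-bit wraparound. The concrete numbers below are chosen to approximate the stack trace (an offset near `Integer.MAX_VALUE` plus a 13-byte value); the suggested fix is hedged as widening the sum to 64 bits, i.e. `(long) startOffset + dataLength`:

```python
def to_int32(x):
    """Wrap x the way Java's 32-bit int arithmetic does."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


start_offset = 2_147_483_640   # Java int, near Integer.MAX_VALUE
data_length = 13
capacity = 2_147_483_648       # valueBuffer.capacity() returns a long

# Java: startOffset + dataLength is evaluated in 32-bit int math first,
# so the sum wraps negative and the capacity check wrongly passes.
needed_int32 = to_int32(start_offset + data_length)
assert needed_int32 < 0
assert capacity >= needed_int32   # no realloc -> IndexOutOfBounds later

# Fix sketch: widen before adding, e.g. (long) startOffset + dataLength.
needed_long = start_offset + data_length
assert capacity < needed_long     # reallocDataBuffer() is triggered
```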
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577564#comment-17577564 ] Joris Van den Bossche commented on ARROW-17335: --- Mypy doesn't use pyi files when eg doing `mypy pyarrow`? > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. 
This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? > I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
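Option 3 from the list above (inline hints plus a `py.typed` marker, per PEP 561) can be sketched with a hypothetical package `mypkg`; the function and file names are illustrative only:

```python
# mypkg/__init__.py -- inline type hints shipped with the package
from typing import List


def nbytes_of(values: List[str]) -> int:
    """Inline-annotated API (option 3): the types live with the code."""
    return sum(len(v.encode()) for v in values)


# PEP 561: ship an empty marker file `mypkg/py.typed` next to the code
# and include it in the packaging metadata (e.g. package_data in
# setup.py) so mypy reads the inline hints instead of reporting
# "missing library stubs or py.typed marker".
print(nbytes_of(["foo", "bar"]))  # prints 6
```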
[jira] [Commented] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577566#comment-17577566 ] Todd Farmer commented on ARROW-17338: - [~coneyliu] : Thank you for the bug report and the pull request! Your account has been given the "contributor" role, which allows ARROW issues - such as this - to be assigned to you. I've assigned this issue to you to reflect your already-made contributions - please assign the issue to me if you have any concern with that. > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > We got a IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is the following code of 
`BaseVariableWidthVector.handleSafe` > could fail to relocated because of int overflow and then led to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17169) [Go] goPanicIndex in firstTimeBitmapWriter.Finish()
[ https://issues.apache.org/jira/browse/ARROW-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577569#comment-17577569 ] Matthew Topol commented on ARROW-17169: --- [~Purdom] any updates here? > [Go] goPanicIndex in firstTimeBitmapWriter.Finish() > --- > > Key: ARROW-17169 > URL: https://issues.apache.org/jira/browse/ARROW-17169 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 9.0.0, 8.0.1 > Environment: go (1.18.3), Linux, AMD64 >Reporter: Robert Purdom >Priority: Critical > > I'm working with complex parquet files with 500+ "root" columns where some > fields are lists of structs, internally referred to as 'topics'. Some of > these structs have 100's of columns. When reading a particular topic, I get > an Index Panic at the line indicated below. This error occurs when the value > for the topic is Null, as in, for this particular root record, this topic has > no data. The root is household data, the topic is auto, so the error occurs > when the household has no autos. The auto field is a Nullable List of Struct. > > {code:go} > /* Finish() was called from defLevelsToBitmapInternal. > data values when panic occurs > bw.length == 17531 > bw.bitMask == 1 > bw.pos == 3424 > bw.length == 17531 > len(bw.Buf) == 428 > cap(bw.Buf) == 448 > bw.byteOffset == 428 > bw.curByte == 0 > */ > // bitmap_writer.go > func (bw *firstTimeBitmapWriter) Finish() { > // store curByte into the bitmap > if bw.length >0&& bw.bitMask !=0x01|| bw.pos < bw.length { > bw.buf[int(bw.byteOffset)] = bw.curByte // < Panic index > } > } > {code} > In every case, when the panic occurs, bw.byteOffset == len(bw.Buf). I tested > the below modification and it does remedy the bug. However, it's probably > only masking the actual bug. 
> {code:go} > // Test version: No Panic > func (bw *firstTimeBitmapWriter) Finish() { > // store curByte into the bitmap > if bw.length > 0 && bw.bitMask != 0x01 || bw.pos < bw.length { > if int(bw.byteOffset) == len(bw.Buf) { > bw.buf = append(bw.buf, bw.curByte) > } else { >bw.buf[int(bw.byteOffset)] = bw.curByte >} > } > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go
[ https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-17326: -- Component/s: FlightRPC Go SQL > [Go][FlightSQL] Add Support for FlightSQL to Go > --- > > Key: ARROW-17326 > URL: https://issues.apache.org/jira/browse/ARROW-17326 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Go, SQL >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > > Also addresses https://github.com/apache/arrow/issues/12496 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go
[ https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17326: --- Labels: pull-request-available (was: ) > [Go][FlightSQL] Add Support for FlightSQL to Go > --- > > Key: ARROW-17326 > URL: https://issues.apache.org/jira/browse/ARROW-17326 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Go, SQL >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Also addresses https://github.com/apache/arrow/issues/12496 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577573#comment-17577573 ] Kae Suarez commented on ARROW-17344: Unfortunately, recompiling with ORC on just worked out. Maybe it just grabbed the include path correctly this time. I'll switch the status until someone can reproduce this. > [C++] ORC in build breaks code using Arrow and not ORC > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Major > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17344: --- Priority: Trivial (was: Major) > [C++] ORC in build breaks code using Arrow and not ORC > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Trivial > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17344: --- Summary: [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible (was: [C++] ORC in build breaks code using Arrow and not ORC) > [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Trivial > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17293) [Java][CI] Prune java nightly builds
[ https://issues.apache.org/jira/browse/ARROW-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce reassigned ARROW-17293: - Assignee: David Dali Susanibar Arce > [Java][CI] Prune java nightly builds > > > Key: ARROW-17293 > URL: https://issues.apache.org/jira/browse/ARROW-17293 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Java >Reporter: Jacob Wujciak-Jens >Assignee: David Dali Susanibar Arce >Priority: Critical > > Currently we are accumulating a huge number of nightly java jars. We should > prune them to keep max. 14 versions around. (see r_nightly.yml) > It might also be nice to always rename/copy the most recent jars to something > fixed so there is no need to update your local config to always have the > newest version? (but up to the java users if this is necessary/worth it). > > cc [~dsusanibara] [~ljw1001] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17359) [Go][FlightSQL] Create SQLite example
Matthew Topol created ARROW-17359: - Summary: [Go][FlightSQL] Create SQLite example Key: ARROW-17359 URL: https://issues.apache.org/jira/browse/ARROW-17359 Project: Apache Arrow Issue Type: New Feature Components: FlightRPC, Go, SQL Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
Matthew Roeschke created ARROW-17360: Summary: [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns Key: ARROW-17360 URL: https://issues.apache.org/jira/browse/ARROW-17360 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 8.0.1 Reporter: Matthew Roeschke xref [https://github.com/pandas-dev/pandas/issues/47944] {code:java} In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}) # pandas main branch / 1.5 In [2]: df.to_orc("abc") In [3]: pd.read_orc("abc", columns=['b', 'a']) Out[3]: a b 0 1 a 1 2 b 2 3 c In [4]: import pyarrow.orc as orc In [5]: orc_file = orc.ORCFile("abc") # reordered to a, b In [6]: orc_file.read(columns=['b', 'a']).to_pandas() Out[6]: a b 0 1 a 1 2 b 2 3 c # reordered to a, b In [7]: orc_file.read(columns=['b', 'a']) Out[7]: pyarrow.Table a: int64 b: string a: [[1,2,3]] b: [["a","b","c"]] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17170) [C++][Docs] Research Documentation Formats
[ https://issues.apache.org/jira/browse/ARROW-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez closed ARROW-17170. -- Resolution: Done We have plenty of sources, and are moving forward with the Getting Started page currently. > [C++][Docs] Research Documentation Formats > -- > > Key: ARROW-17170 > URL: https://issues.apache.org/jira/browse/ARROW-17170 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Documentation >Reporter: Kae Suarez >Assignee: Kae Suarez >Priority: Major > > In order to revise the documentation, some inspiration is needed to get the > format right. This ticket provides a space for exploration of possible > inspiration for the C++ documentation – once we have some good examples > and/or agreement, we can move to some content creation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17361) dplyr::summarize fails with division when divisor is a variable
Oliver Reiter created ARROW-17361: - Summary: dplyr::summarize fails with division when divisor is a variable Key: ARROW-17361 URL: https://issues.apache.org/jira/browse/ARROW-17361 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 8.0.0 Reporter: Oliver Reiter Hello, I found this odd behaviour when trying to compute an aggregate with dplyr::summarize: When I want to use a pre-defined variable to do a divison while aggregating, the execution fails with 'unsupported expression'. When I the value of the variable as is in the aggregation, it works. See below: {code:java} library(dplyr) library(arrow) small_dataset <- tibble::tibble( ## x = rep(c("a", "b"), each = 5), y = rep(1:5, 2) ) ## convert "small_dataset" into a ...dataset tmpdir <- tempfile() dir.create(tmpdir) write_dataset(small_dataset, tmpdir) ## works open_dataset(tmpdir) %>% summarize(value = sum(y) / 10) %>% collect() ## fails scale_factor <- 10 open_dataset(tmpdir) %>% summarize(value = sum(y) / scale_factor) %>% collect() #> Fehler: Error in summarize_eval(names(exprs)[i], #> exprs[[i]], ctx, length(.data$group_by_vars) > : # Expression sum(y)/scale_factor is not an aggregate # expression or is not supported in Arrow # Call collect() first to pull data into R. {code} I was not sure how to name this issue/bug (if it is one), so if there is a clearer, more descriptive title you're welcome to adjust. Thanks for your work! Oliver {code:java} > arrow_info() Arrow package version: 8.0.0 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc TRUE mimalloc TRUE Memory: Allocator jemalloc Current 64 bytes Max 41.25 Kb Runtime: SIMD Level avx2 Detected SIMD Level avx2 Build: C++ Library Version 8.0.0 C++ Compiler GNU C++ Compiler Version 12.1.0 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577621#comment-17577621 ] Nicola Crane commented on ARROW-15943: -- I'm not sure. >From an R perspective, if it's an option, I think it would be fine to support >passing in a list of filenames but still being able to use the directory names >as dataset variables, if that's possible (as R users are likely to be >comfortable pre-filtering the list of files). This feels like it would fit >with option 3; I am currently working on ARROW-15260 which would allow users >to add the fragment filename as a column, which they could then use to filter >on (though I recall in a conversation on that PR or ticket, you mentioning >that we can't properly do pushdown filtering yet using that?) However, you >mention the issue of loading the unwanted data into memory - I guess for these >users they might choose to use something other than arrow if this was >acceptable to them. Option 1 sounds good too. I don't fully understand what option 2 would look like, but if it's something we could wrap in R to achieve solutions to the 2 linked Stack Overflow questions, then great. Ultimately, I don't think there's an obvious best approach here, and that solving for the simplest case ("I have directories containing files, which I wish to both selectively load in some files from, but also use the directory structure to create variables") will get us most of the way there unless any super-specialist use cases emerge later. Option 1 sounds potentially simplest? 
> [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17290) [C++] Add order-comparisons for numeric scalars
[ https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577629#comment-17577629 ] Yaron Gvili commented on ARROW-17290: - {quote}I'm curious, what is the use case? {quote} See [this post|http://example.com]. > [C++] Add order-comparisons for numeric scalars > --- > > Key: ARROW-17290 > URL: https://issues.apache.org/jira/browse/ARROW-17290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, only equal-comparison of scalars are supported, by > `EqualComparable`. This issue will add order-comparisons, such as less-than, > to numeric scalars. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17290) [C++] Add order-comparisons for numeric scalars
[ https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577629#comment-17577629 ] Yaron Gvili edited comment on ARROW-17290 at 8/9/22 9:16 PM: - {quote}I'm curious, what is the use case? {quote} See [this post|https://github.com/apache/arrow/pull/13784#issuecomment-1209861142]. was (Author: JIRAUSER284707): {quote}I'm curious, what is the use case? {quote} See [this post|http://example.com]. > [C++] Add order-comparisons for numeric scalars > --- > > Key: ARROW-17290 > URL: https://issues.apache.org/jira/browse/ARROW-17290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, only equal-comparison of scalars are supported, by > `EqualComparable`. This issue will add order-comparisons, such as less-than, > to numeric scalars. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-11699) [R] Implement dplyr::across() inside mutate()
[ https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-11699: - Summary: [R] Implement dplyr::across() inside mutate() (was: [R] Implement dplyr::across()) > [R] Implement dplyr::across() inside mutate() > - > > Key: ARROW-11699 > URL: https://issues.apache.org/jira/browse/ARROW-11699 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > It's not a generic, but because it seems only to be called inside of > functions like `mutate()`, we can insert our own version of it into the NSE > data mask -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17362) [R] Implement dplyr::across() inside summarise()
Nicola Crane created ARROW-17362: Summary: [R] Implement dplyr::across() inside summarise() Key: ARROW-17362 URL: https://issues.apache.org/jira/browse/ARROW-17362 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate(). Once this is merged, we should also add the ability to do so within dplyr::summarise(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
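For context, the target semantics here match plain dplyr; a sketch of the kind of query that should eventually work against an Arrow table (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)
library(arrow)

## Once supported, across() inside summarise() should translate just as
## it does on a plain data frame:
arrow_table(mtcars) %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean)) %>%
  collect()
{code}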
[jira] [Commented] (ARROW-16017) [C++] Benchmark key_hash and document tradeoffs with vendored xxhash
[ https://issues.apache.org/jira/browse/ARROW-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577635#comment-17577635 ] Aldrin Montana commented on ARROW-16017: Linking to ARROW-8991, where the PR starts to add some of this benchmarking > [C++] Benchmark key_hash and document tradeoffs with vendored xxhash > > > Key: ARROW-16017 > URL: https://issues.apache.org/jira/browse/ARROW-16017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > ARROW-15239 adds a vectorized hashing function for use initially in execution > engine bloom filters and later in the hash-join node. We should add some > benchmarks to explore how the performance compares to the vendored scalar > xxhash implementation. In addition, we should document where the two differ > and explain any tradeoffs for future users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577637#comment-17577637 ] Weston Pace commented on ARROW-15943: - I agree that option 1 is the simplest and probably preferred option. > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15943: Labels: dataset good-second-issue (was: dataset) > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset, good-second-issue > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17363) [C++][Compute] Add Cryptographic hash functions to Acero
Aldrin Montana created ARROW-17363: -- Summary: [C++][Compute] Add Cryptographic hash functions to Acero Key: ARROW-17363 URL: https://issues.apache.org/jira/browse/ARROW-17363 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Aldrin Montana We would like to add cryptographic hash function kernels to Acero (MD5, SHA1, etc.). At this time, there are no particular cryptographic functions we want to start adding, but the only hash functions available seem to be variants or specializations of xxHash, which is not appropriate for cryptographic use cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17363) [C++][Compute] Add Cryptographic hash functions to Acero
[ https://issues.apache.org/jira/browse/ARROW-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577641#comment-17577641 ] Aldrin Montana commented on ARROW-17363: Linking to ARROW-8991, which adds some hashing functions to the compute API > [C++][Compute] Add Cryptographic hash functions to Acero > > > Key: ARROW-17363 > URL: https://issues.apache.org/jira/browse/ARROW-17363 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Aldrin Montana >Priority: Minor > Labels: C++, compute, cryptography, query-engine > > We would like to add cryptographic hash function kernels to Acero (MD5, SHA1, > etc.). > At this time, there are no particular cryptographic functions we want to > start adding, but the only hash functions available seem to be variants or > specializations of xxHash, which is not appropriate for cryptographic use > cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17364) [R] Implement .names argument inside across()
Nicola Crane created ARROW-17364: Summary: [R] Implement .names argument inside across() Key: ARROW-17364 URL: https://issues.apache.org/jira/browse/ARROW-17364 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
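For reference, {{.names}} in plain dplyr is a glue-style specification for the output column names; the behaviour to replicate looks like this (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)

## "{.col}" expands to each input column name, so this produces
## columns named mpg_mean and hp_mean:
mtcars %>%
  summarise(across(c(mpg, hp), mean, .names = "{.col}_mean"))
{code}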
[jira] [Created] (ARROW-17365) [R] Implement ... argument inside across()
Nicola Crane created ARROW-17365: Summary: [R] Implement ... argument inside across() Key: ARROW-17365 URL: https://issues.apache.org/jira/browse/ARROW-17365 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17365: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. was:ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17365: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. There is a failing test in the PR for ARROW-11699 which references this JIRA. was: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577648#comment-17577648 ] Nicola Crane commented on ARROW-17365: -- We should not implement this as the {{...}} argument [is deprecated|https://github.com/tidyverse/dplyr/blob/HEAD/R/across.R#L36] > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17365. Resolution: Won't Fix > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17364) [R] Implement .names argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17364: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. Additional tests looking at different ways of specifying the {{.fns}} argument should be re-enabled (see tests with this ticket number in their comments, and https://github.com/tidyverse/dplyr/issues/6395 for more context). was:ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. > [R] Implement .names argument inside across() > - > > Key: ARROW-17364 > URL: https://issues.apache.org/jira/browse/ARROW-17364 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{.names}} argument is not yet supported but should be added. > Additional tests looking at different ways of specifying the {{.fns}} > argument should be re-enabled (see tests with this ticket number in their > comments, and https://github.com/tidyverse/dplyr/issues/6395 for more > context). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()
Nicola Crane created ARROW-17366: Summary: [R] Support purrr-style lambda functions in .fns argument to across() Key: ARROW-17366 URL: https://issues.apache.org/jira/browse/ARROW-17366 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for dplyr::across inside a mutate(). The .fns argument does not yet support purrr-style lambda functions (e.g. {{~round(.x, digits = -1)}}), but support should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
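For reference, the purrr-style lambda syntax to be supported is shorthand for an anonymous function of {{.x}}; in plain dplyr it behaves like this (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)

## ~ round(.x, digits = -1) is shorthand for
## function(.x) round(.x, digits = -1), applied to each selected column:
mtcars %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = -1)))
{code}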