[jira] [Updated] (ARROW-17304) [C++][Compute] Improve error messages in aggregate test
[ https://issues.apache.org/jira/browse/ARROW-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17304:
-----------------------------------
    Priority: Minor  (was: Major)

> [C++][Compute] Improve error messages in aggregate test
>
>          Key: ARROW-17304
>          URL: https://issues.apache.org/jira/browse/ARROW-17304
>      Project: Apache Arrow
>   Issue Type: Improvement
>   Components: C++
>     Reporter: Yibo Cai
>     Assignee: Yibo Cai
>     Priority: Minor
>       Labels: pull-request-available
>      Fix For: 10.0.0
>
>   Time Spent: 50m
> Remaining Estimate: 0h
>
> Print actual values when a comparison fails, to help debugging.
> See https://github.com/apache/arrow/issues/12681

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Resolved] (ARROW-17304) [C++][Compute] Improve error messages in aggregate test
[ https://issues.apache.org/jira/browse/ARROW-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-17304.
------------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 13814
[https://github.com/apache/arrow/pull/13814]
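The improvement above is to Arrow's C++ aggregate tests, but the idea translates directly: report the actual and expected values in the failure message instead of a bare pass/fail. A minimal Python sketch; `assert_agg_equal` is an illustrative helper, not Arrow API:

```python
# Sketch of "print actual values when a comparison fails".
# The helper name is made up for illustration; the real change lives in
# Arrow's C++ aggregate test helpers.
def assert_agg_equal(actual, expected, kernel_name):
    if actual != expected:
        # Include both values so a CI failure is debuggable from the log alone
        raise AssertionError(
            f"{kernel_name}: expected {expected!r}, got {actual!r}"
        )

assert_agg_equal(6, 6, "sum")  # equal values: no error raised
```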
[jira] [Resolved] (ARROW-17341) musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-17341.
------------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 13819
[https://github.com/apache/arrow/pull/13819]

> musl does not define _SC_LEVEL1_DCACHE_SIZE etc
>
>              Key: ARROW-17341
>              URL: https://issues.apache.org/jira/browse/ARROW-17341
>          Project: Apache Arrow
>       Issue Type: Bug
>       Components: C++
> Affects Versions: 9.0.0
>         Reporter: Duncan Bellamy
>         Assignee: Yibo Cai
>         Priority: Blocker
>           Labels: pull-request-available
>          Fix For: 10.0.0
>
>       Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Arrow 9.0.0 has new code that uses _SC_LEVEL1_DCACHE_SIZE, introduced in commit
> https://github.com/apache/arrow/commit/cde5a0800624649cd6558f339ded2024146cfd71
>
> There is a fallback function
> https://github.com/apache/arrow/blob/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2/cpp/src/arrow/util/cpu_info.cc#L326
> but it fills the same struct, which relies on those defines as well.
[jira] [Updated] (ARROW-17341) [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17341:
-----------------------------------
    Summary: [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc  (was: musl does not define _SC_LEVEL1_DCACHE_SIZE etc)
[jira] [Updated] (ARROW-17341) [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
[ https://issues.apache.org/jira/browse/ARROW-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-17341:
-----------------------------------
    Fix Version/s: 9.0.1

> [C++] musl does not define _SC_LEVEL1_DCACHE_SIZE etc
> Fix For: 10.0.0, 9.0.1
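The portability problem can be illustrated with Python's own `os.sysconf`, which wraps the same libc interface: probe for the configuration name instead of assuming the glibc-only constant exists. This is only an analogy to the C++ fix (which guards the uses with preprocessor checks), not the actual patch:

```python
import os

def l1_dcache_size():
    """Return the L1 data-cache size in bytes, or None when the platform
    does not expose _SC_LEVEL1_DCACHE_SIZE (e.g. musl libc).

    Mirrors the shape of the fix: check availability first, then fall
    back gracefully instead of assuming the glibc-only name is defined."""
    name = "SC_LEVEL1_DCACHE_SIZE"
    if name not in getattr(os, "sysconf_names", {}):
        return None  # constant not defined by this libc/platform
    try:
        value = os.sysconf(name)
    except (OSError, ValueError):
        return None
    return value if value > 0 else None

size = l1_dcache_size()  # an int on glibc Linux, None elsewhere
```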
[jira] [Commented] (ARROW-14126) [C++] Add locale support for relevant string compute functions
[ https://issues.apache.org/jira/browse/ARROW-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577295#comment-17577295 ]

Eduardo Ponce commented on ARROW-14126:
---------------------------------------
I agree that a more general approach would be a better solution. Closing this issue as it is not relevant in its current form.

> [C++] Add locale support for relevant string compute functions
>
>        Key: ARROW-14126
>        URL: https://issues.apache.org/jira/browse/ARROW-14126
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Eduardo Ponce
>   Priority: Major
>     Labels: kernel
>
> String compute functions do not make use of locale settings for case-changing transformations, string comparisons, and number-to-string casting. Arrow does provide UTF-8 string kernels to handle localization standardization. It would be good to add locale support for the string kernels that are affected by it.
> The following are string functions that take a `locale` option as their second argument:
> * str_to_lower
> * str_to_upper
> * str_to_title
[jira] [Resolved] (ARROW-14126) [C++] Add locale support for relevant string compute functions
[ https://issues.apache.org/jira/browse/ARROW-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eduardo Ponce resolved ARROW-14126.
-----------------------------------
    Resolution: Won't Do
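For context on why locale-aware rather than ASCII-only case kernels were requested: full Unicode case mapping can change a string's length, and some mappings are locale-dependent. A quick stdlib-only illustration:

```python
# Unicode case mapping can change string length: the German sharp s
# expands to two characters, which an ASCII-only kernel would miss.
print("straße".upper())   # STRASSE
print("garçon".upper())   # GARÇON (non-ASCII letter still mapped)

# Some mappings are locale-dependent and cannot be expressed without a
# locale input: under Turkish rules 'i' upper-cases to 'İ' (dotted
# capital I), not 'I', which str.upper() has no way to know.
```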
[jira] [Resolved] (ARROW-13570) [C++][Compute] Additional scalar ASCII kernels can reuse original offsets buffer
[ https://issues.apache.org/jira/browse/ARROW-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eduardo Ponce resolved ARROW-13570.
-----------------------------------
    Resolution: Duplicate

> [C++][Compute] Additional scalar ASCII kernels can reuse original offsets buffer
>
>        Key: ARROW-13570
>        URL: https://issues.apache.org/jira/browse/ARROW-13570
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Eduardo Ponce
>   Priority: Major
>
> Some scalar ASCII string kernels are able to reuse the original offsets buffer, so offsets are not preallocated in the output (these kernels use *MemAllocation::NO_PREALLOCATE* during registration). Currently, only kernels that apply a transformation to each character independently via [StringDataTransform|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L590-L631] support the no-preallocation policy. But there are additional string kernels that do not modify the length (nor the offsets) of the input string, yet apply scalar transforms that depend on neighboring characters.
> This issue should extend/create *StringDataTransform* to take multiple input transforms in order to support the *MemAllocation::NO_PREALLOCATE* policy for additional scalar ASCII kernels (e.g., _ascii_title_).
[jira] [Created] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
Oliver Klein created ARROW-17352:
------------------------------------

             Summary: Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
                 Key: ARROW-17352
                 URL: https://issues.apache.org/jira/browse/ARROW-17352
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 9.0.0
         Environment: Windows10
            Reporter: Oliver Klein
         Attachments: arrow9error.PNG

Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0. It worked when stored with version 8 and earlier.

Windows Parquet Viewer: 2.3.5 and 2.3.6
pyarrow version: 9.0.0

Error: System.AggregateException: One or more errors occured. ---> Parquet.ParquetException: encoding RLE_DICTIONARY is not supported.
at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in DataColumnReader.cs: line 259
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577308#comment-17577308 ]

Joris Van den Bossche commented on ARROW-17335:
-----------------------------------------------
AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in https://github.com/apache/arrow/pull/6676 as well), so I think that basically means we need to use the stub file approach? (but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)

> [Python] Type checking support
>
>               Key: ARROW-17335
>               URL: https://issues.apache.org/jira/browse/ARROW-17335
>           Project: Apache Arrow
>        Issue Type: New Feature
>        Components: Python
>          Reporter: Jorrick Sleijster
>          Priority: Major
> Original Estimate: 10h
> Remaining Estimate: 10h
>
> h1. mypy and static type checking
> As of Python 3.6, it has been possible to introduce typing information in the code. This became immensely popular in a short period of time. Shortly after, the tool `mypy` arrived, and it has become the industry standard for static type checking inside Python. It can check for invalid types very quickly, which makes it possible to run as a pre-commit. It has raised many bugs that I did not see myself and has been a very valuable tool.
> h2. Now what does this mean for PyArrow?
> When we run mypy on code that uses PyArrow, we get error messages like the following:
> ```
> some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
> some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
> some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing "pyarrow.fs": module is installed, but missing library stubs or py.typed marker
> ```
> More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker
> h2. You can solve this in three ways:
> # Ignore the message. This, however, turns all types from PyArrow into `Any`, making it impossible to catch user errors against the PyArrow library.
> # Create a Python stub file. This used to be the standard, but it is no longer a popular option, because stubs are extra files next to the source code, while you can instead inline the type hints with the code, which brings me to our third option.
> # Add a `py.typed` marker file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), keeps all the type hints with the code (as the documentation does now), and provides not only your users but also the developers of the library themselves with type hints (and hinting of issues inside your IDE).
>
> My personal opinion already shines through: it is option 3, which has quickly become the industry standard since its introduction.
> h2. What should we do?
> I'd very much like to work on this; however, I don't feel like wasting time. Therefore, I am raising this ticket to see if this has been considered before or if we just didn't get to it yet.
> I'd like to open the discussion here:
> # Do you agree with option 3 for the type hints?
> # Should we remove the documentation annotations for the type hints, given they will be inside the functions? Or should we keep them and also specify them in the code, which would duplicate the information?
[jira] [Updated] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Klein updated ARROW-17352:
---------------------------------
    Description: updated with the findings below.

After further checking I found that the problem seems to relate to a change of the default Parquet format version. When I use pyarrow 9 and set the version to 1.0, it works again from the Windows tool; when it is 2.4, it does not work (or is not supported by the Windows tool).

df.to_parquet(r'C:\temp\test_10.parquet', version='1.0')
df.to_parquet(r'C:\temp\test_24.parquet', version='2.4')

The question might be whether such a default change is a bug or a feature.

> Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
> Priority: Critical
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577314#comment-17577314 ]

Joris Van den Bossche commented on ARROW-17335:
-----------------------------------------------
{quote}Well it's not really a duplicate of ARROW-8175. The difference lies in the fact that that ticket is focused on performing type checking on the PyArrow code base and ensuring all the types are valid inside the library. My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow, by using type annotations on functions specified inside the PyArrow codebase.{quote}

It's indeed not exactly the same. But _in practice_, I think both aspects are very much related and we could (should?) do them at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good that at the same time we also _check_ that the type annotations we are adding are correct. (Although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose, as it doesn't guarantee the type annotations are actually correct; it only might find some incorrect ones.)
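Option 3 from the ticket (inline hints plus a `py.typed` marker) can be sketched with a tiny example; `ChunkedValues` and `count_valid` are made-up illustrations, not pyarrow API:

```python
# Sketch of option 3: inline type hints shipped alongside an (empty)
# py.typed marker file in the package, so mypy reads the hints straight
# from the installed code. All names below are illustrative only.
from typing import List, Optional

ChunkedValues = List[Optional[int]]

def count_valid(values: ChunkedValues) -> int:
    """Count non-null entries; callers are checked against these hints."""
    return sum(1 for v in values if v is not None)

print(count_valid([1, None, 3]))  # 2
```

With this in place, mypy flags a call like `count_valid("abc")` at check time rather than letting it misbehave at runtime.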
[jira] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943 ]

Nicola Crane deleted comment on ARROW-15943:
--------------------------------------------
was (Author: thisisnic): There is more user interest in implementing this feature: https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r

> [C++] Filter which files to be read in as part of filesystem, filtered using a string
>
>        Key: ARROW-15943
>        URL: https://issues.apache.org/jira/browse/ARROW-15943
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
>   Reporter: Nicola Crane
>   Priority: Major
>
> There is a report from a user (see this Stack Overflow post [1]) who has used the {{basename_template}} parameter to write files to a dataset, some of which have the prefix {{"summary"}} and others of which have the prefix {{"prediction"}}. This data is saved in partitioned directories. They want to be able to read the data back in such that, as well as the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read back in.
> This isn't currently possible; if they try to open a dataset with a list of files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.
>
> [1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577321#comment-17577321 ]

Nicola Crane commented on ARROW-15943:
--------------------------------------
There is more user interest in implementing this feature: https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
[jira] [Updated] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Klein updated ARROW-17352:
---------------------------------
    Description: updated with the findings below.

Finally found:
* ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280)

So probably it's a feature, and we need to adapt our code.
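The workaround from the report can be sketched with the pyarrow API directly: pinning `version="1.0"` at write time keeps the file readable by tools without RLE_DICTIONARY support. A hedged sketch (guarded import, in case pyarrow is not installed where this runs; `write_compat` is an illustrative name):

```python
# Pin the Parquet format version at write time so older readers that lack
# RLE_DICTIONARY support can still open the file. Sketch only.
try:
    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_compat(table, path):
        # version="1.0" restores the pre-9.0.0 default format version,
        # avoiding encodings that e.g. older .NET readers cannot decode
        pq.write_table(table, path, version="1.0")
except ImportError:
    pa = None  # pyarrow unavailable here: nothing to demonstrate
```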
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577326#comment-17577326 ]

Nicola Crane commented on ARROW-15943:
--------------------------------------
There is more user interest in implementing this feature: https://stackoverflow.com/questions/73283669/r-arrowread-datasetmyfolder-only-read-some-folders-files
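Until something like this lands natively, the filtering can be done on the user side before handing a file list to the dataset API. A stdlib-only sketch (file names are illustrative, matching the summary-/prediction- layout from the report):

```python
# User-side sketch of the requested behaviour: narrow an explicit file
# list by a glob on the basename before opening it as a dataset.
import fnmatch
import posixpath

files = [
    "year=2021/summary-part-0.parquet",
    "year=2021/prediction-part-0.parquet",
    "year=2022/summary-part-0.parquet",
]

def select(paths, pattern):
    return [p for p in paths if fnmatch.fnmatch(posixpath.basename(p), pattern)]

print(select(files, "summary*"))
# ['year=2021/summary-part-0.parquet', 'year=2022/summary-part-0.parquet']
```

The downside remains exactly the gap the ticket describes: opening a dataset from an explicit file list loses partition inference.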
[jira] [Commented] (ARROW-17320) [Python] Refine pyarrow.parquet API exposure
[ https://issues.apache.org/jira/browse/ARROW-17320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577330#comment-17577330 ]

Joris Van den Bossche commented on ARROW-17320:
-----------------------------------------------
On second thought, I think it is maybe fine to just "break" this in a next release by explicitly defining the {{__all__}}. If someone runs into it, it is easy to fix in your code. But starting to add deprecation warnings for those sounds quite onerous for the value it would provide. (We only need to remember to add {{_filters_to_expression}} to the list, since that is used by other projects.)

> [Python] Refine pyarrow.parquet API exposure
>
>        Key: ARROW-17320
>        URL: https://issues.apache.org/jira/browse/ARROW-17320
>    Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
>   Reporter: Miles Granger
>   Priority: Major
>
> Spawning from ARROW-17106: moving code from `pyarrow/parquet/__init__` to `pyarrow/parquet/core` and re-exporting it in `__init__` to maintain the same functionality.
> [pyarrow/__init__.py|https://github.com/apache/arrow/blob/master/python/pyarrow/__init__.py] is very careful about what is exposed through the public API by prefixing private symbols with underscores, even imports.
> What's exposed at the top level of {{pyarrow.parquet}}, however, is not so careful. API calls such as {{pq.FileSystem}}, {{pq.pa.Array}}, and {{pq.json}} are all valid, and these should probably be designated as private attributes in {{pyarrow.parquet}}.
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577332#comment-17577332 ] H. Vetinari commented on ARROW-17110: - Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. 
> Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577332#comment-17577332 ] H. Vetinari edited comment on ARROW-17110 at 8/9/22 10:10 AM: -- Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. was (Author: h-vetinari): Just FYI, conda-forge [now|https://github.com/conda-forge/abseil-cpp-feedstock/pull/35] provides static builds of C++11/C++14 as "escape hatches" for packages that cannot yet use the C++17 dynamic libs. This takes the heat off a little bit - i.e. it allows packages to move at their own speed w.r.t. C++, as opposed to forcing a conda-forge-wide choice for abseil - but note that the next abseil version will still drop C++11 compatibility, so a move to at least C++14 will still be necessary in the near-ish future. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. 
This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577335#comment-17577335 ] Antoine Pitrou commented on ARROW-17110: [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the _whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. 
> Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577335#comment-17577335 ] Antoine Pitrou edited comment on ARROW-17110 at 8/9/22 10:18 AM: - [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the _whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. was (Author: pitrou): [~h-vetinari] The Abseil discussion is not very interesting IMHO, because it's possible to require C++17 only for GCS-enabled builds. The important issue is about moving away from C++11 for the_ whole codebase_, i.e. adopt C++17 features in Arrow C++ itself, not just have an optional dependency which requires it. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). 
For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577342#comment-17577342 ] H. Vetinari commented on ARROW-17110: - Sure, I was only commenting from the POV of abseil; I was not aware how deeply enmeshed (or not) this is with the rest of arrow. If you can move the parts depending on abseil to C++14 separately (and presumably not build them for various older runtimes), then there's less urgency. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > _eventually_, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on Windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and Google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seem to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C++11 while conda-forge moved to C++17 - at least on Unix, but Windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17252) [R] Intermittent valgrind failure
[ https://issues.apache.org/jira/browse/ARROW-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-17252: --- Assignee: Dewey Dunnington > [R] Intermittent valgrind failure > - > > Key: ARROW-17252 > URL: https://issues.apache.org/jira/browse/ARROW-17252 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Time Spent: 12.5h > Remaining Estimate: 0h > > A number of recent nightly builds have intermittent failures with valgrind, > which fails because of possibly leaked memory around an exec plan. This seems > related to a change in XXX that separated {{ExecPlan_prepare()}} from > {{ExecPlan_run()}} and added a {{ExecPlan_read_table()}} that uses > {{RunWithCapturedR()}}. The reported leaks vary but include ExecPlans and > ExecNodes and fields of those objects. > A failed run: > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30310&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=24980 > Some example output: > {noformat} > ==5249== 14,112 (384 direct, 13,728 indirect) bytes in 1 blocks are > definitely lost in loss record 1,988 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10B2902B: > std::_Function_handler > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&), > arrow::compute::internal::RegisterAggregateNode(arrow::compute::ExecFactoryRegistry*)::{lambda(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)#1}>::_M_invoke(std::_Any_data const&, arrow::compute::ExecPlan*&&, > std::vector std::allocator >&&, > arrow::compute::ExecNodeOptions const&) (exec_plan.h:60) > ==5249==by 0xFA83A0C: > std::function > (arrow::compute::ExecPlan*, std::vector 
std::allocator >, arrow::compute::ExecNodeOptions > const&)>::operator()(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&) const (std_function.h:622) > ==5249== 14,528 (160 direct, 14,368 indirect) bytes in 1 blocks are > definitely lost in loss record 1,989 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10096CB7: arrow::FutureImpl::Make() (future.cc:187) > ==5249==by 0xFCB6F9A: arrow::Future::Make() > (future.h:420) > ==5249==by 0x101AE927: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AE927: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249== > ==5249== 36,322 (416 direct, 35,906 indirect) bytes in 1 blocks are > definitely lost in loss record 2,929 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10214F92: arrow::compute::TaskScheduler::Make() > (task_util.cc:421) > ==5249==by 0x101AEA6C: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AEA6C: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: 
R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > {noformat} > We also occasionally get leaked Schemas, and in one case a leaked InputType > that seemed completely unrelated to the other leaks (ARROW-17225). > I'm wondering if these have to do with references in lambdas that get passed > by reference? Or perhaps a cache issue? There were some instances in previous > leaks where the backtrace to the {{new}} allocator was different between > reported leaks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17252) [R] Intermittent valgrind failure
[ https://issues.apache.org/jira/browse/ARROW-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-17252. - Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13773 [https://github.com/apache/arrow/pull/13773] > [R] Intermittent valgrind failure > - > > Key: ARROW-17252 > URL: https://issues.apache.org/jira/browse/ARROW-17252 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 12h 40m > Remaining Estimate: 0h > > A number of recent nightly builds have intermittent failures with valgrind, > which fails because of possibly leaked memory around an exec plan. This seems > related to a change in XXX that separated {{ExecPlan_prepare()}} from > {{ExecPlan_run()}} and added a {{ExecPlan_read_table()}} that uses > {{RunWithCapturedR()}}. The reported leaks vary but include ExecPlans and > ExecNodes and fields of those objects. 
> A failed run: > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30310&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=24980 > Some example output: > {noformat} > ==5249== 14,112 (384 direct, 13,728 indirect) bytes in 1 blocks are > definitely lost in loss record 1,988 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10B2902B: > std::_Function_handler > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&), > arrow::compute::internal::RegisterAggregateNode(arrow::compute::ExecFactoryRegistry*)::{lambda(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)#1}>::_M_invoke(std::_Any_data const&, arrow::compute::ExecPlan*&&, > std::vector std::allocator >&&, > arrow::compute::ExecNodeOptions const&) (exec_plan.h:60) > ==5249==by 0xFA83A0C: > std::function > (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&)>::operator()(arrow::compute::ExecPlan*, > std::vector std::allocator >, arrow::compute::ExecNodeOptions > const&) const (std_function.h:622) > ==5249== 14,528 (160 direct, 14,368 indirect) bytes in 1 blocks are > definitely lost in loss record 1,989 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10096CB7: arrow::FutureImpl::Make() (future.cc:187) > ==5249==by 0xFCB6F9A: arrow::Future::Make() > (future.h:420) > ==5249==by 0x101AE927: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AE927: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 
0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249== > ==5249== 36,322 (416 direct, 35,906 indirect) bytes in 1 blocks are > definitely lost in loss record 2,929 of 3,883 > ==5249==at 0x4849013: operator new(unsigned long) (in > /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) > ==5249==by 0x10214F92: arrow::compute::TaskScheduler::Make() > (task_util.cc:421) > ==5249==by 0x101AEA6C: ExecPlanImpl (exec_plan.cc:50) > ==5249==by 0x101AEA6C: > arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, > std::shared_ptr) (exec_plan.cc:355) > ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45) > ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868) > ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601) > ==5249==by 0x49C2C16: bcEval (eval.c:7682) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > ==5249==by 0x49A0904: R_execClosure (eval.c:1918) > ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844) > ==5249==by 0x49B2122: bcEval (eval.c:7094) > ==5249==by 0x499DB95: Rf_eval (eval.c:748) > {noformat} > We also occasionally get leaked Schemas, and in one case a leaked InputType > that seemed completely unrelated to the other leaks (ARROW-17225). > I'm wondering if these have to do with references in lambdas that get passed > by reference? Or perhaps a cache issue? There were some instances in previous > leaks where the backtrace to the
{{new}} allocator was different between reported leaks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-13763) [Python] Files opened for read with pyarrow.parquet are not explicitly closed
[ https://issues.apache.org/jira/browse/ARROW-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Granger reassigned ARROW-13763: - Assignee: Miles Granger (was: Alessandro Molina) > [Python] Files opened for read with pyarrow.parquet are not explicitly closed > - > > Key: ARROW-13763 > URL: https://issues.apache.org/jira/browse/ARROW-13763 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 5.0.0 > Environment: fsspec 2021.4.0 >Reporter: Richard Kimoto >Assignee: Miles Granger >Priority: Major > Fix For: 10.0.0 > > Attachments: test.py > > > It appears that files opened for read using pyarrow.parquet.read_table (and > therefore pyarrow.parquet.ParquetDataset) are not explicitly closed. > This seems to be the case for both use_legacy_dataset=True and False. The > files don't remain open at the OS level (verified using lsof). They do, > however, seem to rely on the Python gc to close. > My use case is that I'd like to use a custom fsspec filesystem that > interfaces to an S3-like API. It handles the remote download of the parquet > file and passes to pyarrow a handle of a temporary file downloaded locally. > It is then looking for an explicit close() or __exit__() to then clean up the > temp file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
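The contract the reporter is asking for can be sketched with an ordinary context manager: the wrapper materializes a local temporary copy and deletes it on an explicit `close()`/`__exit__()` instead of waiting for the garbage collector. The `TempCopyFile` class below is a hypothetical stand-in for such a custom fsspec filesystem, not pyarrow's actual behavior:

```python
import os
import tempfile

class TempCopyFile:
    """Hypothetical file wrapper: materializes remote bytes into a local
    temp file and removes that file on an *explicit* close()/__exit__(),
    instead of relying on garbage collection."""

    def __init__(self, remote_bytes):
        # Simulate the "remote download" step: write the bytes locally.
        fd, self.local_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(remote_bytes)
        self._handle = open(self.local_path, "rb")

    def read(self):
        return self._handle.read()

    def close(self):
        # Deterministic cleanup: close the handle, then delete the copy.
        if self._handle is not None:
            self._handle.close()
            self._handle = None
            os.remove(self.local_path)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Used as `with TempCopyFile(data) as f: f.read()`, the temporary copy is guaranteed to be gone when the block exits — which is exactly the hook a reader would need to call for this use case to work.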
[jira] [Updated] (ARROW-17332) [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17332: - Summary: [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow (was: [R package] error parsing folder path with accent ('c:/Público') in read_csv_arrow) > [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow > -- > > Key: ARROW-17332 > URL: https://issues.apache.org/jira/browse/ARROW-17332 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > > I am a user trying the R arrow package on a Windows machine. > To reproduce, create a folder name containing a character with Latin accents > ``` > library(arrow) > p <- 'c:/Público' > b <- read_csv_arrow(p) > Error: IOError: Failed to open local file 'c:/Público'. Detail: [Windows > error 5] Access is denied. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17332) [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17332: - Component/s: R > [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow > -- > > Key: ARROW-17332 > URL: https://issues.apache.org/jira/browse/ARROW-17332 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lucas Mation >Priority: Major > > I am a user trying the R arrow package on a Windows machine. > To reproduce, create a folder name containing a character with Latin accents > ``` > library(arrow) > p <- 'c:/Público' > b <- read_csv_arrow(p) > Error: IOError: Failed to open local file 'c:/Público'. Detail: [Windows > error 5] Access is denied. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17331) [R] path with accent (ex: c:/Público)
[ https://issues.apache.org/jira/browse/ARROW-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17331. Resolution: Duplicate > [R] path with accent (ex: c:/Público) > > > Key: ARROW-17331 > URL: https://issues.apache.org/jira/browse/ARROW-17331 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13763) [Python] Files opened for read with pyarrow.parquet are not explicitly closed
[ https://issues.apache.org/jira/browse/ARROW-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13763: --- Labels: pull-request-available (was: ) > [Python] Files opened for read with pyarrow.parquet are not explicitly closed > - > > Key: ARROW-13763 > URL: https://issues.apache.org/jira/browse/ARROW-13763 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 5.0.0 > Environment: fsspec 2021.4.0 >Reporter: Richard Kimoto >Assignee: Miles Granger >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Attachments: test.py > > Time Spent: 10m > Remaining Estimate: 0h > > It appears that files opened for read using pyarrow.parquet.read_table (and > therefore pyarrow.parquet.ParquetDataset) are not explicitly closed. > This seems to be the case for both use_legacy_dataset=True and False. The > files don't remain open at the OS level (verified using lsof). They do, > however, seem to rely on the Python gc to close. > My use case is that I'd like to use a custom fsspec filesystem that > interfaces to an S3-like API. It handles the remote download of the parquet > file and passes to pyarrow a handle of a temporary file downloaded locally. > It is then looking for an explicit close() or __exit__() to then clean up the > temp file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17312) [R] R session aborts when using dplyr::filter after setting as_data_frame = FALSE in arrow::read_csv_arrow
[ https://issues.apache.org/jira/browse/ARROW-17312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17312. Resolution: Not A Bug > [R] R session aborts when using dplyr::filter after setting as_data_frame = > FALSE in arrow::read_csv_arrow > -- > > Key: ARROW-17312 > URL: https://issues.apache.org/jira/browse/ARROW-17312 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 >Reporter: Neil Currie >Priority: Major > > R session aborts / encounters fatal error when using dplyr::filter after > setting as_data_frame = FALSE in arrow::read_csv_arrow. > > Version info: > platform [1] "x86_64-apple-darwin17.0" > R version 4.2.0 (2022-04-22) > Running on MacBook Air, macOS Monterey v12.4, Apple M1 chip > > Reproducible example: > > {{if (!require(pacman)) install.packages("pacman")}} > {{library(pacman)}} > {{p_load(arrow, dplyr, readr)}} > > {{write_csv(starwars, "starwars.csv")}} > > {{read_csv_arrow("starwars.csv", as_data_frame = FALSE) |> filter(height >= 150)}} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17279) [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...):
[ https://issues.apache.org/jira/browse/ARROW-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17279: - Summary: [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...): (was: Error: package or namespace load failed for ‘arrow’ in inDL(x, as.logical(local), as.logical(now), ...):) > [R] Error: package or namespace load failed for ‘arrow’ in inDL(x, > as.logical(local), as.logical(now), ...): > > > Key: ARROW-17279 > URL: https://issues.apache.org/jira/browse/ARROW-17279 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.1 > Environment: R version 4.1.0 (2021-05-18) >Reporter: Cristian Adir Cardona Merchan >Priority: Major > Labels: R, arrow, install, library > Fix For: 8.0.1 > > Attachments: rstudio_FYbWALFFSG.jpg > > > I am running R version 4.1.0 (2021-05-18). After installing the arrow > package, I get the error below when I load it; I have also installed > Rtools 4.0 matching the R version. > Error message: > > {{> library(arrow)}} > Error: package or namespace load failed for ‘arrow’ in inDL(x, > as.logical(local), as.logical(now), ...): unable to load shared object > 'C:/Users/el_ki/Documents/R/win-library/4.1/arrow/libs/x64/arrow.dll': > LoadLibrary failure: Error en una rutina de inicialización de biblioteca de > vínculos dinámicos (DLL). [English: A dynamic link library (DLL) initialization routine failed.] 
> {quote}sessionInfo() R version 4.1.0 (2021-05-18) Platform: > x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044) > {quote} > Matrix products: default > locale: [1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 > LC_MONETARY=Spanish_Colombia.1252 [4] LC_NUMERIC=C > LC_TIME=Spanish_Colombia.1252 > attached base packages: [1] stats graphics grDevices utils datasets methods > base > loaded via a namespace (and not attached): [1] tidyselect_1.1.2 bit_4.0.4 > compiler_4.1.0 magrittr_2.0.3 assertthat_0.2.1 R6_2.5.1 > [7] cli_3.3.0 tools_4.1.0 glue_1.6.2 rstudioapi_0.13 bit64_4.0.5 vctrs_0.4.1 > [13] data.table_1.14.2 rlang_1.0.4 purrr_0.3.4 > Thank you very much for your time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17208) [R] Removing files after reading them in R
[ https://issues.apache.org/jira/browse/ARROW-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17208: - Summary: [R] Removing files after reading them in R (was: Removing files after reading them in R) > [R] Removing files after reading them in R > -- > > Key: ARROW-17208 > URL: https://issues.apache.org/jira/browse/ARROW-17208 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.1 >Reporter: Wytze Gelderloos >Priority: Minor > > In R it's not possible to delete the files even though the dataframe is > cleared from the R environment. > > write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) > df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", > delimiter = ",")) > df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() > ## Do some checks on df. > rm(df) > file.remove("mtcars.csv") > The `file.remove` leads to a Permission Denied error even though the dataframe > is cleared from the R environment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17208) [R] Removing files after reading them in R
[ https://issues.apache.org/jira/browse/ARROW-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17208: - Description: In R it's not possible to delete the files eventhough the dataframe is cleared from the R environment. {code:r} write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", delimiter = ",")) df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() ## Do some checks on df. rm(df) file.remove("mtcars.csv") {code} The `file.remove` leads to a Permission Denied error eventhough the dataframe is cleared from the R environment. was: In R it's not possible to delete the files eventhough the dataframe is cleared from the R environment. write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", delimiter = ",")) df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() ## Do some checks on df. rm(df) file.remove("mtcars.csv") The `file.remove` leads to a Permission Denied error eventhough the dataframe is cleared from the R environment. > [R] Removing files after reading them in R > -- > > Key: ARROW-17208 > URL: https://issues.apache.org/jira/browse/ARROW-17208 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.1 >Reporter: Wytze Gelderloos >Priority: Minor > > In R it's not possible to delete the files eventhough the dataframe is > cleared from the R environment. > {code:r} > write.csv(mtcars, file = "mtcars.csv", quote = FALSE, row.names = FALSE) > df <- arrow::to_duckdb(arrow::open_dataset("mtcars.csv", format = "csv", > delimiter = ",")) > df <- df %>% select(c("mpg", "disp", "drat", "wt")) %>% collect() > ## Do some checks on df. 
> rm(df) > file.remove("mtcars.csv") > {code} > The `file.remove` leads to a Permission Denied error even though the dataframe > is cleared from the R environment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
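The Permission Denied error is consistent with the dataset still holding an open handle on the CSV file when `file.remove()` runs. A minimal stdlib-only Python sketch of the same mechanism (the filename is illustrative; on Windows, deleting a file that another object still has open fails with a permission error, while POSIX systems allow it — releasing the handle first works everywhere):

```python
import os
import tempfile

# Create a file and keep a handle open, mimicking a dataset object that
# has not yet released the underlying file.
path = os.path.join(tempfile.mkdtemp(), "mtcars.csv")
with open(path, "w") as f:
    f.write("mpg,disp,drat,wt\n21,160,3.9,2.62\n")

handle = open(path)          # a reader still holds the file open
# On Windows, os.remove(path) at this point would raise PermissionError.
# Dropping the reader first releases the handle on every platform:
handle.close()
os.remove(path)
print(os.path.exists(path))  # False once the handle is closed
```

In the R report, `rm(df)` removes the binding but does not necessarily free the underlying dataset/duckdb objects immediately, so the handle can outlive the variable.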
[jira] [Created] (ARROW-17353) [Release] R libarrow binaries have the wrong version number
Jacob Wujciak-Jens created ARROW-17353: -- Summary: [Release] R libarrow binaries have the wrong version number Key: ARROW-17353 URL: https://issues.apache.org/jira/browse/ARROW-17353 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Affects Versions: 9.0.0 Reporter: Jacob Wujciak-Jens Fix For: 10.0.0 The libarrow binaries that are uploaded during the release process have the wrong version number. This is an issue with the submit binaries script/r-binary-packages job. The arrow version should be picked up by the job even if not passed explicitly as a custom param. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17113) [Java] All static initializers should catch and report exceptions
[ https://issues.apache.org/jira/browse/ARROW-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17113. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13678 [https://github.com/apache/arrow/pull/13678] > [Java] All static initializers should catch and report exceptions > - > > Key: ARROW-17113 > URL: https://issues.apache.org/jira/browse/ARROW-17113 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: good-first-issue, good-second-issue, > pull-request-available > Fix For: 10.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > As reported on the mailing list: > https://lists.apache.org/thread/gysn25gsm4v1fvvx9l0sjyr627xy7q65 > All static initializers should catch and report exceptions, or else they will > get swallowed by the JVM. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17354) [C++] ARROW_ONLY_LINT should not require xsimd
Antoine Pitrou created ARROW-17354: -- Summary: [C++] ARROW_ONLY_LINT should not require xsimd Key: ARROW-17354 URL: https://issues.apache.org/jira/browse/ARROW-17354 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou When I try to configure Arrow with {{-DARROW_ONLY_LINT=on}} passed to CMake, I still get the following error: {code:java} CMake Error at cmake_modules/ThirdpartyToolchain.cmake:267 (find_package): By not providing "Findxsimd.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "xsimd", but CMake did not find one. Could not find a package configuration file provided by "xsimd" (requested version 8.1.0) with any of the following names: xsimdConfig.cmake xsimd-config.cmake Add the installation prefix of "xsimd" to CMAKE_PREFIX_PATH or set "xsimd_DIR" to a directory containing one of the above files. If "xsimd" provides a separate development package or SDK, be sure it has been installed. Call Stack (most recent call first): cmake_modules/ThirdpartyToolchain.cmake:2245 (resolve_dependency) CMakeLists.txt:575 (include) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
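The error shows {{resolve_dependency}} being reached even though only lint targets were requested. One possible shape of the fix, sketched here purely as an assumption (the {{ARROW_ONLY_LINT}} option and the {{resolve_dependency}} helper appear in the error output above, but this guard is hypothetical, not the actual patch):

```cmake
# Sketch for cmake_modules/ThirdpartyToolchain.cmake: skip resolving xsimd
# (and other build-only dependencies) when only lint targets are configured.
# Wiring ARROW_ONLY_LINT and resolve_dependency() together like this is a
# hypothetical illustration; real arguments to resolve_dependency are elided.
if(NOT ARROW_ONLY_LINT)
  resolve_dependency(xsimd)
endif()
```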
[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests
[ https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577386#comment-17577386 ] Ben Kietzman commented on ARROW-17093: -- ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well > [C++][CI] Enable libSegFault for C++ tests > -- > > Key: ARROW-17093 > URL: https://issues.apache.org/jira/browse/ARROW-17093 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: David Li >Priority: Major > > Adding libSegFault.so could make it easier to diagnose CI failures. It will > print a backtrace on segfault. 
> {noformat} > env SEGFAULT_SIGNALS=all \ > LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so > {noformat} > This will give a backtrace like this on segfault: > {noformat} > Backtrace: > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b] > /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859] > /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e] > /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc] > /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83] > /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67] > {noformat} > Caveats: > * The path is OS-specific > * We could integrate it into the build tooling instead of doing it via env > var > * Are there easily accessible equivalents for MacOS and Windows we could use? -- This message was sent by Atlassian Jira (v8.20.10#820010)
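Each frame printed by libSegFault follows the pattern `object(symbol+0xoffset)[0xaddress]`, with the symbol omitted when none is available. A hedged sketch of parsing these lines for CI post-processing (for example, to feed each mangled symbol to `c++filt`); the regex is an assumption derived from the sample output above, not part of glibc:

```python
import re

# libSegFault frames look like: /path/to/obj(symbol+0xoff)[0xaddr]
# The "symbol+0xoff" part may be just "+0xoff" for unexported functions.
FRAME_RE = re.compile(
    r"^(?P<obj>[^(]+)"                 # object file path
    r"\((?P<sym>[^+)]*)"               # mangled symbol, possibly empty
    r"(?:\+(?P<off>0x[0-9a-f]+))?\)"   # offset into the symbol
    r"\[(?P<addr>0x[0-9a-f]+)\]$"      # absolute address
)

def parse_frame(line):
    """Return {'obj', 'sym', 'off', 'addr'} for one backtrace line, or None."""
    m = FRAME_RE.match(line.strip())
    return m.groupdict() if m else None

frame = parse_frame(
    "/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]")
print(frame["sym"], frame["off"])  # gsignal 0xcb
```

From there, the `sym` field of each frame could be demangled with `c++filt` or resolved to a source line with `addr2line`, both standard binutils tools.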
[jira] [Comment Edited] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests
[ https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577386#comment-17577386 ] Ben Kietzman edited comment on ARROW-17093 at 8/9/22 12:26 PM: --- ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well was (Author: bkietz): ... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing: - When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. 
However if a signal is known to be fatal, the OS can shut threads down more aggressively- this means we can get a less out-of-date traces from the threads which *didn't* segfault than we can with interthread signals - We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well > [C++][CI] Enable libSegFault for C++ tests > -- > > Key: ARROW-17093 > URL: https://issues.apache.org/jira/browse/ARROW-17093 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: David Li >Priority: Major > > Adding libSegFault.so could make it easier to diagnose CI failures. It will > print a backtrace on segfault. > {noformat} > env SEGFAULT_SIGNALS=all \ > LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so > {noformat} > This will give a backtrace like this on segfault: > {noformat} > Backtrace: > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b] > /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859] > /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e] > /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc] > /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff] > /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83] > /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde] > 
/tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67] > {noformat} > Caveats: > * The path is OS-specific > * We could integrate it into the build tooling instead of doing it via env > var > * Are there easily accessible equivalents for MacOS and Windows we could use? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster commented on ARROW-17335: --- Agreeing with you Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to treat them separately for now and start off with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? 
> When we run mypy on code that uses PyArrow, you will get error messages as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it is no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options: it is 3, as this has > quickly become the industry standard since its introduction. > h2. What should we do? > I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 for type hints? 
> # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
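For option 2 above, a stub is an ordinary `.pyi` file that mirrors the public API with annotations but no bodies. A minimal sketch of what such a stub looks like (every module name and signature below is a hypothetical placeholder, not the real pyarrow surface); since stub files use regular Python syntax, the standard `ast` parser accepts them:

```python
import ast

# Hypothetical contents of a hand-written stub file, e.g. mylib/__init__.pyi.
# All names and signatures here are illustrative only.
STUB_SOURCE = """\
from typing import Any

def open_dataset(source: str, format: str = ...) -> Table: ...

class Table:
    num_rows: int
    def to_pydict(self) -> dict[str, list[Any]]: ...
"""

# Stub files are plain Python syntax with `...` bodies, so they parse cleanly.
tree = ast.parse(STUB_SOURCE)
print(len(tree.body))  # 3 top-level statements: import, function, class
```

Option 3 (inline hints plus a `py.typed` marker file) needs no stub at all, which is why the comment above weighs it against what Cython can currently emit.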
[jira] [Comment Edited] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster edited comment on ARROW-17335 at 8/9/22 12:32 PM: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for now and start of with stub generation, which can then later be replaced by a better implementation once available. was (Author: JIRAUSER294085): I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. 
It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? 
> I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577392#comment-17577392 ] Jorrick Sleijster edited comment on ARROW-17335 at 8/9/22 12:32 PM: I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. was (Author: JIRAUSER294085): Agreeing with you Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types. I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation. Hence, I think it's better to threat them separate for new and start of with stub generation, which can then later be replaced by a better implementation once available. > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. 
It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? 
> I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience
Nicola Crane created ARROW-17355: Summary: [R] Refactor the handle_* utility functions for a better dev experience Key: ARROW-17355 URL: https://issues.apache.org/jira/browse/ARROW-17355 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In ARROW-15260, the utility functions for handling different kinds of reading errors (handle_parquet_io_error, handle_csv_read_error, and handle_augmented_field_misuse) were refactored so that multiple ones could be chained together. An issue with this is that other errors may be swallowed if they're used without any errors that they don't capture being raised manually afterwards. We should update the code to prevent this from being possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17356) [R] Update binding for add_filename() NSE function to error if used on Table
Nicola Crane created ARROW-17356: Summary: [R] Update binding for add_filename() NSE function to error if used on Table Key: ARROW-17356 URL: https://issues.apache.org/jira/browse/ARROW-17356 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-15260 adds a function which allows the user to add the filename as an output field. This function only makes sense to use with datasets and not tables. Currently, the error generated from using it with a table is handled by {{handle_augmented_field_misuse()}}. Instead, we should follow [one of the suggestions from the PR|https://github.com/apache/arrow/pull/12826#issuecomment-1192007298] to detect this when the function is called. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17349) [C++] Support casting field names of list and map when nested
[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-17349: - Labels: good-first-issue kernel (was: ) > [C++] Support casting field names of list and map when nested > - > > Key: ARROW-17349 > URL: https://issues.apache.org/jira/browse/ARROW-17349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 9.0.0 >Reporter: Will Jones >Priority: Major > Labels: good-first-issue, kernel > Fix For: 10.0.0 > > > Different parquet implementations use different field names for internal > fields of ListType and MapType, which can sometimes cause silly conflicts. > For example, we use {{item}} as the field name for list, but Spark uses > {{element}}. Fortunately, we can automatically cast between List and Map > Types with different field names. Unfortunately, it only works at the top > level. We should get it to work at arbitrary levels of nesting. > This was discovered in delta-rs: > https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285 > Here's a reproduction in Python: > {code:Python} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > def roundtrip_scanner(in_arr, out_type): > table = pa.table({"arr": in_arr}) > pq.write_table(table, "test.parquet") > schema = pa.schema({"arr": out_type}) > ds.dataset("test.parquet", schema=schema).to_table() > # MapType > ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32()) > ty = pa.map_(pa.int32(), pa.int32()) > arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # ListType > ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False)) > ty = pa.list_(pa.int32()) > arr_named = pa.array([[1, 2, 4]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Combination MapType and ListType > ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", > pa.int32(), nullable=True)), nullable=False)) > ty = 
pa.map_(pa.string(), pa.list_(pa.int32())) > arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named) > roundtrip_scanner(arr_named, ty) > # Traceback (most recent call last): > # File "", line 1, in > # File "", line 5, in roundtrip_scanner > # File "pyarrow/_dataset.pyx", line 331, in > pyarrow._dataset.Dataset.to_table > # File "pyarrow/_dataset.pyx", line 2577, in > pyarrow._dataset.Scanner.to_table > # File "pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > # File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status > # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map list> from map ('arr')> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577414#comment-17577414 ] Kouhei Sutou commented on ARROW-17110: -- I think that we can switch to C\+\+14 or C\+\+17, because it seems that we can mix a binary built with the default g\+\+ and a binary built with the devtoolset's g\+\+ in the same process on CentOS 7. I am thinking of the following 2 cases: 1. Build Arrow with the devtoolset's g\+\+ and use the built Arrow as a library for a C++ program that is built with the default g\+\+. 2. {{dlopen()}} Arrow built with the devtoolset's g\+\+ and a library built with the default g\+\+ in the same process. 1. is meaningless for us as Antoine said. Sorry. 2. may happen with Ruby. For example, {{ruby -r red-arrow -r unf_ext -e 'nil'}}. ({{unf_ext}} is one of the Ruby libraries that use C++.) It seems that 2. works too. So I think that we can switch to C\+\+14 or C\+\+17. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > {_}eventually{_}, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). 
For now, things seems to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C\+\+11 while conda-forge moved to C\+\+17 - at least on unix, but windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577430#comment-17577430 ] Kouhei Sutou commented on ARROW-17329: -- Sorry... I gave the wrong CMake variable name... Could you use {{-DCMAKE_FIND_DEBUG_MODE=ON}} instead of {{-DCMAKE_FIND_PACKAGE_DEBUG=ON}}? > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? 
> * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17357) [CI][Conan] Enable JSON
Kouhei Sutou created ARROW-17357: Summary: [CI][Conan] Enable JSON Key: ARROW-17357 URL: https://issues.apache.org/jira/browse/ARROW-17357 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17357) [CI][Conan] Enable JSON
[ https://issues.apache.org/jira/browse/ARROW-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17357: --- Labels: pull-request-available (was: ) > [CI][Conan] Enable JSON > --- > > Key: ARROW-17357 > URL: https://issues.apache.org/jira/browse/ARROW-17357 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou commented on ARROW-17329: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} {noformat} > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? 
> * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou edited comment on ARROW-17329 at 8/9/22 2:38 PM: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. was (Author: kou): It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} {noformat} > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? 
> ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577458#comment-17577458 ] Kouhei Sutou edited comment on ARROW-17329 at 8/9/22 2:38 PM: -- It seems that Alpine Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. was (Author: kou): It seems that Alpin Linux edge's {{libzstd.pc}} is broken: {noformat} $ cat /usr/lib/pkgconfig/libzstd.pc # ZSTD - standard compression algorithm # Copyright (C) 2014-2016, Yann Collet, Facebook # BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) prefix=/usr/local exec_prefix=${prefix} includedir=${prefix}/include libdir=${exec_prefix}/lib Name: zstd Description: fast lossless compression algorithm library URL: http://www.zstd.net/ Version: 1.5.2 Libs: -L${libdir} -lzstd Libs.private: -pthread Cflags: -I${includedir} {noformat} It uses {{/usr/local}} as prefix. > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? 
> ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17299) [C++] [Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters
[ https://issues.apache.org/jira/browse/ARROW-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17299: --- Labels: pull-request-available (was: ) > [C++] [Python] Expose the Scanner kDefaultBatchReadahead and > kDefaultFragmentReadahead parameters > - > > Key: ARROW-17299 > URL: https://issues.apache.org/jira/browse/ARROW-17299 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Ziheng Wang >Assignee: Ziheng Wang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In the Scanner there are parameters kDefaultFragmentReadahead and > kDefaultBatchReadahead that are currently set to fixed numbers that cannot be > changed. > This is not great because tuning these numbers is the key to tradeoff RAM > usage and network IO utilization during reading. For example on an i3.2xlarge > instance on AWS you can get peak throughput only by quadrupling > kDefaultFragmentReadahead from the default. > The current settings are very conservative and assume a < 1Gbps network. > Exposing them allow people to tune the Scanner behavior to their own > hardware. -- This message was sent by Atlassian Jira (v8.20.10#820010)
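As a rough sketch of what exposing these knobs could look like (the names and default values below are illustrative stand-ins, not Arrow's actual C++ or Python API), tuning would then reduce to overriding a default instead of recompiling:

```python
from dataclasses import dataclass

@dataclass
class ScanOptions:
    # Hypothetical stand-ins for kDefaultBatchReadahead / kDefaultFragmentReadahead;
    # the real constants live in Arrow's C++ Scanner and may differ.
    batch_readahead: int = 32
    fragment_readahead: int = 4

# On a high-bandwidth instance, quadrupling the fragment readahead (as the
# reporter found necessary on i3.2xlarge) becomes a one-line change:
opts = ScanOptions(fragment_readahead=4 * ScanOptions.fragment_readahead)
print(opts.fragment_readahead)  # 16
```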
[jira] [Created] (ARROW-17358) [CI][C++] Add a job for Alpine Linux
Kouhei Sutou created ARROW-17358: Summary: [CI][C++] Add a job for Alpine Linux Key: ARROW-17358 URL: https://issues.apache.org/jira/browse/ARROW-17358 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17295) [C++] Build separate bundled_depenencies.so
[ https://issues.apache.org/jira/browse/ARROW-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones closed ARROW-17295. -- Resolution: Won't Fix > [C++] Build separate bundled_depenencies.so > --- > > Key: ARROW-17295 > URL: https://issues.apache.org/jira/browse/ARROW-17295 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.0, 8.0.1 >Reporter: Will Jones >Priority: Major > > When building arrow _static_ libraries with bundled dependencies, we produce > {{{}arrow_bundled_dependencies.a{}}}. But when building dynamic libraries, > the bundled dependencies are statically linked directly into the arrow > libraries (libarrow, libarrow_flight, etc.). This means that users can access > the symbols of bundled dependencies in the static case, but not in the > dynamic library case. > One use case of this is being able to pass in gRPC configuration to a Flight > server, which requires access to gRPC symbols. > Could we change the dynamic library building to build an > {{arrow_bundled_dependencies.so}} so that the symbols are accessible? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15943: Labels: dataset (was: ) > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > "{{{}prediction"{}}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17358) [CI][C++] Add a job for Alpine Linux
[ https://issues.apache.org/jira/browse/ARROW-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17358: --- Labels: pull-request-available (was: ) > [CI][C++] Add a job for Alpine Linux > > > Key: ARROW-17358 > URL: https://issues.apache.org/jira/browse/ARROW-17358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577474#comment-17577474 ] Weston Pace commented on ARROW-15943: - I could see a few different ways this could be implemented: * We could add support for exclusion / inclusion filters for dataset discovery. These could be regular expressions that are applied to the filenames to determine whether we should or should not include them. * We could do more to support custom partitioning functions. The user could then create their own partitioning which includes this part of the filename as a partitioning column. * We could (not sure if we support this today or not) make sure we support filtering based on the filename column. However, this approach has the downside of loading all the unwanted data into memory. Do any of those approaches seem more appealing than the others? > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > "{{{}prediction"{}}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. 
> A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
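The first approach listed in the comment above (inclusion/exclusion filters at discovery time) can already be approximated outside the library by pre-filtering the file list before constructing the dataset. A minimal plain-Python sketch; the actual ds.dataset() call is left as a comment because, per the issue, passing an explicit file list currently loses directory-based partition information:

```python
import os
import re

all_files = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

# Keep only the "summary" subset, mimicking an inclusion filter that would
# run during dataset discovery.
pattern = re.compile(r"^summary-")
summary_files = [f for f in all_files if pattern.match(os.path.basename(f))]
print(summary_files)

# ds.dataset(summary_files, ...)  # caveat: an explicit file list is not
#                                 # read back as partitioned data today
```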
[jira] [Commented] (ARROW-17329) Build fails on alpine linux for arrow 9.0.0, /usr/local/include in INTERFACE_INCLUDE_DIRECTORIES
[ https://issues.apache.org/jira/browse/ARROW-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577480#comment-17577480 ] Duncan Bellamy commented on ARROW-17329: Thanks for finding that; strange how it wasn't causing an error before, unless it was changed with an update. I will report that and try building after it's fixed. > Build fails on alpine linux for arrow 9.0.0, /usr/local/include in > INTERFACE_INCLUDE_DIRECTORIES > > > Key: ARROW-17329 > URL: https://issues.apache.org/jira/browse/ARROW-17329 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0 > Environment: alpine linux edge >Reporter: Duncan Bellamy >Priority: Blocker > Fix For: 9.0.1 > > > zstd can now only be found if ??{{-DZSTD_LIB=/usr/lib/libzstd.so}}?? is > passed to cmake > trying to compile 9.0.0.0 I now get the error: > {noformat} > ??-- Configuring done?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??CMake Error in src/arrow/dataset/CMakeLists.txt:?? > ??Imported target "zstd::libzstd" includes non-existent path?? > ??"/usr/local/include"?? > ??in its INTERFACE_INCLUDE_DIRECTORIES. 
Possible reasons include:?? > * ??The path was deleted, renamed, or moved to another location.?? > * ??An install or uninstall procedure did not complete successfully.?? > * ??The installation package was faulty and references files it does not?? > ??provide.?? > ??– Generating done?? > {noformat} > /usr/local/include does not exist in my build environment, or the builders > for alpine linux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17347) [C++][Docs] Describe limitations and alternatives for handling dependencies via package managers
[ https://issues.apache.org/jira/browse/ARROW-17347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577482#comment-17577482 ] Will Jones commented on ARROW-17347: We may also wish to mention the bundled dependencies, and explain why they are generally a last resort. See also discussion: https://lists.apache.org/thread/hgory6jkqg2vcqsw36635gsqcvkgk45z > [C++][Docs] Describe limitations and alternatives for handling dependencies > via package managers > > > Key: ARROW-17347 > URL: https://issues.apache.org/jira/browse/ARROW-17347 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Kae Suarez >Priority: Major > > In this page: [https://arrow.apache.org/docs/developers/cpp/building.html] it > is described how package managers can be used to get dependencies for Arrow. > My specific experience is with Apple Silicon on macOS, so I can describe my > experience there. A Brewfile is provided to assist getting necessary packages > to build Arrow, but it does not include all relevant packages to build the > most Arrow features possible on Mac (i.e., everything but CUDA). Diving > further, I learned that Homebrew cannot provide all necessary dependencies > for features such as GCS support, and had to turn to conda-forge. Upon doing > so, I started having overlapping dependencies elsewhere, and eventually had > to turn fully to conda. > It would be helpful to have the limitations laid out for what features can be > built with what package managers, as well as adding conda as an alternative > to Homebrew for macOS users, since that is necessary to build the fullest > Arrow possible without building other libraries from source as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-17338: --- Assignee: Todd Farmer > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Todd Farmer >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > We got an IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is that the following code in `BaseVariableWidthVector.handleSafe` > can fail to reallocate because of int overflow, which then leads to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 
0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
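The overflow described in the issue can be reproduced in miniature: with Java's 32-bit int arithmetic, startOffset + dataLength wraps to a negative value, so the valueBuffer.capacity() comparison passes and no reallocation happens. A sketch simulating the wraparound (the values are illustrative, chosen near the index in the stack trace; the usual fix is to widen the sum to 64 bits before comparing, e.g. (long) startOffset + dataLength in Java):

```python
import ctypes

def java_int_add(a, b):
    """Add two values with Java's 32-bit int wraparound semantics."""
    return ctypes.c_int32(a + b).value

start_offset = 2_147_312_542  # near Integer.MAX_VALUE, as in the report
data_length = 13 + 200_000    # illustrative: enough to push the sum past 2**31 - 1

wrapped = java_int_add(start_offset, data_length)
print(wrapped < 0)  # True: a negative sum is always < capacity(), so no realloc

# Widening before the addition keeps the comparison honest:
print(start_offset + data_length > 2**31 - 1)  # True
```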
[jira] [Updated] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17327: --- Labels: pull-request-available (was: ) > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17351) [C++][Compute] Support to initialize expression with a string
[ https://issues.apache.org/jira/browse/ARROW-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577503#comment-17577503 ] Weston Pace commented on ARROW-17351: - I think the first (more verbose) option is preferred because it will be more generic. However, once the first option is working, the second option can always be added later as an optional shortcut (and then both are supported). > [C++][Compute] Support to initialize expression with a string > - > > Key: ARROW-17351 > URL: https://issues.apache.org/jira/browse/ARROW-17351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.1 >Reporter: LinGeLin >Priority: Major > > I want to implement this functionality, and would first like to ask which > approach is better. > > For example, I want to initialize an expression whose content is > a210 - (a210 / 203) * 203 = 0 > This means that column A210 modulo 203 is equal to 0 > > How do these two approaches compare? > > "(subtract(a210, multiply(divide(a210, 203), 203)) == 0)" to Expression > or > "a210-(a210/203)*203==0" to Expression -- This message was sent by Atlassian Jira (v8.20.10#820010)
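As a sketch of what the second (infix) option involves, Python's stdlib ast module parses the same string into exactly the tree that the verbose form spells out by hand. This only illustrates the parsing problem; it is not Arrow's Expression API, and the to_calls helper below is purely hypothetical:

```python
import ast

src = "a210 - (a210 / 203) * 203 == 0"
tree = ast.parse(src, mode="eval").body

# The infix string parses to the same shape as the verbose form
# "(subtract(a210, multiply(divide(a210, 203), 203)) == 0)":
print(type(tree).__name__)          # Compare
print(type(tree.left).__name__)     # BinOp  (the subtract)
print(type(tree.left.op).__name__)  # Sub

def to_calls(node):
    """Render the AST back in the verbose call syntax (illustrative only)."""
    ops = {ast.Sub: "subtract", ast.Mult: "multiply", ast.Div: "divide"}
    if isinstance(node, ast.BinOp):
        return "%s(%s, %s)" % (ops[type(node.op)],
                               to_calls(node.left), to_calls(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise NotImplementedError(type(node).__name__)

print(to_calls(tree.left))  # subtract(a210, multiply(divide(a210, 203), 203))
```

This is why the function-call form is the more generic starting point: the infix form still has to be lowered to the same call tree, plus a grammar for precedence and parentheses.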
[jira] [Commented] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577524#comment-17577524 ] Joris Van den Bossche commented on ARROW-10739: --- [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome. I think option (1) is a good path forward. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577524#comment-17577524 ] Joris Van den Bossche edited comment on ARROW-10739 at 8/9/22 4:40 PM: --- [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome! I think option (1) is a good path forward. was (Author: jorisvandenbossche): [~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome. I think option (1) is a good path forward. > [Python] Pickling a sliced array serializes all the buffers > --- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Maarten Breddels >Assignee: Alessandro Molina >Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 74 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)
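The behaviour in the report can be mimicked without Arrow at all: a zero-copy slice that keeps a reference to its parent buffer pickles the whole buffer, while a compacted copy pickles only the window. This is a toy stand-in for Arrow's Array.slice, not pyarrow code; the compact() method here plays the role of the "copy the array to trim the buffers" workaround the reporter asks about:

```python
import pickle

class SlicedArray:
    """Toy analogue of an Arrow array: a buffer plus an (offset, length) view."""
    def __init__(self, buf, offset, length):
        self.buf, self.offset, self.length = buf, offset, length

    def slice(self, offset, length):
        # Zero-copy: the new view still references the full parent buffer.
        return SlicedArray(self.buf, self.offset + offset, length)

    def compact(self):
        # Copy and trim: drop the offset and keep only the referenced bytes.
        return SlicedArray(self.buf[self.offset:self.offset + self.length],
                           0, self.length)

arr = SlicedArray(b"foo" * 100_000, 0, 300_000)
view = arr.slice(30, 3)
print(len(pickle.dumps(view)) > 100_000)        # True: whole buffer serialized
print(len(pickle.dumps(view.compact())) < 500)  # True: only the 3-byte window
```

Option (1) mentioned in the comment would effectively make pickling do the compaction implicitly.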
[jira] [Commented] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577525#comment-17577525 ] Joris Van den Bossche commented on ARROW-17327: --- Out of curiosity, do you have an idea why this started to fail with pyarrow 8.0? (I am not aware of something we have changed regarding this, and pyarrow has been built against a parquet library for a long time) > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function
[ https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577531#comment-17577531 ] Joris Van den Bossche commented on ARROW-17327: --- Ah, actually, pyarrow itself has always been built against libparquet (and included this in the wheels), but for the arrow_python library itself this dependency was indeed introduced in 8.0 with ARROW-9947. > [Python] Parquet should be listed in PyArrow's get_libraries() function > --- > > Key: ARROW-17327 > URL: https://issues.apache.org/jira/browse/ARROW-17327 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Steven Silvester >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following > [failure| > https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true] > when building wheels: "@rpath/libparquet.800.dylib not found". > We overcame the error by explicitly adding "parquet" to the list of libraries > returned by {{get_libraries}}. I am happy to submit a PR. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones resolved ARROW-17352. Resolution: Not A Problem > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0 > - > > Key: ARROW-17352 > URL: https://issues.apache.org/jira/browse/ARROW-17352 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 9.0.0 > Environment: Windows10 >Reporter: Oliver Klein >Priority: Critical > Attachments: arrow9error.PNG > > > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0. It worked when stored with version 8 and earlier. > Windows Parquet Viewer: 2.3.5 and 2.3.6 > pyarrow version: 9.0.0 > Error: System.AggregateException: One or more errors occured. ---> > Parquet.ParquetException: encoding RLE_DICTIONARY is not supported. > at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in > DataColumnReader.cs: line 259 > > After further checking I found that it seems the problem seems to relate to a > default parquet version change. > When I use pyarrow 9 and configure version to 1.0 it works again from the > windows tool - when its 2.4 its not working (or supported in the windows > tool). > df.to_parquet(r'C:\temp\test_10.parquet', version='1.0') > df.to_parquet(r'C:\temp\test_24.parquet', version='2.4') > Question might be if such a default change is a bug or a feature. > Finally found: > * ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280) > So probably its a feature - and we need to adapt our code > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17352) Parquet files cannot be opened in Windows Parquet Viewer when stored with Arrow Version 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577537#comment-17577537 ] Will Jones commented on ARROW-17352: Yes, it's intentional that we increased the default version of Parquet. Hopefully Windows Parquet Viewer will add support for the more recent version soon. > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0 > - > > Key: ARROW-17352 > URL: https://issues.apache.org/jira/browse/ARROW-17352 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet >Affects Versions: 9.0.0 > Environment: Windows10 >Reporter: Oliver Klein >Priority: Critical > Attachments: arrow9error.PNG > > > Parquet files cannot be opened in Windows Parquet Viewer when stored with > Arrow Version 9.0.0. It worked when stored with version 8 and earlier. > Windows Parquet Viewer: 2.3.5 and 2.3.6 > pyarrow version: 9.0.0 > Error: System.AggregateException: One or more errors occured. ---> > Parquet.ParquetException: encoding RLE_DICTIONARY is not supported. > at Parquet.File.DataColumnReader.ReadColumn(BinaryReader reader ... in > DataColumnReader.cs: line 259 > > After further checking I found that it seems the problem seems to relate to a > default parquet version change. > When I use pyarrow 9 and configure version to 1.0 it works again from the > windows tool - when its 2.4 its not working (or supported in the windows > tool). > df.to_parquet(r'C:\temp\test_10.parquet', version='1.0') > df.to_parquet(r'C:\temp\test_24.parquet', version='2.4') > Question might be if such a default change is a bug or a feature. > Finally found: > * ARROW-12203 - [C++][Python] Switch default Parquet version to 2.4 (#13280) > So probably its a feature - and we need to adapt our code > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-17338: --- Assignee: Xianyang Liu (was: Todd Farmer) > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > We got a IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is the following code of `BaseVariableWidthVector.handleSafe` > could fail to relocated because of int overflow and then led to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 
0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
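The int overflow in `handleSafe` can be reproduced with a stdlib-only Python sketch that emulates Java's 32-bit wraparound. The concrete numbers below are chosen to approximate the stack trace (an offset near `Integer.MAX_VALUE` plus a 13-byte value); the suggested fix is hedged as widening the sum to 64 bits, i.e. `(long) startOffset + dataLength`:

```python
def to_int32(x):
    """Wrap x the way Java's 32-bit int arithmetic does."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


start_offset = 2_147_483_640   # Java int, near Integer.MAX_VALUE
data_length = 13
capacity = 2_147_483_648       # valueBuffer.capacity() returns a long

# Java: startOffset + dataLength is evaluated in 32-bit int math first,
# so the sum wraps negative and the capacity check wrongly passes.
needed_int32 = to_int32(start_offset + data_length)
assert needed_int32 < 0
assert capacity >= needed_int32   # no realloc -> IndexOutOfBounds later

# Fix sketch: widen before adding, e.g. (long) startOffset + dataLength.
needed_long = start_offset + data_length
assert capacity < needed_long     # reallocDataBuffer() is triggered
```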
[jira] [Commented] (ARROW-17335) [Python] Type checking support
[ https://issues.apache.org/jira/browse/ARROW-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577564#comment-17577564 ] Joris Van den Bossche commented on ARROW-17335: --- Mypy doesn't use pyi files when eg doing `mypy pyarrow`? > [Python] Type checking support > -- > > Key: ARROW-17335 > URL: https://issues.apache.org/jira/browse/ARROW-17335 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Jorrick Sleijster >Priority: Major > Original Estimate: 10h > Remaining Estimate: 10h > > h1. mypy and static type checking > As of Python3.6, it has been possible to introduce typing information in the > code. This became immensely popular in a short period of time. Shortly after, > the tool `mypy` arrived and this has become the industry standard for static > type checking inside Python. It is able to check very quickly for invalid > types which makes it possible to serve as a pre-commit. It has raised many > bugs that I did not see myself and has been a very valuable tool. > h2. Now what does this mean for PyArrow? > When we run mypy on code that uses PyArrow, you will get error message as > follows: > ``` > some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": > module is installed, but missing library stubs or py.typed marker > some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing > "pyarrow.fs": module is installed, but missing library stubs or py.typed > marker > ``` > More information is available here: > [https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker] > h2. You can solve this in three ways: > # Ignore the message. 
This, however, will put all types from PyArrow to > `Any`, making it unable to find user errors with the PyArrow library > # Create a Python stub file. This is what previously used to be the > standard, however, it no longer a popular option. This is because stubs are > extra, next to the source code, while you can also inline the code with type > hints, which brings me to our third option. > # Create a `py.typed` file and use inline type hints. This is the most > popular option today because it requires no extra files (except for the > py.typed file), allows all the type hints to be with the code (like now in > the documentation) and not only provides your users but also the developers > of the library themselves with type hints (and hinting of issues inside your > IDE). > > My personal opinion already shines through the options, it is 3 as this has > shortly become the industry standard since the introduction. > h2. What should we do? > I'd very much like to work on this, however, I don't feel like wasting time. > Therefore, I am raising this ticket to see if this had been considered before > or if we just didn't get to this yet. > I'd like to open the discussion here: > # Do you agree with number #3 as type hints. > # Should we remove the documentation annotations for the type hints given > they will be inside the functions? Or should we keep it and specify it in the > code? Which would make it double. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
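Option 3 from the list above (inline hints plus a `py.typed` marker, per PEP 561) can be sketched with a hypothetical package `mypkg`; the function and file names are illustrative only:

```python
# mypkg/__init__.py -- inline type hints shipped with the package
from typing import List


def nbytes_of(values: List[str]) -> int:
    """Inline-annotated API (option 3): the types live with the code."""
    return sum(len(v.encode()) for v in values)


# PEP 561: ship an empty marker file `mypkg/py.typed` next to the code
# and include it in the packaging metadata (e.g. package_data in
# setup.py) so mypy reads the inline hints instead of reporting
# "missing library stubs or py.typed marker".
print(nbytes_of(["foo", "bar"]))  # prints 6
```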
[jira] [Commented] (ARROW-17338) [Java] The maximum request memory of BaseVariableWidthVector should limit to Interger.MAX_VALUE
[ https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577566#comment-17577566 ] Todd Farmer commented on ARROW-17338: - [~coneyliu] : Thank you for the bug report and the pull request! Your account has been given the "contributor" role, which allows ARROW issues - such as this - to be assigned to you. I've assigned this issue to you to reflect your already-made contributions - please assign the issue to me if you have any concern with that. > [Java] The maximum request memory of BaseVariableWidthVector should limit to > Interger.MAX_VALUE > --- > > Key: ARROW-17338 > URL: https://issues.apache.org/jira/browse/ARROW-17338 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > We got a IndexOutOfBoundsException: > ``` > 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: > Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): > java.lang.IndexOutOfBoundsException: index: 2147312542, length: 13 > (expected: range(0, 2147483648)) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699) > at > org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235) > at > 
org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74) > at > org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51) > at > org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35) > at > org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134) > at > org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98) > at > org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79) > at > org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > ``` > The root cause is the following code of 
`BaseVariableWidthVector.handleSafe` > could fail to relocated because of int overflow and then led to > `IndexOutOfBoundsException` when we put the data into the vector. > ```java > protected final void handleSafe(int index, int dataLength) { > while (index >= getValueCapacity()) { > reallocValidityAndOffsetBuffers(); > } > final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1); > // startOffset + dataLength could overflow > while (valueBuffer.capacity() < (startOffset + dataLength)) { > reallocDataBuffer(); > } > } > ``` > The offset width of `BaseVariableWidthVector` is 4, while the maximum memory > allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17169) [Go] goPanicIndex in firstTimeBitmapWriter.Finish()
[ https://issues.apache.org/jira/browse/ARROW-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577569#comment-17577569 ] Matthew Topol commented on ARROW-17169: --- [~Purdom] any updates here? > [Go] goPanicIndex in firstTimeBitmapWriter.Finish() > --- > > Key: ARROW-17169 > URL: https://issues.apache.org/jira/browse/ARROW-17169 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 9.0.0, 8.0.1 > Environment: go (1.18.3), Linux, AMD64 >Reporter: Robert Purdom >Priority: Critical > > I'm working with complex parquet files with 500+ "root" columns where some > fields are lists of structs, internally referred to as 'topics'. Some of > these structs have 100's of columns. When reading a particular topic, I get > an Index Panic at the line indicated below. This error occurs when the value > for the topic is Null, as in, for this particular root record, this topic has > no data. The root is household data, the topic is auto, so the error occurs > when the household has no autos. The auto field is a Nullable List of Struct. > > {code:go} > /* Finish() was called from defLevelsToBitmapInternal. > data values when panic occurs > bw.length == 17531 > bw.bitMask == 1 > bw.pos == 3424 > bw.length == 17531 > len(bw.Buf) == 428 > cap(bw.Buf) == 448 > bw.byteOffset == 428 > bw.curByte == 0 > */ > // bitmap_writer.go > func (bw *firstTimeBitmapWriter) Finish() { > // store curByte into the bitmap > if bw.length >0&& bw.bitMask !=0x01|| bw.pos < bw.length { > bw.buf[int(bw.byteOffset)] = bw.curByte // < Panic index > } > } > {code} > In every case, when the panic occurs, bw.byteOffset == len(bw.Buf). I tested > the below modification and it does remedy the bug. However, it's probably > only masking the actual bug. 
> {code:go} > // Test version: No Panic > func (bw *firstTimeBitmapWriter) Finish() { > // store curByte into the bitmap > if bw.length > 0 && bw.bitMask != 0x01 || bw.pos < bw.length { > if int(bw.byteOffset) == len(bw.Buf) { > bw.buf = append(bw.buf, bw.curByte) > } else { >bw.buf[int(bw.byteOffset)] = bw.curByte >} > } > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go
[ https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-17326: -- Component/s: FlightRPC Go SQL > [Go][FlightSQL] Add Support for FlightSQL to Go > --- > > Key: ARROW-17326 > URL: https://issues.apache.org/jira/browse/ARROW-17326 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Go, SQL >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > > Also addresses https://github.com/apache/arrow/issues/12496 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go
[ https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17326: --- Labels: pull-request-available (was: ) > [Go][FlightSQL] Add Support for FlightSQL to Go > --- > > Key: ARROW-17326 > URL: https://issues.apache.org/jira/browse/ARROW-17326 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Go, SQL >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Also addresses https://github.com/apache/arrow/issues/12496 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577573#comment-17577573 ] Kae Suarez commented on ARROW-17344: Unfortunately, recompiling with ORC on just worked out. Maybe it just grabbed the include path correctly this time. I'll switch the status until someone can reproduce this. > [C++] ORC in build breaks code using Arrow and not ORC > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Major > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17344: --- Priority: Trivial (was: Major) > [C++] ORC in build breaks code using Arrow and not ORC > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Trivial > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17344) [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible
[ https://issues.apache.org/jira/browse/ARROW-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17344: --- Summary: [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible (was: [C++] ORC in build breaks code using Arrow and not ORC) > [C++] ORC in build breaks code using Arrow and not ORC -- not reproducible > -- > > Key: ARROW-17344 > URL: https://issues.apache.org/jira/browse/ARROW-17344 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kae Suarez >Priority: Trivial > > After building Arrow from source, with ORC enabled and retrieved from a > package manager, I went to build some software that did not use ORC, and was > unable to compile because ORC was not found by Arrow internals, and I didn't > include ORC in my final CMake file (the one for my Arrow-using program), > because I simply wasn't using it. > I rebuilt Arrow without ORC for now, but the ability to include ORC as a > feature in Arrow and not have to include it in future CMakes when ORC isn't > in use would be nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17293) [Java][CI] Prune java nightly builds
[ https://issues.apache.org/jira/browse/ARROW-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Dali Susanibar Arce reassigned ARROW-17293: - Assignee: David Dali Susanibar Arce > [Java][CI] Prune java nightly builds > > > Key: ARROW-17293 > URL: https://issues.apache.org/jira/browse/ARROW-17293 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Java >Reporter: Jacob Wujciak-Jens >Assignee: David Dali Susanibar Arce >Priority: Critical > > Currently we are accumulating a huge number of nightly java jars. We should > prune them to keep max. 14 versions around. (see r_nightly.yml) > It might also be nice to always rename/copy the most recent jars to something > fixed so there is no need to update your local config to always have the > newest version? (but up to the java users if this is necessary/worth it). > > cc [~dsusanibara] [~ljw1001] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17359) [Go][FlightSQL] Create SQLite example
Matthew Topol created ARROW-17359: - Summary: [Go][FlightSQL] Create SQLite example Key: ARROW-17359 URL: https://issues.apache.org/jira/browse/ARROW-17359 Project: Apache Arrow Issue Type: New Feature Components: FlightRPC, Go, SQL Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
Matthew Roeschke created ARROW-17360: Summary: [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns Key: ARROW-17360 URL: https://issues.apache.org/jira/browse/ARROW-17360 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 8.0.1 Reporter: Matthew Roeschke xref [https://github.com/pandas-dev/pandas/issues/47944] {code:java} In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}) # pandas main branch / 1.5 In [2]: df.to_orc("abc") In [3]: pd.read_orc("abc", columns=['b', 'a']) Out[3]: a b 0 1 a 1 2 b 2 3 c In [4]: import pyarrow.orc as orc In [5]: orc_file = orc.ORCFile("abc") # reordered to a, b In [6]: orc_file.read(columns=['b', 'a']).to_pandas() Out[6]: a b 0 1 a 1 2 b 2 3 c # reordered to a, b In [7]: orc_file.read(columns=['b', 'a']) Out[7]: pyarrow.Table a: int64 b: string a: [[1,2,3]] b: [["a","b","c"]] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17170) [C++][Docs] Research Documentation Formats
[ https://issues.apache.org/jira/browse/ARROW-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez closed ARROW-17170. -- Resolution: Done We have plenty of sources, and are moving forward with the Getting Started page currently. > [C++][Docs] Research Documentation Formats > -- > > Key: ARROW-17170 > URL: https://issues.apache.org/jira/browse/ARROW-17170 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Documentation >Reporter: Kae Suarez >Assignee: Kae Suarez >Priority: Major > > In order to revise the documentation, some inspiration is needed to get the > format right. This ticket provides a space for exploration of possible > inspiration for the C++ documentation – once we have some good examples > and/or agreement, we can move to some content creation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17361) dplyr::summarize fails with division when divisor is a variable
Oliver Reiter created ARROW-17361: - Summary: dplyr::summarize fails with division when divisor is a variable Key: ARROW-17361 URL: https://issues.apache.org/jira/browse/ARROW-17361 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 8.0.0 Reporter: Oliver Reiter Hello, I found this odd behaviour when trying to compute an aggregate with dplyr::summarize: When I want to use a pre-defined variable to do a divison while aggregating, the execution fails with 'unsupported expression'. When I the value of the variable as is in the aggregation, it works. See below: {code:java} library(dplyr) library(arrow) small_dataset <- tibble::tibble( ## x = rep(c("a", "b"), each = 5), y = rep(1:5, 2) ) ## convert "small_dataset" into a ...dataset tmpdir <- tempfile() dir.create(tmpdir) write_dataset(small_dataset, tmpdir) ## works open_dataset(tmpdir) %>% summarize(value = sum(y) / 10) %>% collect() ## fails scale_factor <- 10 open_dataset(tmpdir) %>% summarize(value = sum(y) / scale_factor) %>% collect() #> Fehler: Error in summarize_eval(names(exprs)[i], #> exprs[[i]], ctx, length(.data$group_by_vars) > : # Expression sum(y)/scale_factor is not an aggregate # expression or is not supported in Arrow # Call collect() first to pull data into R. {code} I was not sure how to name this issue/bug (if it is one), so if there is a clearer, more descriptive title you're welcome to adjust. Thanks for your work! Oliver {code:java} > arrow_info() Arrow package version: 8.0.0 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc TRUE mimalloc TRUE Memory: Allocator jemalloc Current 64 bytes Max 41.25 Kb Runtime: SIMD Level avx2 Detected SIMD Level avx2 Build: C++ Library Version 8.0.0 C++ Compiler GNU C++ Compiler Version 12.1.0 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577621#comment-17577621 ] Nicola Crane commented on ARROW-15943: -- I'm not sure. >From an R perspective, if it's an option, I think it would be fine to support >passing in a list of filenames but still being able to use the directory names >as dataset variables, if that's possible (as R users are likely to be >comfortable pre-filtering the list of files). This feels like it would fit >with option 3; I am currently working on ARROW-15260 which would allow users >to add the fragment filename as a column, which they could then use to filter >on (though I recall in a conversation on that PR or ticket, you mentioning >that we can't properly do pushdown filtering yet using that?) However, you >mention the issue of loading the unwanted data into memory - I guess for these >users they might choose to use something other than arrow if this was >acceptable to them. Option 1 sounds good too. I don't fully understand what option 2 would look like, but if it's something we could wrap in R to achieve solutions to the 2 linked Stack Overflow questions, then great. Ultimately, I don't think there's an obvious best approach here, and that solving for the simplest case ("I have directories containing files, which I wish to both selectively load in some files from, but also use the directory structure to create variables") will get us most of the way there unless any super-specialist use cases emerge later. Option 1 sounds potentially simplest? 
> [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17290) [C++] Add order-comparisons for numeric scalars
[ https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577629#comment-17577629 ] Yaron Gvili commented on ARROW-17290: - {quote}I'm curious, what is the use case? {quote} See [this post|http://example.com]. > [C++] Add order-comparisons for numeric scalars > --- > > Key: ARROW-17290 > URL: https://issues.apache.org/jira/browse/ARROW-17290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, only equal-comparison of scalars are supported, by > `EqualComparable`. This issue will add order-comparisons, such as less-than, > to numeric scalars. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17290) [C++] Add order-comparisons for numeric scalars
[ https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577629#comment-17577629 ] Yaron Gvili edited comment on ARROW-17290 at 8/9/22 9:16 PM: - {quote}I'm curious, what is the use case? {quote} See [this post|https://github.com/apache/arrow/pull/13784#issuecomment-1209861142]. was (Author: JIRAUSER284707): {quote}I'm curious, what is the use case? {quote} See [this post|http://example.com]. > [C++] Add order-comparisons for numeric scalars > --- > > Key: ARROW-17290 > URL: https://issues.apache.org/jira/browse/ARROW-17290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Currently, only equal-comparison of scalars are supported, by > `EqualComparable`. This issue will add order-comparisons, such as less-than, > to numeric scalars. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-11699) [R] Implement dplyr::across() inside mutate()
[ https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-11699: - Summary: [R] Implement dplyr::across() inside mutate() (was: [R] Implement dplyr::across()) > [R] Implement dplyr::across() inside mutate() > - > > Key: ARROW-11699 > URL: https://issues.apache.org/jira/browse/ARROW-11699 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > It's not a generic, but because it seems only to be called inside of > functions like `mutate()`, we can insert our own version of it into the NSE > data mask -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17362) [R] Implement dplyr::across() inside summarise()
Nicola Crane created ARROW-17362: Summary: [R] Implement dplyr::across() inside summarise() Key: ARROW-17362 URL: https://issues.apache.org/jira/browse/ARROW-17362 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate(). Once this is merged, we should also add the ability to do so within dplyr::summarise(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
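For context, the target semantics here match plain dplyr; a sketch of the kind of query that should eventually work against an Arrow table (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)
library(arrow)

## Once supported, across() inside summarise() should translate just as
## it does on a plain data frame:
arrow_table(mtcars) %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean)) %>%
  collect()
{code}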
[jira] [Commented] (ARROW-16017) [C++] Benchmark key_hash and document tradeoffs with vendored xxhash
[ https://issues.apache.org/jira/browse/ARROW-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577635#comment-17577635 ] Aldrin Montana commented on ARROW-16017: Linking to ARROW-8991, where the PR starts to add some of this benchmarking > [C++] Benchmark key_hash and document tradeoffs with vendored xxhash > > > Key: ARROW-16017 > URL: https://issues.apache.org/jira/browse/ARROW-16017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > > ARROW-15239 adds a vectorized hashing function for use initially in execution > engine bloom filters and later in the hash-join node. We should add some > benchmarks to explore how the performance compares to the vendored scalar > xxhash implementation. In addition, we should document where the two differ > and explain any tradeoffs for future users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577637#comment-17577637 ] Weston Pace commented on ARROW-15943: - I agree that option 1 is the simplest and probably preferred option. > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
[ https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15943: Labels: dataset good-second-issue (was: dataset) > [C++] Filter which files to be read in as part of filesystem, filtered using > a string > - > > Key: ARROW-15943 > URL: https://issues.apache.org/jira/browse/ARROW-15943 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: dataset, good-second-issue > > There is a report from a user (see this Stack Overflow post [1]) who has used > the {{basename_template}} parameter to write files to a dataset, some of > which have the prefix {{"summary"}} and others which have the prefix > {{"prediction"}}. This data is saved in partitioned directories. They > want to be able to read back in the data, so that, as well as the partition > variables in their dataset, they can choose which subset (predictions vs. > summaries) to read back in. > This isn't currently possible; if they try to open a dataset with a list of > files, they cannot read it in as partitioned data. > A short-term solution is to suggest they change the structure of how their > data is stored, but it could be useful to be able to pass in some sort of > filter to determine which files get read in as a dataset. > > [1] > https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17363) [C++][Compute] Add Cryptographic hash functions to Acero
Aldrin Montana created ARROW-17363: -- Summary: [C++][Compute] Add Cryptographic hash functions to Acero Key: ARROW-17363 URL: https://issues.apache.org/jira/browse/ARROW-17363 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Aldrin Montana We would like to add cryptographic hash function kernels to Acero (MD5, SHA1, etc.). At this time, there are no particular cryptographic functions we want to start adding, but the only hash functions available seem to be variants or specializations of xxHash, which is not appropriate for cryptographic use cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17363) [C++][Compute] Add Cryptographic hash functions to Acero
[ https://issues.apache.org/jira/browse/ARROW-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577641#comment-17577641 ] Aldrin Montana commented on ARROW-17363: Linking to ARROW-8991, which adds some hashing functions to the compute API > [C++][Compute] Add Cryptographic hash functions to Acero > > > Key: ARROW-17363 > URL: https://issues.apache.org/jira/browse/ARROW-17363 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Aldrin Montana >Priority: Minor > Labels: C++, compute, cryptography, query-engine > > We would like to add cryptographic hash function kernels to Acero (MD5, SHA1, > etc.). > At this time, there are no particular cryptographic functions we want to > start adding, but the only hash functions available seem to be variants or > specializations of xxHash, which is not appropriate for cryptographic use > cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17364) [R] Implement .names argument inside across()
Nicola Crane created ARROW-17364: Summary: [R] Implement .names argument inside across() Key: ARROW-17364 URL: https://issues.apache.org/jira/browse/ARROW-17364 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
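For reference, {{.names}} in plain dplyr is a glue-style specification for the output column names; the behaviour to replicate looks like this (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)

## "{.col}" expands to each input column name, so this produces
## columns named mpg_mean and hp_mean:
mtcars %>%
  summarise(across(c(mpg, hp), mean, .names = "{.col}_mean"))
{code}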
[jira] [Created] (ARROW-17365) [R] Implement ... argument inside across()
Nicola Crane created ARROW-17365: Summary: [R] Implement ... argument inside across() Key: ARROW-17365 URL: https://issues.apache.org/jira/browse/ARROW-17365 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17365: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. was:ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17365: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. There is a failing test in the PR for ARROW-11699 which references this JIRA. was: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577648#comment-17577648 ] Nicola Crane commented on ARROW-17365: -- We should not implement this as the {{...}} argument [is deprecated|https://github.com/tidyverse/dplyr/blob/HEAD/R/across.R#L36] > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17365) [R] Implement ... argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane closed ARROW-17365. Resolution: Won't Fix > [R] Implement ... argument inside across() > -- > > Key: ARROW-17365 > URL: https://issues.apache.org/jira/browse/ARROW-17365 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{...}} argument is not yet supported but should be added. There is a > failing test in the PR for ARROW-11699 which references this JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17364) [R] Implement .names argument inside across()
[ https://issues.apache.org/jira/browse/ARROW-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17364: - Description: ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. Additional tests looking at different ways of specifying the {{.fns}} argument should be re-enabled (see tests with this ticket number in their comments, and https://github.com/tidyverse/dplyr/issues/6395 for more context). was:ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. > [R] Implement .names argument inside across() > - > > Key: ARROW-17364 > URL: https://issues.apache.org/jira/browse/ARROW-17364 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The > {{.names}} argument is not yet supported but should be added. > Additional tests looking at different ways of specifying the {{.fns}} > argument should be re-enabled (see tests with this ticket number in their > comments, and https://github.com/tidyverse/dplyr/issues/6395 for more > context). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()
Nicola Crane created ARROW-17366: Summary: [R] Support purrr-style lambda functions in .fns argument to across() Key: ARROW-17366 URL: https://issues.apache.org/jira/browse/ARROW-17366 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for dplyr::across inside a mutate(). The .fns argument does not yet support purrr-style lambda functions (e.g. {{~round(.x, digits = -1)}}), but support should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
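For reference, the purrr-style lambda syntax to be supported is shorthand for an anonymous function of {{.x}}; in plain dplyr it behaves like this (using the built-in {{mtcars}} data as a stand-in):
{code:java}
library(dplyr)

## ~ round(.x, digits = -1) is shorthand for
## function(.x) round(.x, digits = -1), applied to each selected column:
mtcars %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = -1)))
{code}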