[jira] [Assigned] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-17079:


Assignee: Philipp Moritz

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Dear all,
> I'd like to see if there is interest to improve the error messages that 
> originate from the AWS SDK. Especially for loading datasets from S3, there 
> are many things that can go wrong and the error messages that (Py)Arrow gives 
> are not always the most actionable, especially if the call involves many 
> different SDK functions. In particular, it would be great to have the 
> following attached to each error message:
>  * A machine parseable status code from the AWS SDK
>  * Information as to exactly which AWS SDK call failed, so it can be 
> disambiguated for Arrow API calls that use multiple AWS SDK calls
> In the ideal case, as a developer I could reconstruct the AWS SDK call that 
> failed from the error message (e.g. in a form that allows me to run the API 
> call via the "aws" CLI program) so I can debug errors and see how they relate 
> to my AWS infrastructure. Any progress in this direction would be super 
> helpful.
>  
> For context: I recently was debugging some permissioning issues in S3 based 
> on the current error codes and it was pretty hard to figure out what was 
> going on (see 
> https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602).
>  
> I'm happy to take a stab at this problem but might need some help. Is 
> implementing a custom StatusDetail class for AWS errors and propagating 
> errors that way the right hunch here? 
> [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110]
>  
> All the best,
> Philipp.
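
To make the StatusDetail question above more concrete, here is a minimal sketch of what such a detail object might carry. It is only an illustration under stated assumptions: the `StatusDetail` base class below is a local stand-in that mirrors the two virtual methods of the interface in arrow/status.h so the snippet compiles without Arrow, and `S3ErrorDetail`, `MakeS3ErrorDetail`, and the type id string are hypothetical names, not existing Arrow or AWS SDK APIs.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Local stand-in for the StatusDetail interface in arrow/status.h, so that this
// sketch compiles on its own; real code would derive from arrow::StatusDetail.
class StatusDetail {
 public:
  virtual ~StatusDetail() = default;
  virtual const char* type_id() const = 0;
  virtual std::string ToString() const = 0;
};

// Hypothetical detail object (all names here are illustrative). It keeps the
// two pieces of information requested in the issue: which SDK call failed and
// a machine-parseable error code, plus the SDK's human-readable message.
class S3ErrorDetail : public StatusDetail {
 public:
  S3ErrorDetail(std::string operation, std::string error_code, std::string message)
      : operation_(std::move(operation)),
        error_code_(std::move(error_code)),
        message_(std::move(message)) {
    // An empty operation name would defeat the purpose of the detail.
    assert(!operation_.empty());
  }

  const char* type_id() const override { return "hypothetical::S3ErrorDetail"; }

  std::string ToString() const override {
    return "AWS SDK call " + operation_ + " failed with " + error_code_ + ": " +
           message_;
  }

  // Accessors so callers can branch on the code instead of parsing text.
  const std::string& operation() const { return operation_; }
  const std::string& error_code() const { return error_code_; }

 private:
  std::string operation_;
  std::string error_code_;
  std::string message_;
};

// Convenience factory returning the detail through the base interface.
std::shared_ptr<StatusDetail> MakeS3ErrorDetail(std::string operation,
                                                std::string error_code,
                                                std::string message) {
  return std::make_shared<S3ErrorDetail>(std::move(operation), std::move(error_code),
                                         std::move(message));
}
```

A caller that receives such a detail could check `type_id()`, downcast, and read `operation()` and `error_code()` directly, which is what would make the error machine-parseable rather than something to be scraped out of a message string.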



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17079:
-
Summary: [C++] Improve error message propagation from AWS SDK  (was: 
Improve error message propagation from AWS SDK)

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-17079) Improve error message propagation from AWS SDK

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17079:
---
Labels: pull-request-available  (was: )

> Improve error message propagation from AWS SDK
> --
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-16340) [C++][Python] Move all Python related code into PyArrow

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-16340:
-
Summary: [C++][Python] Move all Python related code into PyArrow  (was: 
[Python] Move all Python related code into PyArrow)

> [C++][Python] Move all Python related code into PyArrow
> ---
>
> Key: ARROW-16340
> URL: https://issues.apache.org/jira/browse/ARROW-16340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 32h 10m
>  Remaining Estimate: 0h
>
> Move {{src/arrow/python}} directory into {{pyarrow}} and arrange PyArrow to 
> build it.
> More details can be found on this thread:
> https://lists.apache.org/thread/jbxyldhqff4p9z53whhs95y4jcomdgd2





[jira] [Created] (ARROW-17535) [Python] List arrays aren't supported in to_pandas calls

2022-08-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-17535:
---

 Summary: [Python] List arrays aren't supported in 
to_pandas calls
 Key: ARROW-17535
 URL: https://issues.apache.org/jira/browse/ARROW-17535
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Micah Kornfield


EXTENSION is not in the list of types allowed.  I think in order to enable 
EXTENSION we need to be able to call to_pylist or similar on the original 
extension array from C++ code, in case there were user-provided overrides.  Off 
the top of my head, one way of doing this would be to pass through an additional 
std::unordered_map where PyObject is the bound to_pylist 
Python function.  Are there other alternatives that might be cleaner?





[jira] [Assigned] (ARROW-17534) [C++] Support optional arguments in aggregation function mapping in the Substrait consumer.

2022-08-25 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-17534:


Assignee: Vibhatha Lakmal Abeykoon

> [C++] Support optional arguments in aggregation function mapping in the 
> Substrait consumer.
> ---
>
> Key: ARROW-17534
> URL: https://issues.apache.org/jira/browse/ARROW-17534
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: substrait
>
> It appears that {{sum}} and {{avg}} have an optional enum argument to specify 
> overflow behavior.  I'm not certain if I just missed this or if it is new.  
> Either way the current function mapping does not account for this.





[jira] [Commented] (ARROW-17519) [R] RTools35 job is failing

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585073#comment-17585073
 ] 

Kouhei Sutou commented on ARROW-17519:
--

Thanks.
It seems that the discussion you linked, 
https://lists.apache.org/thread/9g14n3odhj6kzsgjxr6k6d3q73hg2njr , covers 
R 3.5 on Windows:

{quote}
It is reasonable to drop support for R < 4.0 on Windows, as 
suggested in this JIRA comment: 
https://issues.apache.org/jira/browse/ARROW-17110?focusedCommentId=17571472&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17571472
{quote}

> [R] RTools35 job is failing
> ---
>
> Key: ARROW-17519
> URL: https://issues.apache.org/jira/browse/ARROW-17519
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>
> After ARROW-17436, the RTools35 job is consistently failing with:
> {noformat}
> Error: Error: package or namespace load failed for 'arrow' in inDL(x, 
> as.logical(local), as.logical(now), ...):
>  unable to load shared object 
> 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll':
>   LoadLibrary failure:  A dynamic link library (DLL) initialization routine 
> failed.
> {noformat}
> Given that there is a mailing list discussion about dropping support for that 
> platform, should we disable the check? Or wait until that is resolved to 
> disable the check?





[jira] [Comment Edited] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585071#comment-17585071
 ] 

Kouhei Sutou edited comment on ARROW-17531 at 8/26/22 12:36 AM:


Thanks.
Could you try the following instead?

{noformat}
install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/centos7/latest")
{noformat}


was (Author: kou):
Thanks.
Could you try {{install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/centos7/latest")}} instead?

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>
> Hi, I've followed the 
> [instructions|https://cran.r-project.org/web/packages/arrow/vignettes/install.html] 
> to install the arrow R package on a Linux machine. 
> {noformat}
> options(
>   HTTPUserAgent =
>     sprintf(
>       "R/%s R (%s)",
>       getRversion(),
>       paste(getRversion(), R.version["platform"], R.version["arch"], 
> R.version["os"])
>     )
> )
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
> {noformat}
> The installation was successful, but when I loaded the library I received an 
> error message: 
> {noformat}
> /lib64/libm.so.6: version `GLIBC_2.27' not found
> {noformat}
> Here's my full log, containing machine information
> {noformat}
> > HTTPUserAgent =
> +     sprintf(
> +         "R/%s R (%s)",
> +         getRversion(),
> +         paste(getRversion(), R.version["platform"], R.version["arch"], 
> R.version["os"])
> +     )
> > HTTPUserAgent
> [1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> > install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
> Installing package into 
> ‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
> (as ‘lib’ is unspecified)
> trying URL 
> 'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
> Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
> ==
> downloaded 33.1 MB
> * installing *binary* package ‘arrow’ ...
> * DONE (arrow)
> The downloaded source packages are in
>     ‘/tmp/RtmpUfdX4s/downloaded_packages’
> > library(arrow)
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
>   /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
> /users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
> In addition: Warning message:
> package ‘arrow’ was built under R version 4.0.5 
> {noformat}
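
The load failure quoted above comes down to a version comparison: the binary asks for the GLIBC_2.27 symbol version, and the system libm does not provide it. A minimal sketch of that check follows. The helper names `ParseGlibcVersion` and `SatisfiesGlibc` are hypothetical, and the version strings in the usage note are assumptions based on the glibc releases that CentOS 7 and Ubuntu focal are generally known to ship, not values taken from this report.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Parse a "major.minor" glibc version string such as "2.17" into a pair.
// Hypothetical helper; assumes the string contains exactly one dot.
std::pair<int, int> ParseGlibcVersion(const std::string& version) {
  const std::size_t dot = version.find('.');
  assert(dot != std::string::npos);
  return {std::stoi(version.substr(0, dot)), std::stoi(version.substr(dot + 1))};
}

// True when the available glibc is at least the version a binary requires.
// std::pair compares lexicographically, i.e. major first, then minor.
bool SatisfiesGlibc(const std::string& available, const std::string& required) {
  return ParseGlibcVersion(available) >= ParseGlibcVersion(required);
}
```

For example, `SatisfiesGlibc("2.17", "2.27")` is false, which would correspond to an older system such as CentOS 7 trying to load a binary built on a newer distribution, while `SatisfiesGlibc("2.31", "2.27")` is true. On a glibc system the available version string could be obtained from `gnu_get_libc_version()` in `<gnu/libc-version.h>`.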





[jira] [Comment Edited] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585071#comment-17585071
 ] 

Kouhei Sutou edited comment on ARROW-17531 at 8/26/22 12:36 AM:


Thanks.
Could you try {{install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/centos7/latest")}} instead?


was (Author: kou):
Thanks.
Could you try {{install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/centos7/latest}} instead?

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>





[jira] [Commented] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585071#comment-17585071
 ] 

Kouhei Sutou commented on ARROW-17531:
--

Thanks.
Could you try {{install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/centos7/latest}} instead?

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>





[jira] [Comment Edited] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585037#comment-17585037
 ] 

Kouhei Sutou edited comment on ARROW-17531 at 8/26/22 12:32 AM:


{noformat}
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2    packrat_0.6.0
{noformat}
Here's the session info


was (Author: JIRAUSER294961):
``
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2    packrat_0.6.0
```
Here's the session info

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>





[jira] [Created] (ARROW-17534) [C++] Support optional arguments in aggregation function mapping in the Substrait consumer.

2022-08-25 Thread Weston Pace (Jira)
Weston Pace created ARROW-17534:
---

 Summary: [C++] Support optional arguments in aggregation function 
mapping in the Substrait consumer.
 Key: ARROW-17534
 URL: https://issues.apache.org/jira/browse/ARROW-17534
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


It appears that {{sum}} and {{avg}} have an optional enum argument to specify 
overflow behavior.  I'm not certain if I just missed this or if it is new.  
Either way the current function mapping does not account for this.





[jira] [Updated] (ARROW-16772) [C++] Implement encode and decode functions for Run-Length encoding

2022-08-25 Thread Aldrin Montana (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldrin Montana updated ARROW-16772:
---
Component/s: C++

> [C++] Implement encode and decode functions for Run-Length encoding
> ---
>
> Key: ARROW-16772
> URL: https://issues.apache.org/jira/browse/ARROW-16772
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Net Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585037#comment-17585037
 ] 

Net Zhang commented on ARROW-17531:
---

``
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2    packrat_0.6.0
```
Here's the session info

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>





[jira] [Commented] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585032#comment-17585032
 ] 

Kouhei Sutou commented on ARROW-17531:
--

Could you show your OS information?

> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17531:
-
Description: 
Hi, I've followed the [instructions 
|https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
install the arrow R package on a Linux machine. 

{noformat}

options(
  HTTPUserAgent =
    sprintf(
      "R/%s R (%s)",
      getRversion(),
      paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
    )
)

install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/focal/latest")

{noformat}

The installation was successful, but when I loaded the library I received an 
error message indicating:

{noformat}

/lib64/libm.so.6: version `GLIBC_2.27' not found

{noformat}

Here's my full log, containing machine information

{noformat}

> HTTPUserAgent =
+     sprintf(
+         "R/%s R (%s)",
+         getRversion(),
+         paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
+     )
> HTTPUserAgent
[1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Installing package into 
‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 
'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
==
downloaded 33.1 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
    ‘/tmp/RtmpUfdX4s/downloaded_packages’
> library(arrow)
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
  /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
In addition: Warning message:
package ‘arrow’ was built under R version 4.0.5 

{noformat}

  was:
Hi, I've followed the [instructions 
|https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
install the arrow R package on a Linux machine. 

{noformat}

options(
  HTTPUserAgent =
    sprintf(
      "R/%s R (%s)",
      getRversion(),
      paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
    )
)

install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/focal/latest")

```

The installation was successful, but when I loaded the library I received an 
error message indicating:

```

/lib64/libm.so.6: version `GLIBC_2.27' not found

```

Here's my full log, containing machine information

```

> HTTPUserAgent =
+     sprintf(
+         "R/%s R (%s)",
+         getRversion(),
+         paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
+     )
> HTTPUserAgent
[1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Installing package into 
‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 
'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
==
downloaded 33.1 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
    ‘/tmp/RtmpUfdX4s/downloaded_packages’
> library(arrow)
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
  /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
In addition: Warning message:
package ‘arrow’ was built under R version 4.0.5 

{noformat}


> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>
> Hi, I've followed the [instructions 
> |https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
> install the arrow R package on a Linux machine. 
> {noformat}
> options(
>   HTTPUserAgent =
>     sprintf(
>       "R/%s R (%s)",
>       getRversion(),
>       paste(getRversion(), R.version["platform"], R.version["arch"], 
> R.version["os"])
>     )
> )
> 

[jira] [Updated] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17531:
-
Description: 
Hi, I've followed the [instructions 
|https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
install the arrow R package on a Linux machine. 

{noformat}

options(
  HTTPUserAgent =
    sprintf(
      "R/%s R (%s)",
      getRversion(),
      paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
    )
)

install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/focal/latest")

```

The installation was successful, but when I loaded the library I received an 
error message indicating:

```

/lib64/libm.so.6: version `GLIBC_2.27' not found

```

Here's my full log, containing machine information

```

> HTTPUserAgent =
+     sprintf(
+         "R/%s R (%s)",
+         getRversion(),
+         paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
+     )
> HTTPUserAgent
[1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Installing package into 
‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 
'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
==
downloaded 33.1 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
    ‘/tmp/RtmpUfdX4s/downloaded_packages’
> library(arrow)
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
  /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
In addition: Warning message:
package ‘arrow’ was built under R version 4.0.5 

{noformat}

  was:
Hi, I've followed the [instructions 
|https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
install the arrow R package on a Linux machine. 

```

options(
  HTTPUserAgent =
    sprintf(
      "R/%s R (%s)",
      getRversion(),
      paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
    )
)

install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/focal/latest")

```

The installation was successful, but when I loaded the library I received an 
error message indicating:

```

/lib64/libm.so.6: version `GLIBC_2.27' not found

```

Here's my full log, containing machine information

```

> HTTPUserAgent =
+     sprintf(
+         "R/%s R (%s)",
+         getRversion(),
+         paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
+     )
> HTTPUserAgent
[1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Installing package into 
‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 
'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
==
downloaded 33.1 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
    ‘/tmp/RtmpUfdX4s/downloaded_packages’
> library(arrow)
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
  /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
In addition: Warning message:
package ‘arrow’ was built under R version 4.0.5 

```


> /lib64/libm.so.6: version `GLIBC_2.27' not found
> 
>
> Key: ARROW-17531
> URL: https://issues.apache.org/jira/browse/ARROW-17531
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
>Reporter: Net Zhang
>Priority: Major
>
> Hi, I've followed the [instructions 
> |https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
> install the arrow R package on a Linux machine. 
> {noformat}
> options(
>   HTTPUserAgent =
>     sprintf(
>       "R/%s R (%s)",
>       getRversion(),
>       paste(getRversion(), R.version["platform"], R.version["arch"], 
> R.version["os"])
>     )
> )
> install.packages("arrow", repos = 
> 

[jira] [Created] (ARROW-17533) [R] Implement asof join

2022-08-25 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17533:
--

 Summary: [R] Implement asof join
 Key: ARROW-17533
 URL: https://issues.apache.org/jira/browse/ARROW-17533
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


With ARROW-16083 we have asof joins, could we expose this in R?

Docs for the node: 
https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE

A possible syntax might be (there does not appear to be a syntax in dplyr for 
this already): 

{code}
asof_join(table1, table2, by = "field", tolerance = 1) 
{code}







[jira] [Commented] (ARROW-17533) [R] Implement asof join

2022-08-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585027#comment-17585027
 ] 

Jonathan Keane commented on ARROW-17533:


A bit more prior art, and folks asking for this: 
https://stackoverflow.com/questions/58538114/is-there-an-r-equivalent-of-pythons-pandas-merge-asof

> [R] Implement asof join
> ---
>
> Key: ARROW-17533
> URL: https://issues.apache.org/jira/browse/ARROW-17533
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> With ARROW-16083 we have asof joins, could we expose this in R?
> Docs for the node: 
> https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE
> A possible syntax might be (there does not appear to be a syntax in dplyr for 
> this already): 
> {code}
> asof_join(table1, table2, by = "field", tolerance = 1) 
> {code}





[jira] [Created] (ARROW-17532) [Go] Implement Numeric Cast functions

2022-08-25 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17532:
-

 Summary: [Go] Implement Numeric Cast functions
 Key: ARROW-17532
 URL: https://issues.apache.org/jira/browse/ARROW-17532
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol








[jira] [Commented] (ARROW-17528) [R] Tidy up the pkgdown articles site index

2022-08-25 Thread Stephanie Hazlitt (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585023#comment-17585023
 ] 

Stephanie Hazlitt commented on ARROW-17528:
---

We could consider a few broad categories for the first level e.g. Developers 
(already there), Installation, Users.

> [R] Tidy up the pkgdown articles site index 
> 
>
> Key: ARROW-17528
> URL: https://issues.apache.org/jira/browse/ARROW-17528
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> We could better organise the different articles we have to make it easier for 
> users to find the right info





[jira] [Commented] (ARROW-12711) [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()

2022-08-25 Thread Travis Lim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585019#comment-17585019
 ] 

Travis Lim commented on ARROW-12711:


[~icook] Any updates on bindings for dplyr summarise with paste(collapse) or 
str_c(collapse) in upcoming releases?

A potential workaround was floated for Python here 
https://issues.apache.org/jira/browse/ARROW-12710 but having this in R would be 
a game changer, especially for NLP applications :pray: :pray: :pray:

 

> [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()
> 
>
> Key: ARROW-12711
> URL: https://issues.apache.org/jira/browse/ARROW-12711
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> These are the aggregating versions of string concatenation—they combine 
> values from a set of rows into a single value. 
> The bindings for {{paste()}} and {{str_c()}} might be tricky to implement 
> because when these functions are called with the {{collapse}} argument 
> unset, they do _not_ aggregate.
> In {{summarise()}} we need to be able to use scalar concatenation within 
> aggregate concatenation, like this: 
> {code:r}
> starwars %>%
>   filter(!is.na(hair_color) & !is.na(eye_color)) %>% 
>   group_by(homeworld) %>% 
>   summarise(hair_and_eyes = paste0(paste0(hair_color, "-haired and ", 
> eye_color, "-eyed"), collapse = ", ")){code}
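
To make the distinction between scalar and aggregate concatenation concrete, 
here is a rough sketch in plain Java rather than R or Arrow. CollapseSketch, 
Person, describe and hairAndEyesByHomeworld are hypothetical names invented for 
this illustration. describe() plays the role of the inner paste0() (one string 
per row), and Collectors.joining(", ") plays the role of collapse = ", " (one 
string per group). The sketch assumes every row has a non-null homeworld.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: hypothetical names, not part of Arrow or dplyr.
final class CollapseSketch {

  // Hypothetical row type standing in for one row of the starwars table.
  static final class Person {
    final String homeworld;
    final String hairColor;
    final String eyeColor;

    Person(String homeworld, String hairColor, String eyeColor) {
      this.homeworld = homeworld;
      this.hairColor = hairColor;
      this.eyeColor = eyeColor;
    }

    String getHomeworld() {
      return homeworld;
    }
  }

  // Scalar concatenation: produces one string for one row.
  static String describe(Person p) {
    return p.hairColor + "-haired and " + p.eyeColor + "-eyed";
  }

  // Aggregate concatenation: produces one string per homeworld by joining
  // the per-row strings with ", ". Rows with a missing hair or eye colour
  // are dropped first, mirroring the filter() step. Assumes homeworld is
  // never null, because groupingBy rejects null keys.
  static Map<String, String> hairAndEyesByHomeworld(List<Person> rows) {
    return rows.stream()
        .filter(p -> p.hairColor != null && p.eyeColor != null)
        .collect(Collectors.groupingBy(
            Person::getHomeworld,
            TreeMap::new,
            Collectors.mapping(CollapseSketch::describe, Collectors.joining(", "))));
  }
}
```

The point of the sketch is only that the inner call runs once per row while 
the outer call runs once per group, which is the nesting the bindings would 
need to support inside summarise().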





[jira] [Updated] (ARROW-17527) [Go] Implement Cast to Boolean Functions

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17527:
---
Labels: pull-request-available  (was: )

> [Go] Implement Cast to Boolean Functions
> 
>
> Key: ARROW-17527
> URL: https://issues.apache.org/jira/browse/ARROW-17527
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-17530) [Java] VectorSchemaRoot#addVector() cannot add a vector to the end of the current vector collection

2022-08-25 Thread Larry White (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry White updated ARROW-17530:

Description: 
The current implementation of Java VectorSchemaRoot cannot add a vector at the 
end of the current list (which is the generally understood meaning of "add").

The Precondition check in the method's second line prevents providing an 
appropriate index for adding at the end:
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index < fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  for (int i = 0; i < fieldVectors.size(); i++) {
    if (i == index) {
      newVectors.add(vector);
    }
    newVectors.add(fieldVectors.get(i));
  }
  return new VectorSchemaRoot(newVectors);
}
 {code}
One possible implementation resolving the issue is shown below.
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index <= fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  if (index == fieldVectors.size()) {
    newVectors.addAll(fieldVectors);
    newVectors.add(vector);
  } else {
    for (int i = 0; i < fieldVectors.size(); i++) {
      if (i == index) {
        newVectors.add(vector);
      }
      newVectors.add(fieldVectors.get(i));
    }
  }
  return new VectorSchemaRoot(newVectors);
}
{code}
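
For illustration only, the intended semantics can be sketched outside of Arrow. 
The snippet below is a minimal, self-contained sketch and not the 
VectorSchemaRoot code: InsertAtIndexSketch and insertAt are hypothetical names, 
and a plain List<T> stands in for the fieldVectors list. It relies on 
java.util.ArrayList#add(int, E), which accepts an index equal to size() and 
appends in that case, so a separate branch for the end position may not be 
needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the copy-and-insert logic discussed above;
// it works on any List<T> and is not part of the Arrow API.
final class InsertAtIndexSketch {

  // Returns a new list with `element` inserted at `index`.
  // index == source.size() appends; anything outside [0, size] is rejected.
  // The source list is left unmodified.
  static <T> List<T> insertAt(List<T> source, int index, T element) {
    if (element == null) {
      throw new NullPointerException("element must not be null");
    }
    if (index < 0 || index > source.size()) {
      throw new IllegalArgumentException("index out of range: " + index);
    }
    List<T> copy = new ArrayList<>(source);
    // ArrayList.add(int, E) allows index == size() and appends in that case.
    copy.add(index, element);
    return copy;
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("a", "b");
    System.out.println(insertAt(fields, 0, "x")); // prints [x, a, b]
    System.out.println(insertAt(fields, 2, "x")); // prints [a, b, x]
  }
}
```

Whether VectorSchemaRoot should build its new list this way is a design choice 
for the actual patch. The sketch only shows that the relaxed bound 
index <= size() is enough to cover both inserting and appending.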
 

 

 

 

 

  was:
The current implementation of Java VectorSchemaRoot cannot add a vector at the 
end of the current list (which is the generally understood meaning of "add").

The Precondition check in the method's second line prevents providing an 
appropriate index for adding at the end:

 
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index < fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  for (int i = 0; i < fieldVectors.size(); i++) {
    if (i == index) {
      newVectors.add(vector);
    }
    newVectors.add(fieldVectors.get(i));
  }
  return new VectorSchemaRoot(newVectors);
}
 {code}
 

 

One possible implementation resolving the issue is shown below.

 
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index <= fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  if (index == fieldVectors.size()) {
    newVectors.addAll(fieldVectors);
    newVectors.add(vector);
  } else {
    for (int i = 0; i < fieldVectors.size(); i++) {
      if (i == index) {
        newVectors.add(vector);
      }
      newVectors.add(fieldVectors.get(i));
    }
  }
  return new VectorSchemaRoot(newVectors);
}
{code}
 

 

 

 

 


> [Java] VectorSchemaRoot#addVector() cannot add a vector to the end of the 
> current vector collection
> ---
>
> Key: ARROW-17530
> URL: https://issues.apache.org/jira/browse/ARROW-17530
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0, 9.0.1
>Reporter: Larry White
>Assignee: Larry White
>Priority: Major
>
> The current implementation of Java VectorSchemaRoot cannot add a vector at 
> the end of the current list (which is the generally understood meaning of 
> "add").
> The Precondition check in the method's second line prevents providing an 
> appropriate index for adding at the end:
> {code:java}
> public VectorSchemaRoot addVector(int index, FieldVector vector) {
>   Preconditions.checkNotNull(vector);
>   Preconditions.checkArgument(index >= 0 && index < fieldVectors.size());
>   List<FieldVector> newVectors = new ArrayList<>();
>   for (int i = 0; i < fieldVectors.size(); i++) {
>     if (i == index) {
>       newVectors.add(vector);
>     }
>     newVectors.add(fieldVectors.get(i));
>   }
>   return new VectorSchemaRoot(newVectors);
> }
>  {code}
> One possible implementation resolving the issue is shown below.
> {code:java}
> public VectorSchemaRoot addVector(int index, FieldVector vector) {
>   Preconditions.checkNotNull(vector);
>   Preconditions.checkArgument(index >= 0 && index <= fieldVectors.size());
>   List<FieldVector> newVectors = new ArrayList<>();
>   if (index == fieldVectors.size()) {
>     newVectors.addAll(fieldVectors);
>     newVectors.add(vector);
>   } else {
>     for (int i = 0; i < fieldVectors.size(); i++) {
>       if (i == index) {
>         newVectors.add(vector);
>       }
>       newVectors.add(fieldVectors.get(i));
>     }
>   }
>   return new VectorSchemaRoot(newVectors);
> }
> {code}
>  
>  
>  
>  
>  





[jira] [Updated] (ARROW-17530) [Java] VectorSchemaRoot#addVector() cannot add a vector to the end of the current vector collection

2022-08-25 Thread Larry White (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry White updated ARROW-17530:

Issue Type: Bug  (was: Improvement)

> [Java] VectorSchemaRoot#addVector() cannot add a vector to the end of the 
> current vector collection
> ---
>
> Key: ARROW-17530
> URL: https://issues.apache.org/jira/browse/ARROW-17530
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0, 9.0.1
>Reporter: Larry White
>Assignee: Larry White
>Priority: Major
>
> The current implementation of Java VectorSchemaRoot cannot add a vector at 
> the end of the current list (which is the generally understood meaning of 
> "add").
> The Precondition check in the method's second line prevents providing an 
> appropriate index for adding at the end:
>  
> {code:java}
> public VectorSchemaRoot addVector(int index, FieldVector vector) {
>   Preconditions.checkNotNull(vector);
>   Preconditions.checkArgument(index >= 0 && index < fieldVectors.size());
>   List<FieldVector> newVectors = new ArrayList<>();
>   for (int i = 0; i < fieldVectors.size(); i++) {
>     if (i == index) {
>       newVectors.add(vector);
>     }
>     newVectors.add(fieldVectors.get(i));
>   }
>   return new VectorSchemaRoot(newVectors);
> }
>  {code}
>  
>  
> One possible implementation resolving the issue is shown below.
>  
> {code:java}
> public VectorSchemaRoot addVector(int index, FieldVector vector) {
>   Preconditions.checkNotNull(vector);
>   Preconditions.checkArgument(index >= 0 && index <= fieldVectors.size());
>   List<FieldVector> newVectors = new ArrayList<>();
>   if (index == fieldVectors.size()) {
>     newVectors.addAll(fieldVectors);
>     newVectors.add(vector);
>   } else {
>     for (int i = 0; i < fieldVectors.size(); i++) {
>       if (i == index) {
>         newVectors.add(vector);
>       }
>       newVectors.add(fieldVectors.get(i));
>     }
>   }
>   return new VectorSchemaRoot(newVectors);
> }
> {code}
>  
>  
>  
>  
>  





[jira] [Updated] (ARROW-17262) [C++] Kernel input type matcher for RLE

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17262:
---
Labels: pull-request-available  (was: )

> [C++] Kernel input type matcher for RLE
> ---
>
> Key: ARROW-17262
> URL: https://issues.apache.org/jira/browse/ARROW-17262
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Builds on top of ARROW-17261





[jira] [Created] (ARROW-17531) /lib64/libm.so.6: version `GLIBC_2.27' not found

2022-08-25 Thread Net Zhang (Jira)
Net Zhang created ARROW-17531:
-

 Summary: /lib64/libm.so.6: version `GLIBC_2.27' not found
 Key: ARROW-17531
 URL: https://issues.apache.org/jira/browse/ARROW-17531
 Project: Apache Arrow
  Issue Type: Bug
 Environment: R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)
Reporter: Net Zhang


Hi, I've followed the [instructions 
|https://cran.r-project.org/web/packages/arrow/vignettes/install.html]to 
install the arrow R package on a Linux machine. 

```

options(
  HTTPUserAgent =
    sprintf(
      "R/%s R (%s)",
      getRversion(),
      paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
    )
)

install.packages("arrow", repos = 
"https://packagemanager.rstudio.com/all/__linux__/focal/latest")

```

The installation was successful, but when I loaded the library I received an 
error message indicating:

```

/lib64/libm.so.6: version `GLIBC_2.27' not found

```

Here's my full log, containing machine information

```

> HTTPUserAgent =
+     sprintf(
+         "R/%s R (%s)",
+         getRversion(),
+         paste(getRversion(), R.version["platform"], R.version["arch"], 
R.version["os"])
+     )
> HTTPUserAgent
[1] "R/4.0.2 R (4.0.2 x86_64-pc-linux-gnu x86_64 linux-gnu)"
> install.packages("arrow", repos = 
> "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Installing package into 
‘/users/PZS1008/netzhang/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 
'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_9.0.0.tar.gz'
Content type 'binary/octet-stream' length 34655538 bytes (33.1 MB)
==
downloaded 33.1 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
    ‘/tmp/RtmpUfdX4s/downloaded_packages’
> library(arrow)
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so':
  /lib64/libm.so.6: version `GLIBC_2.27' not found (required by 
/users/xx/R/x86_64-pc-linux-gnu-library/4.0/arrow/libs/arrow.so)
In addition: Warning message:
package ‘arrow’ was built under R version 4.0.5 

```





[jira] [Created] (ARROW-17530) [Java] VectorSchemaRoot#addVector() cannot add a vector to the end of the current vector collection

2022-08-25 Thread Larry White (Jira)
Larry White created ARROW-17530:
---

 Summary: [Java] VectorSchemaRoot#addVector() cannot add a vector 
to the end of the current vector collection
 Key: ARROW-17530
 URL: https://issues.apache.org/jira/browse/ARROW-17530
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 9.0.0, 9.0.1
Reporter: Larry White
Assignee: Larry White


The current implementation of Java VectorSchemaRoot cannot add a vector at the 
end of the current list (which is the generally understood meaning of "add").

The Precondition check in the method's second line prevents providing an 
appropriate index for adding at the end:

 
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index < fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  for (int i = 0; i < fieldVectors.size(); i++) {
    if (i == index) {
      newVectors.add(vector);
    }
    newVectors.add(fieldVectors.get(i));
  }
  return new VectorSchemaRoot(newVectors);
}
 {code}
 

 

One possible implementation resolving the issue is shown below.

 
{code:java}
public VectorSchemaRoot addVector(int index, FieldVector vector) {
  Preconditions.checkNotNull(vector);
  Preconditions.checkArgument(index >= 0 && index <= fieldVectors.size());
  List<FieldVector> newVectors = new ArrayList<>();
  if (index == fieldVectors.size()) {
    newVectors.addAll(fieldVectors);
    newVectors.add(vector);
  } else {
    for (int i = 0; i < fieldVectors.size(); i++) {
      if (i == index) {
        newVectors.add(vector);
      }
      newVectors.add(fieldVectors.get(i));
    }
  }
  return new VectorSchemaRoot(newVectors);
}
{code}
 

 

 

 

 





[jira] [Updated] (ARROW-17516) [C++] Concatenate implementation for RLE

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17516:
---
Labels: pull-request-available  (was: )

> [C++] Concatenate implementation for RLE
> 
>
> Key: ARROW-17516
> URL: https://issues.apache.org/jira/browse/ARROW-17516
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> builds on top of ARROW-17419





[jira] [Created] (ARROW-17529) Clean up how the CSV reader handles the first buffer

2022-08-25 Thread Ziheng Wang (Jira)
Ziheng Wang created ARROW-17529:
---

 Summary: Clean up how the CSV reader handles the first buffer
 Key: ARROW-17529
 URL: https://issues.apache.org/jira/browse/ARROW-17529
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Ziheng Wang
Assignee: Ziheng Wang


Currently, the way the CSV reader handles the first block of a CSV file is not 
great.

In fact, I think the first block is read multiple times: first in the Peek in 
file_csv.cc, and then in the InitFromBlock in the OpenReaderAsync in reader.cc.

This could be problematic if the first block is large, and it also delays the 
synchronous opening of a dataset.

A possible solution is to use a smaller block size for the peek in file_csv.cc, 
since you don't need to read the entire block to GetConvertOptions. So we could 
simply add another option in reader_options, called first_peek_size or 
something like that. 





[jira] [Created] (ARROW-17528) [R] Tidy up the pkgdown articles site

2022-08-25 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17528:


 Summary: [R] Tidy up the pkgdown articles site 
 Key: ARROW-17528
 URL: https://issues.apache.org/jira/browse/ARROW-17528
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


We could better organise the different articles we have to make it easier for 
users to find the right info





[jira] [Updated] (ARROW-17528) [R] Tidy up the pkgdown articles site index

2022-08-25 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17528:
-
Summary: [R] Tidy up the pkgdown articles site index   (was: [R] Tidy up 
the pkgdown articles site )

> [R] Tidy up the pkgdown articles site index 
> 
>
> Key: ARROW-17528
> URL: https://issues.apache.org/jira/browse/ARROW-17528
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> We could better organise the different articles we have to make it easier for 
> users to find the right info





[jira] [Updated] (ARROW-17258) [C++] Handling of array-only types using VisitTypeInline

2022-08-25 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-17258:
---
Summary: [C++] Handling of array-only types using VisitTypeInline  (was: 
[C++] Separate VisitTypeInline for types that can exist as a Scalar)

> [C++] Handling of array-only types using VisitTypeInline
> 
>
> Key: ARROW-17258
> URL: https://issues.apache.org/jira/browse/ARROW-17258
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17432) [R] messed up rows when importing large csv into parquet

2022-08-25 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584903#comment-17584903
 ] 

SHIMA Tatsuya commented on ARROW-17432:
---

Hi, how about passing the schema to the {{col_types}} argument?

{code:r}
csv_stream <- open_dataset(csv_file, format = "csv", 
   col_types = sch)
{code}

Or, using {{readr::read_csv()}}?

I also wonder if the number of rows in the dataset fetched is the same in all 
cases.
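
One way to check the row counts, sketched in Python with only the standard library. The function name `summarize_rows` and its parameters are hypothetical, and the column order (checklist ID first, species code second) is assumed from the awk output quoted below. It streams the file, so it should not need to hold the 56 GB CSV in memory.

```python
import csv
from collections import Counter


def summarize_rows(lines, first_column_value=None, has_header=True):
    """Count data rows and, for one checklist ID, count rows per species code.

    `lines` is any iterable of CSV text lines (an open file works).
    `first_column_value` is compared as a string, e.g. "18543372.0".
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row, if the file has one
    total = 0
    per_species = Counter()
    for row in reader:
        total += 1
        if first_column_value is not None and row and row[0] == first_column_value:
            per_species[row[1]] += 1  # second column assumed to be species_code
    return total, per_species
```

Called as `summarize_rows(open("obs.csv", newline=""), "18543372.0")`, the total can be compared with the row count of the Parquet dataset, and any species with a count above 1 is a duplicate already present in the CSV.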

> [R] messed up rows when importing large csv into parquet
> 
>
> Key: ARROW-17432
> URL: https://issues.apache.org/jira/browse/ARROW-17432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0
> Environment: R version 4.2.1
> Running in Arch Linux - EndeavourOS
> arrow_info()
> Arrow package version: 9.0.0
> Capabilities:
>
> datasetTRUE
> substrait FALSE
> parquetTRUE
> json   TRUE
> s3 TRUE
> gcsTRUE
> utf8proc   TRUE
> re2TRUE
> snappy TRUE
> gzip   TRUE
> brotli TRUE
> zstd   TRUE
> lz4TRUE
> lz4_frame  TRUE
> lzo   FALSE
> bz2TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>   
> Allocator jemalloc
> Current   49.31 Kb
> Max1.63 Mb
> Runtime:
> 
> SIMD Level  avx2
> Detected SIMD Level avx2
> Build:
>   
> C++ Library Version  9.0.0
> C++ Compiler   GNU
> C++ Compiler Version 7.5.0
> 
> print(pa.__version__)
> 9.0.0
>Reporter: Guillermo Duran
>Priority: Major
>
> This is a weird issue that creates new rows when importing a large csv (56 
> GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0 BUT 
> didn't occur with the Python Arrow library 9.0.0. Due to the large size of 
> the original csv it's difficult to create a reproducible example, but I share 
> the code and outputs.
> The code I use in R to import the csv:
> {code:java}
> library(arrow)
> library(dplyr)
>  
> csv_file <- "/ebird_erd2021/full/obs.csv"
> dest <- "/ebird_erd2021/full/obs_parquet/" 
> sch = arrow::schema(checklist_id = float32(),
>                     species_code = string(),
>                     exotic_category = float32(),
>                     obs_count = float32(),
>                     only_presence_reported = float32(),
>                     only_slash_reported = float32(),
>                     valid = float32(),
>                     reviewed = float32(),
>                     has_media = float32()
>                     )
> csv_stream <- open_dataset(csv_file, format = "csv", 
>                            schema = sch, skip_rows = 1)
> write_dataset(csv_stream, dest, format = "parquet", 
>               max_rows_per_file=100L,
>               hive_style = TRUE,
>               existing_data_behavior = "overwrite"){code}
> When I load the dataset and check one random _checklist_id_ I get rows that 
> are not part of the _obs.csv_ file. There shouldn't be duplicated species in 
> a checklist but there are (_amerob_ for example). Also note that the 
> duplicated species have different _obs_count_ values. There are 50 species in 
> total in that specific _checklist_id_.
> {code:java}
> parquet_arrow <- open_dataset(dest, format = "parquet")
> parquet_arrow |> 
>   filter(checklist_id == 18543372) |> 
>   arrange(species_code) |> 
>   collect() 
> # A tibble: 50 × 3
>checklist_id species_code obs_count
>
>  1 18543372 altori   3
>  2 18543372 amekes   1
>  3 18543372 amered  40
>  4 18543372 amerob  30
>  5 18543372 amerob   9
>  6 18543372 balori   9
>  7 18543372 blkter   9
>  8 18543372 blkvul  20
>  9 18543372 buggna   1
> 10 18543372 buwwar   1
> # … with 40 more rows
> # ℹ Use `print(n = ...)` to see more rows{code}
> If I use awk to query the csv file with that same checklist id, I get 
> something different:
> {code:java}
> $ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv
> 18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.0
> 

[jira] [Updated] (ARROW-17527) [Go] Implement Cast to Boolean Functions

2022-08-25 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-17527:
--
Summary: [Go] Implement Cast to Boolean Functions  (was: [Go] Implement 
Cast Functions)

> [Go] Implement Cast to Boolean Functions
> 
>
> Key: ARROW-17527
> URL: https://issues.apache.org/jira/browse/ARROW-17527
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>






[jira] [Commented] (ARROW-17526) [R] [Docs] Improve (or really actually document) our Python bridge documentation

2022-08-25 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584896#comment-17584896
 ] 

Nicola Crane commented on ARROW-17526:
--

[~jonkeane] Mind opening a cookbook issue too?  We can just swap out the 
scalar/array content for tables as it's a much more compelling use case.

> [R] [Docs] Improve (or really actually document) our Python bridge 
> documentation 
> -
>
> Key: ARROW-17526
> URL: https://issues.apache.org/jira/browse/ARROW-17526
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jonathan Keane
>Priority: Major
>
> https://twitter.com/jonkeane/status/1560016227824721920?s=20=g2MhdOOJbh0q0MpxPI4R_Q
> When I wrote this, I wished there was a one-pager I could point to that shows 
> passing a table or RecordBatchReader back and forth. 
> https://arrow.apache.org/cookbook/r/using-pyarrow-from-r.html#introduction-4 
> also has some details, but is more focused on scalars and arrays than tables. 





[jira] [Created] (ARROW-17527) [Go] Implement Cast Functions

2022-08-25 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17527:
-

 Summary: [Go] Implement Cast Functions
 Key: ARROW-17527
 URL: https://issues.apache.org/jira/browse/ARROW-17527
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol








[jira] [Resolved] (ARROW-17455) [Go] Implement Initial Function and Kernel architecture

2022-08-25 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17455.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13964
[https://github.com/apache/arrow/pull/13964]

> [Go] Implement Initial Function and Kernel architecture
> ---
>
> Key: ARROW-17455
> URL: https://issues.apache.org/jira/browse/ARROW-17455
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-17526) [R] [Docs] Improve (or really actually document) our Python bridge documentation

2022-08-25 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17526:
--

 Summary: [R] [Docs] Improve (or really actually document) our 
Python bridge documentation 
 Key: ARROW-17526
 URL: https://issues.apache.org/jira/browse/ARROW-17526
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Reporter: Jonathan Keane


https://twitter.com/jonkeane/status/1560016227824721920?s=20=g2MhdOOJbh0q0MpxPI4R_Q

When I wrote this, I wished there was a one-pager I could point to that shows 
passing a table or RecordBatchReader back and forth. 
https://arrow.apache.org/cookbook/r/using-pyarrow-from-r.html#introduction-4 
also has some details, but is more focused on scalars and arrays than tables. 





[jira] [Updated] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8

2022-08-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17458:
---
Fix Version/s: 10.0.0

> [C++] CSV Writer: Unsupported cast from decimal to utf8 
> 
>
> Key: ARROW-17458
> URL: https://issues.apache.org/jira/browse/ARROW-17458
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Pavel Kovalenko
>Priority: Critical
>  Labels: csv, decimal, good-first-issue, good-second-issue, 
> unsupported
> Fix For: 10.0.0
>
>
> The following code snippet fails with an Unsupported cast error if a table 
> has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll());
> std::shared_ptr<arrow::io::FileOutputStream> output = 
> arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
> SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}", 
> status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}
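
For reference, the text a decimal-to-string cast needs to produce can be sketched in a few lines of Python. This is only an illustration of the formatting rule (an unscaled integer plus a non-negative scale), not Arrow's implementation; the function name `format_decimal` is hypothetical.

```python
def format_decimal(unscaled: int, scale: int) -> str:
    """Render an unscaled integer with `scale` fractional digits as text.

    Uses plain integer and string operations, so the 38 digits that a
    decimal128 value can hold are never rounded.
    """
    if scale < 0:
        raise ValueError("this sketch only handles non-negative scales")
    sign = "-" if unscaled < 0 else ""
    # Left-pad so there is always at least one digit before the point.
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    if scale == 0:
        return sign + digits
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"
```

For a decimal128(7, 2) column, a stored value of 12345 would be written to the CSV as `123.45` under this rule.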





[jira] [Updated] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8

2022-08-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17458:
---
Priority: Critical  (was: Major)

> [C++] CSV Writer: Unsupported cast from decimal to utf8 
> 
>
> Key: ARROW-17458
> URL: https://issues.apache.org/jira/browse/ARROW-17458
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Pavel Kovalenko
>Priority: Critical
>  Labels: csv, decimal, good-first-issue, good-second-issue, 
> unsupported
>
> The following code snippet fails with an Unsupported cast error if a table 
> has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll());
> std::shared_ptr<arrow::io::FileOutputStream> output = 
> arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
> SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}", 
> status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}





[jira] [Updated] (ARROW-17525) [Java] Read ORC files using org.apache.arrow.dataset.jni.NativeDatasetFactory

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17525:
---
Labels: pull-request-available  (was: )

> [Java] Read ORC files using org.apache.arrow.dataset.jni.NativeDatasetFactory 
> --
>
> Key: ARROW-17525
> URL: https://issues.apache.org/jira/browse/ARROW-17525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: Igor Suhorukov
>Assignee: Igor Suhorukov
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Support the ORC file format in the Java Dataset API





[jira] [Created] (ARROW-17525) [Java] Read ORC files using org.apache.arrow.dataset.jni.NativeDatasetFactory

2022-08-25 Thread Igor Suhorukov (Jira)
Igor Suhorukov created ARROW-17525:
--

 Summary: [Java] Read ORC files using 
org.apache.arrow.dataset.jni.NativeDatasetFactory 
 Key: ARROW-17525
 URL: https://issues.apache.org/jira/browse/ARROW-17525
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 9.0.0
Reporter: Igor Suhorukov
Assignee: Igor Suhorukov


Support the ORC file format in the Java Dataset API





[jira] [Commented] (ARROW-17519) [R] RTools35 job is failing

2022-08-25 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584876#comment-17584876
 ] 

Dewey Dunnington commented on ARROW-17519:
--

Sure! https://lists.apache.org/thread/h9v83rwdl015z2j6s8zwdr1qp4svb5j8

> [R] RTools35 job is failing
> ---
>
> Key: ARROW-17519
> URL: https://issues.apache.org/jira/browse/ARROW-17519
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>
> After ARROW-17436, the RTools35 job is consistently failing with:
> {noformat}
> Error: Error: package or namespace load failed for 'arrow' in inDL(x, 
> as.logical(local), as.logical(now), ...):
>  unable to load shared object 
> 'D:/a/arrow/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/i386/arrow.dll':
>   LoadLibrary failure:  A dynamic link library (DLL) initialization routine 
> failed.
> {noformat}
> Given that there is a mailing list discussion about dropping support for that 
> platform, should we disable the check? Or wait until that is resolved to 
> disable the check?





[jira] [Updated] (ARROW-17524) The ORC reader method ReadStripe does not work when we specify fields to selected as a list of integers

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17524:
---
Labels: pull-request-available  (was: )

> The ORC reader method ReadStripe does not work when we specify fields to 
> selected as a list of integers
> ---
>
> Key: ARROW-17524
> URL: https://issues.apache.org/jira/browse/ARROW-17524
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.1
>Reporter: Louis Calot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I think there is a bug in the ORC reader: when we specify the field indexes 
> that we want to keep, it does not work correctly. Looking at the code, it 
> seems to be because we call "includeTypes" in lieu of "include" when setting 
> the ORC options.
> This can be problematic when we want to import an ORC table containing Union 
> types, as it will raise an error on import even if we try not to import 
> these specific fields.
> The definitions of the corresponding ORC methods are here :
> [https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191]
> and
> [https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207]





[jira] [Created] (ARROW-17524) The ORC reader method ReadStripe does not work when we specify fields to selected as a list of integers

2022-08-25 Thread Louis Calot (Jira)
Louis Calot created ARROW-17524:
---

 Summary: The ORC reader method ReadStripe does not work when we 
specify fields to selected as a list of integers
 Key: ARROW-17524
 URL: https://issues.apache.org/jira/browse/ARROW-17524
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.1
Reporter: Louis Calot


I think there is a bug in the ORC reader: when we specify the field indexes 
that we want to keep, it does not work correctly. Looking at the code, it seems 
to be because we call "includeTypes" in lieu of "include" when setting the ORC 
options.
This can be problematic when we want to import an ORC table containing Union 
types, as it will raise an error on import even if we try not to import these 
specific fields.

The definitions of the corresponding ORC methods are here :
[https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191]

and
[https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207]
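
To illustrate why the two options are not interchangeable, here is a small Python sketch. It assumes, as the linked ORC header comments describe, that "include" takes top-level field numbers starting at 0, while "includeTypes" takes type IDs where the root is 0 and the remaining types are numbered in a pre-order traversal. The `Node` class, the `assign_type_ids` helper, and the example schema are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def assign_type_ids(root: Node) -> Dict[str, int]:
    """Number every type in pre-order, with the root struct as 0."""
    ids: Dict[str, int] = {}
    counter = 0

    def visit(node: Node, path: str) -> None:
        nonlocal counter
        ids[path] = counter
        counter += 1
        for child in node.children:
            visit(child, f"{path}.{child.name}" if path else child.name)

    visit(root, "")
    return ids


# Hypothetical schema: struct<a: struct<x, y>, b: string, c: union>
schema = Node("root", [Node("a", [Node("x"), Node("y")]), Node("b"), Node("c")])
type_ids = assign_type_ids(schema)
field_indexes = {child.name: i for i, child in enumerate(schema.children)}

print(field_indexes["b"], type_ids["b"])  # field index and type ID of "b"
```

In this schema the top-level field "b" has field index 1 but type ID 4, while type ID 1 belongs to "a". Passing field indexes where type IDs are expected would therefore select different columns as soon as an earlier field is nested.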





[jira] [Updated] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8

2022-08-25 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-17458:
-
Labels: csv decimal good-first-issue good-second-issue unsupported  (was: 
csv decimal unsupported)

> [C++] CSV Writer: Unsupported cast from decimal to utf8 
> 
>
> Key: ARROW-17458
> URL: https://issues.apache.org/jira/browse/ARROW-17458
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Pavel Kovalenko
>Priority: Major
>  Labels: csv, decimal, good-first-issue, good-second-issue, 
> unsupported
>
> The following code snippet fails with an Unsupported cast error if a table 
> has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll());
> std::shared_ptr<arrow::io::FileOutputStream> output = 
> arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
> SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}", 
> status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}





[jira] [Commented] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8

2022-08-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584853#comment-17584853
 ] 

Jonathan Keane commented on ARROW-17458:


We ran into this issue today as well, working on conversions for benchmarking 
datasets

> [C++] CSV Writer: Unsupported cast from decimal to utf8 
> 
>
> Key: ARROW-17458
> URL: https://issues.apache.org/jira/browse/ARROW-17458
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Pavel Kovalenko
>Priority: Major
>  Labels: csv, decimal, unsupported
>
> The following code snippet fails with an Unsupported cast error if a table 
> has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll());
> std::shared_ptr<arrow::io::FileOutputStream> output = 
> arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
> SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}", 
> status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}





[jira] [Updated] (ARROW-17262) [C++] Kernel input type matcher for RLE

2022-08-25 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-17262:
---
Description: Builds on top of ARROW-17261

> [C++] Kernel input type matcher for RLE
> ---
>
> Key: ARROW-17262
> URL: https://issues.apache.org/jira/browse/ARROW-17262
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>
> Builds on top of ARROW-17261





[jira] [Updated] (ARROW-17263) [C++] Utility functions for working with RLE

2022-08-25 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-17263:
---
Description: based on top of ARROW-17261

> [C++] Utility functions for working with RLE
> 
>
> Key: ARROW-17263
> URL: https://issues.apache.org/jira/browse/ARROW-17263
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> based on top of ARROW-17261





[jira] [Updated] (ARROW-17261) [C++] Add type ID, Type and Array classes for RLE

2022-08-25 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni updated ARROW-17261:
---
Description: 
Based on top of ARROW-17258

Mostly picking these parts from ARROW-16772 and ARROW-16781 to create an easier 
order to merge things

  was:Mostly picking these parts from ARROW-16772 and ARROW-16781 to create an 
easier order to merge things


> [C++] Add type ID, Type and Array classes for RLE
> -
>
> Key: ARROW-17261
> URL: https://issues.apache.org/jira/browse/ARROW-17261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Based on top of ARROW-17258
> Mostly picking these parts from ARROW-16772 and ARROW-16781 to create an 
> easier order to merge things





[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-25 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584808#comment-17584808
 ] 

Arthur Passos commented on ARROW-17459:
---

I am also trying to write a test to cover this case, but failing to do so. For 
some reason, the files I generate with the very same schema and size don't get 
chunked when they are read. The original file was provided by a customer and 
contains confidential data, so it can't be used.

 

All the files I generated use the above-mentioned schema. The differences 
are in the data length. Some had maps of 50~300 elements with keys of random 
strings of 20~50 characters and values of random strings of 50~5000 characters. 
I also tried a low-cardinality example and a large-string example (2^30 
characters).

 

I'd be very thankful if someone could give me some tips on how to generate a 
file that will trigger the exception.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958





[jira] [Commented] (ARROW-17327) [Python] Parquet should be listed in PyArrow's get_libraries() function

2022-08-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584795#comment-17584795
 ] 

Antoine Pitrou commented on ARROW-17327:


[~willjones127] {{get_libraries()}} is tested in {{test_cython.py}}.

I wonder what is different here that requires adding {{parquet}} while the 
tests generally run fine.

> [Python] Parquet should be listed in PyArrow's get_libraries() function
> ---
>
> Key: ARROW-17327
> URL: https://issues.apache.org/jira/browse/ARROW-17327
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Steven Silvester
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We are updating {{PyMongoArrow}} to use PyArrow 8.0, and saw the following 
> [failure| 
> https://github.com/mongodb-labs/mongo-arrow/runs/7696619223?check_suite_focus=true]
>  when building wheels:  "@rpath/libparquet.800.dylib not found".
> We overcame the error by explicitly adding "parquet" to the list of libraries 
> returned by {{get_libraries}}.  I am happy to submit a PR.
>  
>  





[jira] [Created] (ARROW-17523) [C++] Support more substrait function

2022-08-25 Thread Jin Chengcheng (Jira)
Jin Chengcheng created ARROW-17523:
--

 Summary: [C++] Support more substrait function
 Key: ARROW-17523
 URL: https://issues.apache.org/jira/browse/ARROW-17523
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 10.0.0
Reporter: Jin Chengcheng
Assignee: Jin Chengcheng


Support the is_null, is_not_null, and count functions





[jira] [Resolved] (ARROW-17518) [CI][Docs][Python] Development version is not correctly detected from git

2022-08-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17518.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13966
[https://github.com/apache/arrow/pull/13966]

> [CI][Docs][Python] Development version is not correctly detected from git
> -
>
> Key: ARROW-17518
> URL: https://issues.apache.org/jira/browse/ARROW-17518
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Documentation, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
> Attachments: image-2022-08-24-18-32-00-888.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The current glob used on our git commands to detect the development version 
> is not correct and can be seen on the published docs:
> !image-2022-08-24-18-32-00-888.png!
> Reproduced on bash:
> {code:java}
> $ git describe --dirty --tags --long
> apache-arrow-10.0.0.dev-113-g28b81ec-dirty
> $ git describe --dirty --tags --long --match "apache-arrow-[0-9].*"
> apache-arrow-9.0.0.dev-640-g28b81ec-dirty {code}
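
The mismatch can be reproduced outside git with shell-style globbing. The sketch below uses Python's `fnmatch` as a stand-in for git's pattern matching, which is an assumption that holds for this simple pattern; the second pattern is a hypothetical alternative for illustration and is not taken from the pull request.

```python
from fnmatch import fnmatchcase

# Tag names implied by the two `git describe` outputs above.
TAGS = ["apache-arrow-9.0.0.dev", "apache-arrow-10.0.0.dev"]


def matching_tags(pattern, tags=TAGS):
    """Return the tags that a shell-style glob selects."""
    return [tag for tag in tags if fnmatchcase(tag, pattern)]


# Pattern from the report: exactly one digit, then a literal dot.
print(matching_tags("apache-arrow-[0-9].*"))
# Hypothetical alternative: a digit, any characters, then a dot.
print(matching_tags("apache-arrow-[0-9]*.*"))
```

The first pattern requires a dot immediately after a single digit, so a two-digit major version such as 10 never matches and git falls back to the older 9.0.0.dev tag.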





[jira] [Resolved] (ARROW-15277) [Python] Use Make to create ChunkedArray and remove checks

2022-08-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-15277.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13950
[https://github.com/apache/arrow/pull/13950]

> [Python] Use Make to create ChunkedArray and remove checks
> --
>
> Key: ARROW-15277
> URL: https://issues.apache.org/jira/browse/ARROW-15277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Eduardo Ponce
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> In PyArrow, the {{ChunkedArray}} constructor function validates the input 
> {{Arrays}} for an omitted type and for matching types, but these checks are 
> already made in the underlying C++ via {{ChunkedArray::Make}}. We need to 
> expose {{Make()}} in order to use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17495) [R] arrow_eval: do we need both nse_funcs and .cache$functions?

2022-08-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584703#comment-17584703
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17495:
--

I can look into it (maybe after I finish with the capstone).

As far as I can tell, both {{.cache$functions}} and {{nse_funcs}} are created 
at load time. {{.cache$functions}} is {{nse_funcs}} plus the Arrow Compute 
functions (prefixed with {{arrow_}}). 
I bumped into this while trying to register / translate a user-defined function 
with pre-existing bindings. I needed to update either {{.cache$functions}} or 
{{nse_funcs}}. We can update the former via {{update_cache = TRUE}}, but 
then I had to change {{call_binding()}} to fetch from the updated {{.cache}} and 
not from {{nse_funcs}}. This led me to think that folks might be confused 
by these 2 objects, which overlap by quite a bit and are in some situations 
interchangeable (mostly, {{nse_funcs}} can be replaced by {{.cache$functions}}, 
which includes it). 
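
The overlap can be pictured with a small Python analogy. This is not the R package's code: the names are borrowed from the description above, the registries are plain dictionaries, and the example bindings are made up.

```python
def build_cache(nse_funcs, compute_funcs):
    """Build the superset registry: all bindings plus arrow_-prefixed compute functions."""
    cache = dict(nse_funcs)
    cache.update({f"arrow_{name}": fn for name, fn in compute_funcs.items()})
    return cache


def call_binding(name, registry, *args):
    """Look up a binding by name in the given registry and call it."""
    return registry[name](*args)


nse_funcs = {"nchar": len}                 # hypothetical binding
compute_funcs = {"utf8_upper": str.upper}  # hypothetical compute function
cache = build_cache(nse_funcs, compute_funcs)

# A user-defined override registered only in the cache ...
cache["nchar"] = lambda text: len(text.strip())

# ... is invisible to a lookup that still reads from nse_funcs.
print(call_binding("nchar", nse_funcs, " ab "))
print(call_binding("nchar", cache, " ab "))
```

With two registries, every lookup has to pick the right one after an update; with a single registry that question does not arise.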

> [R] arrow_eval: do we need both nse_funcs and .cache$functions?
> ---
>
> Key: ARROW-17495
> URL: https://issues.apache.org/jira/browse/ARROW-17495
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Minor
>
> Currently we have 2 copies of the same information, once in {{nse_funcs}} and 
> once in {{.cache$functions}}. I wasn't able to figure out the reason for 
> this. Maybe I am missing something or maybe this is just legacy code that we 
> can update.   





[jira] [Resolved] (ARROW-17433) [C++] AppVeyor build fails due to Boost/S3

2022-08-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17433.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13903
[https://github.com/apache/arrow/pull/13903]

> [C++] AppVeyor build fails due to Boost/S3
> --
>
> Key: ARROW-17433
> URL: https://issues.apache.org/jira/browse/ARROW-17433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Observed on master
> {noformat}
> [182/351] Building CXX object 
> src\arrow\filesystem\CMakeFiles\arrow-s3fs-test.dir\Unity\unity_0_cxx.cxx.obj
> FAILED: 
> src/arrow/filesystem/CMakeFiles/arrow-s3fs-test.dir/Unity/unity_0_cxx.cxx.obj 
> C:\Miniconda37-x64\Scripts\clcache.exe  /nologo /TP -DARROW_HAVE_RUNTIME_AVX2 
> -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 
> -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_HDFS -DARROW_MIMALLOC 
> -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
> -DARROW_WITH_SNAPPY -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD 
> -DAWS_CAL_USE_IMPORT_EXPORT -DAWS_CHECKSUMS_USE_IMPORT_EXPORT 
> -DAWS_COMMON_USE_IMPORT_EXPORT -DAWS_EVENT_STREAM_USE_IMPORT_EXPORT 
> -DAWS_IO_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 
> -DAWS_SDK_VERSION_MINOR=8 -DAWS_SDK_VERSION_PATCH=186 
> -DAWS_USE_IO_COMPLETION_PORTS -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
> -DBOOST_ATOMIC_DYN_LINK -DBOOST_ATOMIC_NO_LIB -DBOOST_FILESYSTEM_DYN_LINK 
> -DBOOST_FILESYSTEM_NO_LIB -DBOOST_SYSTEM_DYN_LINK -DBOOST_SYSTEM_NO_LIB 
> -DPROTOBUF_USE_DLLS -DURI_STATIC_BUILD -DUSE_IMPORT_EXPORT 
> -DUSE_IMPORT_EXPORT=1 -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS 
> -D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\projects\arrow\cpp\build\src 
> -IC:\projects\arrow\cpp\src -IC:\projects\arrow\cpp\src\generated 
> -IC:\projects\arrow\cpp\thirdparty\flatbuffers\include 
> -IC:\Miniconda37-x64\envs\arrow\Library\include 
> -IC:\projects\arrow\cpp\thirdparty\hadoop\include 
> -IC:\projects\arrow\cpp\build\mimalloc_ep\src\mimalloc_ep\include\mimalloc-2.0
>  /DWIN32 /D_WINDOWS  /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   
> /EHsc /wd5105 /bigobj /utf-8 /W3 /wd4800 /wd4996 /wd4065  /WX /MP /MD /Od 
> /UNDEBUG /showIncludes 
> /Fosrc\arrow\filesystem\CMakeFiles\arrow-s3fs-test.dir\Unity\unity_0_cxx.cxx.obj
>  /Fdsrc\arrow\filesystem\CMakeFiles\arrow-s3fs-test.dir\ /FS -c 
> C:\projects\arrow\cpp\build\src\arrow\filesystem\CMakeFiles\arrow-s3fs-test.dir\Unity\unity_0_cxx.cxx
> Please define _WIN32_WINNT or _WIN32_WINDOWS appropriately. For example:
> - add -D_WIN32_WINNT=0x0601 to the compiler command line; or
> - add _WIN32_WINNT=0x0601 to your project's Preprocessor Definitions.
> Assuming _WIN32_WINNT=0x0601 (i.e. Windows 7 target).
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/environment.hpp(266):
>  error C2220: warning treated as error - no 'object' file generated
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/environment.hpp(261):
>  note: while compiling class template member function 
> 'boost::iterators::transform_iterator>,Char
>  
> **,boost::process::detail::entry>,boost::process::detail::entry>>
>  
> boost::process::basic_environment_impl::find(const
>  std::basic_string,std::allocator> &)'
> with
> [
> Char=char
> ]
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/environment.hpp(361):
>  note: see reference to function template instantiation 
> 'boost::iterators::transform_iterator>,Char
>  
> **,boost::process::detail::entry>,boost::process::detail::entry>>
>  
> boost::process::basic_environment_impl::find(const
>  std::basic_string,std::allocator> &)' 
> being compiled
> with
> [
> Char=char
> ]
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/environment.hpp(632):
>  note: see reference to class template instantiation 
> 'boost::process::basic_environment_impl'
>  being compiled
> with
> [
> Char=char
> ]
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/env.hpp(176): 
> note: see reference to class template instantiation 
> 'boost::process::basic_environment' being compiled
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/process/env.hpp(183): 
> note: see reference to class template instantiation 
> 'boost::process::detail::env_init' being compiled
> C:\Miniconda37-x64\envs\arrow\Library\include\boost/asio/execution/relationship.hpp(595):
>  note: see reference to class template instantiation 
>