[jira] [Commented] (ARROW-17088) [R] Use `.arrow` as extension of IPC files of datasets

2022-07-20 Thread Kazuyuki Ura (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569276#comment-17569276
 ] 

Kazuyuki Ura commented on ARROW-17088:
--

I'll work on this.

> [R] Use `.arrow` as extension of IPC files of datasets
> --
>
> Key: ARROW-17088
> URL: https://issues.apache.org/jira/browse/ARROW-17088
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Related to ARROW-17072
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the {{write_dataset}} 
> function, the default extension is {{.feather}} when {{feather}} is selected 
> as the format, and {{.ipc}} when {{ipc}} is selected.
> https://github.com/apache/arrow/blob/f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8/r/R/dataset-write.R#L124-L126
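The extension change being proposed can be sketched as follows (a toy illustration only — the real logic lives in r/R/dataset-write.R; the dictionaries and function name here are hypothetical, not Arrow code):

```python
# Toy sketch of the default-extension choice described above; not Arrow code.
# Current behaviour: each IPC format name gets its own extension.
CURRENT_DEFAULTS = {"feather": ".feather", "ipc": ".ipc", "parquet": ".parquet"}
# Proposed behaviour: IPC-family formats default to the recommended ".arrow".
PROPOSED_DEFAULTS = {"feather": ".arrow", "ipc": ".arrow", "parquet": ".parquet"}

def default_extension(fmt: str, defaults: dict) -> str:
    """Return the extension a dataset writer would append for a format."""
    return defaults[fmt]

print(default_extension("ipc", CURRENT_DEFAULTS))   # .ipc
print(default_extension("ipc", PROPOSED_DEFAULTS))  # .arrow
```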



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16389) [C++] Support hash-join on larger than memory datasets

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16389:
---
Labels: pull-request-available  (was: )

> [C++] Support hash-join on larger than memory datasets
> --
>
> Key: ARROW-16389
> URL: https://issues.apache.org/jira/browse/ARROW-16389
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of the hash-join node queues in memory the hash 
> table, the entire build side input, and the entire probe side input (i.e. 
> the entire dataset).  This means it will run out of memory and crash if the 
> input dataset is larger than the memory on the system.
> By spilling to disk when memory starts to fill up we can allow the hash-join 
> node to process datasets larger than the available memory on the machine.
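For intuition, the spill-to-disk strategy described above can be sketched in plain Python as a classic Grace hash join over lists of dict rows — purely illustrative, not Arrow's C++ implementation:

```python
import os
import pickle
import tempfile
from collections import defaultdict

def grace_hash_join(build_rows, probe_rows, key, n_partitions=8):
    """Sketch of a Grace hash join: hash-partition both inputs to disk first,
    then join one partition at a time, so only ~1/n_partitions of the build
    side needs to be resident in memory at once."""
    tmpdir = tempfile.mkdtemp()

    def spill(rows, side):
        files = [open(os.path.join(tmpdir, f"{side}-{i}"), "wb")
                 for i in range(n_partitions)]
        for row in rows:
            pickle.dump(row, files[hash(row[key]) % n_partitions])
        for f in files:
            f.close()

    def load(side, i):
        with open(os.path.join(tmpdir, f"{side}-{i}"), "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    spill(build_rows, "build")
    spill(probe_rows, "probe")
    for i in range(n_partitions):
        # Only this partition's build-side hash table is held in memory.
        table = defaultdict(list)
        for row in load("build", i):
            table[row[key]].append(row)
        for row in load("probe", i):
            for match in table[row[key]]:
                yield {**match, **row}
```

Matching keys always land in the same partition on both sides, which is what makes the partition-at-a-time join correct.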



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17001) [Release][R] Use apache artifactory for libarrow binaries.

2022-07-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17001.
--
Resolution: Fixed

Issue resolved by pull request 13622
[https://github.com/apache/arrow/pull/13622]

> [Release][R] Use apache artifactory for libarrow binaries.
> --
>
> Key: ARROW-17001
> URL: https://issues.apache.org/jira/browse/ARROW-17001
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Move the pre-built libarrow binaries for the release version from nightlies 
> to the apache artifactory and add this to the release process/scripts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17151.
--
Resolution: Fixed

Issue resolved by pull request 13663
[https://github.com/apache/arrow/pull/13663]

> [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
> 
>
> Key: ARROW-17151
> URL: https://issues.apache.org/jira/browse/ARROW-17151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> pydata-sphinx-theme introduced automatic dark mode. However there is a series 
> of changes we need to do (such as providing a dark-mode Arrow logo) before we 
> will be ready for this (see ARROW-17152). For the 9.0.0 release, we should 
> instead pin to the version of pydata-sphinx-theme just before that release.
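Assuming the docs build installs its theme from a requirements file, the pin described above might look like this (file path illustrative):

```text
# docs/requirements.txt (illustrative): stay on the 0.8 series,
# before the 0.9.0 release that introduces dark mode
pydata-sphinx-theme>=0.8,<0.9
```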



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17153) [CI][Homebrew] Require glib-utils

2022-07-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17153.
--
Resolution: Fixed

Issue resolved by pull request 13666
[https://github.com/apache/arrow/pull/13666]

> [CI][Homebrew] Require glib-utils
> -
>
> Key: ARROW-17153
> URL: https://issues.apache.org/jira/browse/ARROW-17153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17154) [C++] Change cmake project name from arrow_python to pyarrow_cpp

2022-07-20 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-17154:
---

 Summary: [C++] Change cmake project name from arrow_python to 
pyarrow_cpp
 Key: ARROW-17154
 URL: https://issues.apache.org/jira/browse/ARROW-17154
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 10.0.0


See discussion https://github.com/apache/arrow/pull/13311#discussion_r926198302



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17153) [CI][Homebrew] Require glib-utils

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17153:
---
Labels: pull-request-available  (was: )

> [CI][Homebrew] Require glib-utils
> -
>
> Key: ARROW-17153
> URL: https://issues.apache.org/jira/browse/ARROW-17153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17153) [CI][Homebrew] Require glib-utils

2022-07-20 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17153:


 Summary: [CI][Homebrew] Require glib-utils
 Key: ARROW-17153
 URL: https://issues.apache.org/jira/browse/ARROW-17153
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 9.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17100) [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17100:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written 
> prior to 3.0.0 per ARROW-10353
> --
>
> Key: ARROW-17100
> URL: https://issues.apache.org/jira/browse/ARROW-17100
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Wes McKinney
>Assignee: Will Jones
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As described in 
> https://lists.apache.org/thread/xkrhgfpk9sr1mj74d4chz3r5yp3szt6c, 
> https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69
> caused some files written prior to 3.0.0 to be unreadable. Given that the 
> patch was small, this will hopefully not be too difficult to fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17125) Unable to install pyarrow on Debian 10 (i686)

2022-07-20 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17125.
--
Resolution: Won't Fix

I mark this issue as resolved because the wheels can be used in a 64-bit environment.

> Unable to install pyarrow on Debian 10 (i686)
> -
>
> Key: ARROW-17125
> URL: https://issues.apache.org/jira/browse/ARROW-17125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 7.0.1, 8.0.1
> Environment: Debian GNU/Linux 10 (buster)
> Python 3.9.7
> pip 22.1.2 
> cmake 3.22.5
> $ lscpu
> Architecture:        i686
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       45 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  1
> Socket(s):           4
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               45
> Model name:          Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Stepping:            7
> CPU MHz:             1995.000
> BogoMIPS:            3990.00
> Hypervisor vendor:   VMware
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            20480K
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss nx rdtscp lm constant_tsc 
> arch_perfmon xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 
> cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor 
> lahf_lm pti ssbd ibrs ibpb stibp tsc_adjust arat md_clear flush_l1d 
> arch_capabilities  
>Reporter: Rustam Guliev
>Priority: Major
>
> Hi,
> I am not able to install pyarrow on Debian 10. First, the installation (via 
> `pip` or `poetry install`) fails with the following:
>  
> {code:java}
>   EnvCommandError  Command 
> ['/home/rustam/.cache/pypoetry/virtualenvs/spectra-annotator-Vr_f9e53-py3.9/bin/pip',
>  'install', '--no-deps', 
> 'file:///home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz']
>  errored with the following return code 1, and output:
>   Processing 
> /home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz
>     Installing build dependencies: started
>     Installing build dependencies: finished with status 'done'
>     Getting requirements to build wheel: started
>     Getting requirements to build wheel: finished with status 'done'
>     Preparing metadata (pyproject.toml): started
>     Preparing metadata (pyproject.toml): finished with status 'done'
>   Building wheels for collected packages: pyarrow
>     Building wheel for pyarrow (pyproject.toml): started
>     Building wheel for pyarrow (pyproject.toml): finished with status 'error'
>     error: subprocess-exited-with-error    × Building wheel for pyarrow 
> (pyproject.toml) did not run successfully.
>     │ exit code: 1
>     ╰─> [261 lines of output]
>         running bdist_wheel
>         running build
>         running build_py
>         running egg_info
>         writing pyarrow.egg-info/PKG-INFO
>         writing dependency_links to pyarrow.egg-info/dependency_links.txt
>         writing entry points to pyarrow.egg-info/entry_points.txt
>         writing requirements to pyarrow.egg-info/requires.txt
>         writing top-level names to pyarrow.egg-info/top_level.txt
>         listing git files failed - pretending there aren't any
>         reading manifest file 'pyarrow.egg-info/SOURCES.txt'
>         reading manifest template 'MANIFEST.in'
>         warning: no files found matching '../LICENSE.txt'
>         warning: no files found matching '../NOTICE.txt'
>         warning: no previously-included files matching '*.so' found anywhere 
> in distribution
>         warning: no previously-included files matching '*.pyc' found anywhere 
> in distribution
>         warning: no previously-included files matching '*~' found anywhere in 
> distribution
>         warning: no previously-included files matching '#*' found anywhere in 
> distribution
>         warning: no previously-included files matching '.git*' found anywhere 
> in distribution
>         warning: no previously-included files matching '.DS_Store' found 
> anywhere in distribution
>         no previously-included directories found matching '.asv'
>         
> /tmp/pip-build-env-umvxn44o/overlay/lib/python3.9/site-packages/setuptools/command/build_py.py:153:
>  SetuptoolsDeprecationWarning:     Installing 'pyarrow.includes' as data is 
> deprecated, please list it in `packages`.
>             !!
>             
>             # Package would be 

[jira] [Assigned] (ARROW-17100) [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353

2022-07-20 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-17100:
--

Assignee: Will Jones

> [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written 
> prior to 3.0.0 per ARROW-10353
> --
>
> Key: ARROW-17100
> URL: https://issues.apache.org/jira/browse/ARROW-17100
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Wes McKinney
>Assignee: Will Jones
>Priority: Blocker
> Fix For: 9.0.0
>
>
> As described in 
> https://lists.apache.org/thread/xkrhgfpk9sr1mj74d4chz3r5yp3szt6c, 
> https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69
> caused some files written prior to 3.0.0 to be unreadable. Given that the 
> patch was small, this will hopefully not be too difficult to fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15026) [Python] datetime.timedelta to pyarrow.duration('us') silently overflows

2022-07-20 Thread Anja Boskovic (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anja Boskovic reassigned ARROW-15026:
-

Assignee: Anja Boskovic

> [Python] datetime.timedelta to pyarrow.duration('us') silently overflows
> 
>
> Key: ARROW-15026
> URL: https://issues.apache.org/jira/browse/ARROW-15026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Andreas Rappold
>Assignee: Anja Boskovic
>Priority: Major
>
>  
> Hi! This reproduces the issue:
> {code:java}
> # python 3.9.9
> # pyarrow 6.0.1
> import datetime
> import pyarrow
> d = datetime.timedelta(days=-106751992, seconds=71945, microseconds=224192)
> pyarrow.scalar(d)
> # <pyarrow.DurationScalar: datetime.timedelta(days=-106751992, seconds=71945, 
> # microseconds=224192)>
> pyarrow.scalar(d).as_py() == d
> # True
> d2 = d - datetime.timedelta(microseconds=1)
> pyarrow.scalar(d2)
> # <pyarrow.DurationScalar: datetime.timedelta(days=106751991, seconds=14454, 
> # microseconds=775807)>
> pyarrow.scalar(d2).as_py() == d2
> # False{code}
> Other conversions (e.g. to int*) raise an exception instead. I didn't check 
> whether duration also overflows for too-large timedeltas. If it's easy to fix, 
> point me in the right direction and I'll try to create a PR. Thanks!
>  
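A bounds check of the kind pyarrow could perform is easy to express with the standard library alone. This is a sketch under the assumption that duration('us') is stored as a signed 64-bit microsecond count; the function name is hypothetical:

```python
from datetime import timedelta

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def timedelta_to_us_checked(td: timedelta) -> int:
    """Convert a timedelta to integer microseconds, raising on int64 overflow
    instead of silently wrapping (the behaviour reported above)."""
    us = (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds
    if not INT64_MIN <= us <= INT64_MAX:
        raise OverflowError(f"{td!r} does not fit in 64-bit microseconds")
    return us

d = timedelta(days=-106751992, seconds=71945, microseconds=224192)
print(timedelta_to_us_checked(d))  # -9223372036854775808, exactly INT64_MIN
# One microsecond less no longer fits and should raise OverflowError:
# timedelta_to_us_checked(d - timedelta(microseconds=1))
```

The reported `d` is exactly the smallest timedelta representable in 64-bit microseconds, which is why subtracting one microsecond wraps.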



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17151:
---
Labels: pull-request-available  (was: )

> [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
> 
>
> Key: ARROW-17151
> URL: https://issues.apache.org/jira/browse/ARROW-17151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> pydata-sphinx-theme introduced automatic dark mode. However there is a series 
> of changes we need to do (such as providing a dark-mode Arrow logo) before we 
> will be ready for this (see ARROW-17152). For the 9.0.0 release, we should 
> instead pin to the version of pydata-sphinx-theme just before that release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17151:
---
Description: pydata-sphinx-theme introduced automatic dark mode. However 
there is a series of changes we need to do (such as providing a dark-mode Arrow 
logo) before we will be ready for this (see ARROW-17152). For the 9.0.0 
release, we should instead pin to the version of pydata-sphinx-theme just 
before that release.  (was: pydata-sphinx-theme introduced automatic dark mode. 
However there is a series of changes we need to do (such as providing a 
dark-mode Arrow logo) before we will be ready for this. For the 9.0.0 release, 
we should instead pin to the version of pydata-sphinx-theme just before that 
release.)

> [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
> 
>
> Key: ARROW-17151
> URL: https://issues.apache.org/jira/browse/ARROW-17151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> pydata-sphinx-theme introduced automatic dark mode. However there is a series 
> of changes we need to do (such as providing a dark-mode Arrow logo) before we 
> will be ready for this (see ARROW-17152). For the 9.0.0 release, we should 
> instead pin to the version of pydata-sphinx-theme just before that release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17152) [Docs] Enable dark mode on documentation site

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17152:
--

 Summary: [Docs] Enable dark mode on documentation site
 Key: ARROW-17152
 URL: https://issues.apache.org/jira/browse/ARROW-17152
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Will Jones
 Fix For: 10.0.0
 Attachments: Screen Shot 2022-07-20 at 3.10.51 PM.png, Screen Shot 
2022-07-20 at 3.12.18 PM.png

pydata-sphinx-theme adds dark mode in version 0.9.0. We will need to adapt our 
logo ([see 
docs|https://pydata-sphinx-theme.readthedocs.io/en/stable/user_guide/configuring.html?highlight=dark#different-logos-for-light-and-dark-mode]).
There are also some places in the docs where we may need to adjust additional 
CSS. See the attached screenshots.
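Per the pydata-sphinx-theme guide linked above, the light/dark logo configuration would look roughly like this in conf.py (the asset file names are placeholders, not existing files):

```python
# Sketch of a Sphinx conf.py fragment; the image_light/image_dark keys follow
# the pydata-sphinx-theme "different logos" documentation linked above.
html_theme_options = {
    "logo": {
        "image_light": "_static/arrow-logo-light.png",
        "image_dark": "_static/arrow-logo-dark.png",  # dark variant still needed
    }
}
```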



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17151:
--

 Summary: [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
 Key: ARROW-17151
 URL: https://issues.apache.org/jira/browse/ARROW-17151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones
 Fix For: 9.0.0


pydata-sphinx-theme introduced automatic dark mode. However there is a series 
of changes we need to do (such as providing a dark-mode Arrow logo) before we 
will be ready for this. For the 9.0.0 release, we should instead pin to the 
version of pydata-sphinx-theme just before that release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-17151:
--

Assignee: Will Jones

> [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
> 
>
> Key: ARROW-17151
> URL: https://issues.apache.org/jira/browse/ARROW-17151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> pydata-sphinx-theme introduced automatic dark mode. However there is a series 
> of changes we need to do (such as providing a dark-mode Arrow logo) before we 
> will be ready for this. For the 9.0.0 release, we should instead pin to the 
> version of pydata-sphinx-theme just before that release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-11699) [R] Implement dplyr::across()

2022-07-20 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569183#comment-17569183
 ] 

Nicola Crane commented on ARROW-11699:
--

This is something that there has been user interest in on Twitter

> [R] Implement dplyr::across()
> -
>
> Key: ARROW-11699
> URL: https://issues.apache.org/jira/browse/ARROW-11699
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 10.0.0
>
>
> It's not a generic, but because it seems only to be called inside of 
> functions like `mutate()`, we can insert our own version of it into the NSE 
> data mask.
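For intuition, the expansion across() performs — apply one function to several columns at once — can be sketched on a dict-of-lists table in plain Python (illustrative only, unrelated to the R implementation):

```python
def across(table, columns, fn):
    """Sketch of dplyr::across() semantics: apply fn to each named column of
    a dict-of-lists table and return the mutated columns."""
    return {col: [fn(v) for v in table[col]] for col in columns}

table = {"x": [1, 2, 3], "y": [10, 20, 30], "z": ["a", "b", "c"]}
# mutate(across(c(x, y), ~ .x * 2)) would expand to roughly:
table.update(across(table, ["x", "y"], lambda v: v * 2))
print(table["x"], table["y"])  # [2, 4, 6] [20, 40, 60]
```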



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

2022-07-20 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569175#comment-17569175
 ] 

Jonathan Keane commented on ARROW-17115:


A reprex that causes this from R (which is effectively the TPC-H 12 query that 
segfaults):


{code:r}
library(arrow)
library(dplyr)
library(arrowbench)

ensure_source("tpch", scale_factor = 10)

open_dataset("data/lineitem_10.parquet") %>%
  filter(
    l_shipmode %in% c("MAIL", "SHIP"),
    l_commitdate < l_receiptdate,
    l_shipdate < l_commitdate,
    l_receiptdate >= as.Date("1994-01-01"),
    l_receiptdate < as.Date("1995-01-01")
  ) %>%
  inner_join(
    open_dataset("data/orders_10.parquet"),
    by = c("l_orderkey" = "o_orderkey")
  ) %>%
  group_by(l_shipmode) %>%
  summarise(
    high_line_count = sum(
      if_else(
        (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
        1L,
        0L
      )
    ),
    low_line_count = sum(
      if_else(
        (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
        1L,
        0L
      )
    )
  ) %>%
  ungroup() %>%
  arrange(l_shipmode) %>%
  collect()
{code}

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> --
>
> Key: ARROW-17115
> URL: https://issues.apache.org/jira/browse/ARROW-17115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The new swiss join assumes that batches are being broken according to the 
> morsel/batch model and it assumes those batches have, at most, 32Ki rows 
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this 
> small.  This is causing conbench to fail and would likely be a problem with 
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate 
> maximum size.
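The proposed fix — re-chunking large batches down to the engine's maximum — can be sketched in a few lines (32,768 mirrors the 32Ki limit; plain lists stand in for record batches):

```python
def slice_batches(batches, max_rows=32_768):
    """Re-chunk a stream of row batches so no emitted batch exceeds max_rows,
    as the morsel/batch model expects."""
    for batch in batches:
        for start in range(0, len(batch), max_rows):
            yield batch[start:start + max_rows]
```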



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17127) [C++] Sporadic crash in arrow-dataset-scanner-test (1)

2022-07-20 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569164#comment-17569164
 ] 

Will Jones commented on ARROW-17127:


Is this a duplicate of ARROW-17087?

> [C++] Sporadic crash in arrow-dataset-scanner-test (1)
> --
>
> Key: ARROW-17127
> URL: https://issues.apache.org/jira/browse/ARROW-17127
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 9.0.0
>
>
> See GDB backtrace at 
> https://gist.github.com/pitrou/ef47ab902cbbba80440ee0375a1d7ed3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17135) [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc

2022-07-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-17135.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13654
[https://github.com/apache/arrow/pull/13654]

> [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc
> -
>
> Key: ARROW-17135
> URL: https://issues.apache.org/jira/browse/ARROW-17135
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> I had noticed the large symbol sizes in scalar_compare.cc when looking at the 
> shared library. I had a quick hack on the plane to try to reduce the code size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17149) [R] Enable GCS tests for Windows

2022-07-20 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569158#comment-17569158
 ] 

Will Jones commented on ARROW-17149:


We likely will need to do ARROW-17150 first in order to debug this.

> [R] Enable GCS tests for Windows
> 
>
> Key: ARROW-17149
> URL: https://issues.apache.org/jira/browse/ARROW-17149
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Affects Versions: 9.0.0
>Reporter: Will Jones
>Priority: Major
> Fix For: 10.0.0
>
>
> In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't 
> diagnose why. We should solve that and enable the tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17150) [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17150:
--

 Summary: [R] Allow statically linked libcurl in GCS when building 
libarrow DLL in RTools
 Key: ARROW-17150
 URL: https://issues.apache.org/jira/browse/ARROW-17150
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


Neal's patch in ARROW-16510 enabled libcurl to be linked statically into the 
Google Cloud Storage dependency, but this only seems to work for static 
libraries on RTools (Windows). Development RTools environments currently use 
dynamic Arrow libraries instead, and there we get linking errors against 
libcurl when ARROW_GCS is on.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17149) [R] Enable GCS tests for Windows

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17149:
--

 Summary: [R] Enable GCS tests for Windows
 Key: ARROW-17149
 URL: https://issues.apache.org/jira/browse/ARROW-17149
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't diagnose 
why. We should solve that and enable the tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569157#comment-17569157
 ] 

Antoine Pitrou edited comment on ARROW-16000 at 7/20/22 7:47 PM:
-

Ideally it could, except that a FileSource is supposed to provide a 
RandomAccessFile, not an InputStream. A transformation callback can only work 
for those file formats (CSV for the moment, JSON later on) that read files in a 
purely streaming manner.


was (Author: pitrou):
Ideally it could, except that a FileSource is suppose to provide a 
RandomAccessFile, not an InputStream. A transformation callback can only work 
for those file format that read files in a purely streaming manner.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.
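The streaming transformation discussed above — viable for formats read purely sequentially — can be sketched with the standard library: wrap a binary stream so Latin-1 bytes come out re-encoded as UTF-8, with no random access required (function name illustrative):

```python
import io

def transcode_latin1_to_utf8(src, chunk_size=64 * 1024):
    """Stream-transcode a Latin-1 byte stream to UTF-8 chunks, reading the
    source purely sequentially (an InputStream-style transformation).
    Latin-1 is a single-byte encoding, so chunk boundaries are always safe."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return
        yield chunk.decode("latin-1").encode("utf-8")

data = "café, señor".encode("latin-1")
out = b"".join(transcode_latin1_to_utf8(io.BytesIO(data)))
print(out.decode("utf-8"))  # café, señor
```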



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569157#comment-17569157
 ] 

Antoine Pitrou commented on ARROW-16000:


Ideally it could, except that a FileSource is supposed to provide a 
RandomAccessFile, not an InputStream. A transformation callback can only work 
for those file formats that read files in a purely streaming manner.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface

2022-07-20 Thread Nigel McKernan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569154#comment-17569154
 ] 

Nigel McKernan edited comment on ARROW-8813 at 7/20/22 7:42 PM:


The change [~domiden] references was merged into {{tidyr}} 1.1.0 back in May 
of 2020, as you can see 
[here|https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames],
 more than 2 years ago.

 

Would it be possible now to incorporate some {{tidyr}} methods that have been 
converted to generics into {{{}arrow{}}}?

EDIT: As well, the {{nest()}} generic is now 
[lazily-evaluated|https://github.com/tidyverse/tidyr/releases#:~:text=The%20nest()%20generic%20now%20avoids%20computing%20on%20.data%2C%20making%20it%20more%0Acompatible%20with%20lazy%20tibbles],
 making it easier to do remote operations, as of the {{tidyr}} 1.2.0 release 
earlier this year.


was (Author: JIRAUSER293150):
The issue [~domiden] references was committed into {{tidyr}}  1.1.0 back in May 
of 2020, as you can see 
[here|https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames],
 more than 2 years ago.

 

Would it be possible now to incorporate some {{tidyr}} methods that have been 
converted to generics into {{{}arrow{}}}?

EDIT: As well, the {{nest()}} generic is now 
[lazily-evaluated|https://github.com/tidyverse/tidyr/releases#:~:text=The%20nest()%20generic%20now%20avoids%20computing%20on%20.data%2C%20making%20it%20more%0Acompatible%20with%20lazy%20tibbles],
 making it easier to do remote operations.

> [R] Implementing tidyr interface
> 
>
> Key: ARROW-8813
> URL: https://issues.apache.org/jira/browse/ARROW-8813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dominic Dennenmoser
>Priority: Major
>  Labels: extension, feature, improvement
>
> I think it would be reasonable to implement an interface to the {{tidyr}} 
> package. The implementation would allow Arrow Tables to be processed lazily 
> before being pulled back into memory. Currently, however, you need to collect 
> the table first before applying tidyr methods. The following code chunk shows 
> an example routine:
> {code:r}
> library(magrittr)
> arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) 
> nested_df <-
>arrow_table %>%
>dplyr::select(ID, 4:7, Value) %>%
>dplyr::filter(Value >= 5) %>%
>dplyr::group_by(ID) %>%
>dplyr::collect() %>%
>tidyr::nest(){code}
> The main focus might be the following three methods:
>  * {{tidyr::[un]nest()}},
>  * {{tidyr::pivot_[longer|wider]()}}, and
>  * {{tidyr::separate()}}.
> I suppose the last two can be implemented fairly quickly, but 
> {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until 
> conversion to List becomes accessible.





[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface

2022-07-20 Thread Nigel McKernan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569154#comment-17569154
 ] 

Nigel McKernan edited comment on ARROW-8813 at 7/20/22 7:42 PM:


The issue [~domiden] references was committed into {{tidyr}}  1.1.0 back in May 
of 2020, as you can see 
[here|https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames],
 more than 2 years ago.

 

Would it be possible now to incorporate some {{tidyr}} methods that have been 
converted to generics into {{{}arrow{}}}?

EDIT: As well, the {{nest()}} generic is now 
[lazily-evaluated|https://github.com/tidyverse/tidyr/releases#:~:text=The%20nest()%20generic%20now%20avoids%20computing%20on%20.data%2C%20making%20it%20more%0Acompatible%20with%20lazy%20tibbles],
 making it easier to do remote operations.


was (Author: JIRAUSER293150):
The issue [~domiden] references was committed into {{tidyr}}  1.1.0 back in May 
of 2020, as you can see 
[here|https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames],
 more than 2 years ago.

 

Would it be possible now to incorporate some {{tidyr}} methods that have been 
converted to generics into {{{}arrow{}}}?

> [R] Implementing tidyr interface
> 
>
> Key: ARROW-8813
> URL: https://issues.apache.org/jira/browse/ARROW-8813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dominic Dennenmoser
>Priority: Major
>  Labels: extension, feature, improvement
>
> I think it would be reasonable to implement an interface to the {{tidyr}} 
> package. The implementation would allow Arrow Tables to be processed lazily 
> before being pulled back into memory. Currently, however, you need to collect 
> the table first before applying tidyr methods. The following code chunk shows 
> an example routine:
> {code:r}
> library(magrittr)
> arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) 
> nested_df <-
>arrow_table %>%
>dplyr::select(ID, 4:7, Value) %>%
>dplyr::filter(Value >= 5) %>%
>dplyr::group_by(ID) %>%
>dplyr::collect() %>%
>tidyr::nest(){code}
> The main focus might be the following three methods:
>  * {{tidyr::[un]nest()}},
>  * {{tidyr::pivot_[longer|wider]()}}, and
>  * {{tidyr::separate()}}.
> I suppose the last two can be implemented fairly quickly, but 
> {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until 
> conversion to List becomes accessible.





[jira] [Commented] (ARROW-8813) [R] Implementing tidyr interface

2022-07-20 Thread Nigel McKernan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569154#comment-17569154
 ] 

Nigel McKernan commented on ARROW-8813:
---

The issue [~domiden] references was committed into {{tidyr}}  1.1.0 back in May 
of 2020, as you can see 
[here|https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames],
 more than 2 years ago.

 

Would it be possible now to incorporate some {{tidyr}} methods that have been 
converted to generics into {{{}arrow{}}}?

> [R] Implementing tidyr interface
> 
>
> Key: ARROW-8813
> URL: https://issues.apache.org/jira/browse/ARROW-8813
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dominic Dennenmoser
>Priority: Major
>  Labels: extension, feature, improvement
>
> I think it would be reasonable to implement an interface to the {{tidyr}} 
> package. The implementation would allow Arrow Tables to be processed lazily 
> before being pulled back into memory. Currently, however, you need to collect 
> the table first before applying tidyr methods. The following code chunk shows 
> an example routine:
> {code:r}
> library(magrittr)
> arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) 
> nested_df <-
>arrow_table %>%
>dplyr::select(ID, 4:7, Value) %>%
>dplyr::filter(Value >= 5) %>%
>dplyr::group_by(ID) %>%
>dplyr::collect() %>%
>tidyr::nest(){code}
> The main focus might be the following three methods:
>  * {{tidyr::[un]nest()}},
>  * {{tidyr::pivot_[longer|wider]()}}, and
>  * {{tidyr::separate()}}.
> I suppose the last two can be implemented fairly quickly, but 
> {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until 
> conversion to List becomes accessible.





[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569153#comment-17569153
 ] 

Weston Pace commented on ARROW-16000:
-

Now that I think about it, an arbitrary "input stream transformation callback" 
could also be a property of ScanOptions. This might be more useful once we 
support scanning JSON files; for example, if the JSON files are compressed, the 
callback could add a decompressor to the stream.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.





[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569152#comment-17569152
 ] 

Weston Pace commented on ARROW-16000:
-

For more context: a {{fragment}} is a term introduced by the datasets API.  The 
goal of the datasets API is to read data in from a collection of independently 
scannable fragments (in practice, a fragment usually equals a file).

The scanning process has its own set of options, ScanOptions, (e.g. 
use_threads, batch size, projection, etc.) which are independent of the file 
format.  Each file reader has its own set of options (e.g. ReadOptions) and 
it's completely unaware of any dataset scanner.

Now, pretend you want to scan a collection of CSV files with a custom delimiter 
(e.g. |).  It doesn't make sense for delimiter to be a property of scan options 
because it is specific to CSV.

As a result, we have ScanOptions::fragment_scan_options.  This is an interface 
for which each format provides an implementation, and an instance of it can be 
supplied for the scan.

So, to read a single CSV file with a custom delimiter, you just create 
ParseOptions with the correct delimiter.  To read a dataset of CSV files with a 
custom delimiter, you first create scan options for the scan itself and a 
ParseOptions with the custom delimiter, then link them via 
ScanOptions::fragment_scan_options.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.





[jira] [Created] (ARROW-17148) [R] Improve evaluation of R functions from C++

2022-07-20 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-17148:


 Summary: [R] Improve evaluation of R functions from C++
 Key: ARROW-17148
 URL: https://issues.apache.org/jira/browse/ARROW-17148
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


There are currently a few places where we call R code from C++ (and after 
ARROW-16444 and ARROW-16703 we will have some more, where the overhead of 
calling into R might be greater than the time it takes to actually evaluate 
the function, or where the functions will be called in a tight loop).

The current approach uses {{cpp11::function}}. This is totally fine and safe 
but generates some ugly backtraces on error and is potentially slower than the 
lean-and-mean approach of purrr (whose entire job is to call R functions in a 
loop and which has been heavily optimized). The purrr approach is to construct 
the {{call()}} and calling environment in advance and then just run 
`Rf_eval(call, env)` in the loop. This is both faster (fewer R API calls) and 
generates better backtraces (e.g., {{Error in fun(arg1, arg2)}} instead of 
{{Error in (function(a, b) { ...the whole content of the function ... })(every, 
deparsed, argument)}}).

Before optimizing that heavily we should of course benchmark to see exactly how 
much that matters!





[jira] [Updated] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

2022-07-20 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-17115:
---
Fix Version/s: 9.0.0

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> --
>
> Key: ARROW-17115
> URL: https://issues.apache.org/jira/browse/ARROW-17115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The new swiss join assumes that batches are being broken according to the 
> morsel/batch model and it assumes those batches have, at most, 32Ki rows 
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this 
> small.  This is causing conbench to fail and would likely be a problem with 
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate 
> maximum size.





[jira] [Commented] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask

2022-07-20 Thread Matthew Roeschke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569124#comment-17569124
 ] 

Matthew Roeschke commented on ARROW-17134:
--

Ah okay, that makes sense. When I read "len(replacements) == number of true 
values in the mask", for some reason I thought "len(replacements)" meant the 
values could still correspond positionally to the mask.

> We should maybe consider raising an error if the {{replacements}} are too 
> long?

That would be helpful, or maybe an example in the docstring could help clarify 
that point. Fine either way.

> [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when 
> providing an array mask
> 
>
> Key: ARROW-17134
> URL: https://issues.apache.org/jira/browse/ARROW-17134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: arr1 = pa.array([1, 0, 1, None, None])
> In [3]: arr2 = pa.array([None, None, 1, 0, 1])
> In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], 
> arr2)
> Out[4]:
> 
> [
>   1,
>   0,
>   1,
>   null, # I would expect 0
>   null  # I would expect 1
> ]
> In [5]: pa.__version__
> Out[5]: '8.0.0'{code}
>  
> I have noticed this behavior occur with the integer, floating, bool, temporal 
> types
>  





[jira] [Closed] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc closed ARROW-17141.
--
Resolution: Not A Problem

> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.





[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569123#comment-17569123
 ] 

Rok Mihevc commented on ARROW-17141:


:palmface: sorry, I didn't realise this was done already. Closing.


> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.





[jira] [Commented] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569122#comment-17569122
 ] 

Rok Mihevc commented on ARROW-17132:


Could this be related to this issue coming up in CI: 
[1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547]
 
[2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804]
 ?

> [R] Timezone handling in round-trip of POSIXct
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569106#comment-17569106
 ] 

Antoine Pitrou commented on ARROW-17093:


That said, https://github.com/bombela/backward-cpp could be better than nothing 
on Windows, or in the cases where a working debugger isn't available.

> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569103#comment-17569103
 ] 

David Li commented on ARROW-17093:
--

That frankly sounds fairly reasonable. With gdb, if we get it set up right, we 
can also get Python backtraces where applicable.

> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569102#comment-17569102
 ] 

Antoine Pitrou commented on ARROW-17093:


Really, I'm afraid the least unreasonable solution here is to script our CI to 
automatically find core dumps and script the debugger.


> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569101#comment-17569101
 ] 

Antoine Pitrou commented on ARROW-17093:


There is an ugly workaround suggested here: 
https://stackoverflow.com/questions/44900256/print-all-threads-stack-trace-of-a-process-in-c-c-on-linux-platform
bq. i use pthread_kill in one thread to send SIGUSR2 to other threads, when 
that threads receive the signal, it delivery to user defined signal handler 
function. In that function, use backtrace() to print the thread stack


> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569100#comment-17569100
 ] 

Antoine Pitrou commented on ARROW-17093:


Even the more sophisticated (and portable) 
https://github.com/bombela/backward-cpp seems limited to a single-thread 
backtrace.

> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?





[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569097#comment-17569097
 ] 

Antoine Pitrou commented on ARROW-17093:


From the looks of it, https://github.com/ianlancetaylor/libbacktrace doesn't 
handle multi-threaded backtraces either.
And it seems libSegFault was removed from glibc: 
https://lists.gnu.org/archive/html/info-gnu/2022-02/msg2.html

> [C++][CI] Enable libSegFault for C++ tests
> --
>
> Key: ARROW-17093
> URL: https://issues.apache.org/jira/browse/ARROW-17093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: David Li
>Priority: Major
>
> Adding libSegFault.so could make it easier to diagnose CI failures. It will 
> print a backtrace on segfault.
> {noformat}
>   env SEGFAULT_SIGNALS=all \
>   LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so
> {noformat}
> This will give a backtrace like this on segfault:
> {noformat}
> Backtrace:
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
> /lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
> /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
> /tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]
> {noformat}
> Caveats:
>  * The path is OS-specific
>  * We could integrate it into the build tooling instead of doing it via env 
> var
>  * Are there easily accessible equivalents for MacOS and Windows we could use?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569094#comment-17569094
 ] 

Antoine Pitrou commented on ARROW-17093:


Ah... Unfortunately, the glibc backtrace support only prints a backtrace for 
the current thread, which makes it useless for the interesting cases :-(






[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569092#comment-17569092
 ] 

Antoine Pitrou commented on ARROW-17093:


Edit: ARROW_WITH_BACKTRACE links with the glibc-specific backtrace support: 
https://www.gnu.org/software/libc/manual/html_node/Backtraces.html






[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569090#comment-17569090
 ] 

David Li commented on ARROW-17093:
--

I _think_ what that does is use libbacktrace to get backtraces for assertions, 
but AIUI that library doesn't (automatically) install a fault handler the way 
libSegFault does.






[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569088#comment-17569088
 ] 

Antoine Pitrou commented on ARROW-17093:


Interesting: we have a little-known ARROW_WITH_BACKTRACE option that seems to 
link with https://github.com/ianlancetaylor/libbacktrace . I'm not sure it 
works, though?






[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569084#comment-17569084
 ] 

Antoine Pitrou commented on ARROW-17093:


This could be useful to better diagnose tests which occasionally timeout on CI, 
for example by adding a trap-on-timeout facility:
{code:c++}
class TrapOnTimeoutGuard {
 public:
  explicit TrapOnTimeoutGuard(double seconds) {
    auto fut = finished_;
    bg_thread_ = std::thread([fut, seconds]() {
      if (!fut.Wait(seconds)) {
        psnip_trap();
      }
    });
  }

  ~TrapOnTimeoutGuard() {
    finished_.MarkFinished();
    bg_thread_.join();
  }

 private:
  Future<> finished_ = Future<>::Make();
  std::thread bg_thread_;
};
{code}
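
For what it's worth, CPython's stdlib has a ready-made version of this watchdog idea: {{faulthandler.dump_traceback_later()}} arms a timer that dumps all thread stacks (and can optionally exit) unless it is cancelled in time. A small sketch of the same pattern, offered as an analogy rather than something the C++ tests can reuse directly:

```python
import faulthandler
import tempfile
import time

# Arm a watchdog: if cancel_dump_traceback_later() is not called within
# 5 seconds, stacks for all threads are written to the file (pass
# exit=True to also abort the process, much like psnip_trap above).
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback_later(5, file=f)
    time.sleep(0.1)  # stand-in for the guarded test body
    faulthandler.cancel_dump_traceback_later()
    f.seek(0)
    dumped = f.read()

# The work finished inside the timeout, so nothing was dumped.
```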

cc [~westonpace]






[jira] [Commented] (ARROW-17093) [C++][CI] Enable libSegFault for C++ tests

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569081#comment-17569081
 ] 

Antoine Pitrou commented on ARROW-17093:


cc [~assignUser]






[jira] [Comment Edited] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569074#comment-17569074
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-17132 at 7/20/22 3:13 PM:
---

Without having looked into it, I think Neal's suggestion might solve the 
issue. 

Moreover, in the chunk above, I don't think {{tz = "UTC"}} is being used by 
{{as.Date()}} (it is an argument useful for the POSIXct method, for example, 
but not for the character method), but rather silently ignored. 


was (Author: dragosmg):
Without having looked into it, I think Neal's suggestion might do the trick. 

Moreover, in the chunk above, I don't think {{tz = "UTC"}} is being used by 
{{as.Date()}} (it is an argument useful for the POSIXct method, for example, 
but not for the character method), but rather silently ignored. 

> [R] Timezone handling in round-trip of POSIXct
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.





[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569077#comment-17569077
 ] 

David Li commented on ARROW-17141:
--

FieldPath is already in C++ so you just need to call that from Python, unless 
I'm misunderstanding?
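
The requested ergonomics, sketched on plain nested mappings rather than Arrow's actual FieldPath machinery ({{get_field_path}} here is a hypothetical helper, not an Arrow API):

```python
def get_field_path(value, path):
    # Resolve a dotted field path such as "a.b.c" by walking one
    # nesting level per path component.
    for name in path.split("."):
        value = value[name]
    return value

# One call replaces a chain of per-level selects.
row = {"top": {"sub": {"leaf": 42}}}
print(get_field_path(row, "top.sub.leaf"))
```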

> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.





[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569078#comment-17569078
 ] 

David Li commented on ARROW-17141:
--

Also, in C++, we can't add a {{Table::field()}} because we can't depend on 
ARROW_COMPUTE being enabled.






[jira] [Comment Edited] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569074#comment-17569074
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-17132 at 7/20/22 3:05 PM:
---

Without having looked into it, I think Neal's suggestion might do the trick. 

Moreover, in the chunk above, I don't think {{tz = "UTC"}} is being used by 
{{as.Date()}} (it is an argument useful for the POSIXct method, for example, 
but not for the character method), but rather silently ignored. 


was (Author: dragosmg):
Without having looked into it, I think Neal's suggestion might do the trick. 

Moreover, in the chunk above, I don't think {{tz = "UTC"}} is being used by 
{{as.Date()}}, but rather silently ignored.






[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569075#comment-17569075
 ] 

Rok Mihevc commented on ARROW-17141:


Wouldn't it make sense to implement it in C++ so it's available everywhere?






[jira] [Comment Edited] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569074#comment-17569074
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-17132 at 7/20/22 3:00 PM:
---

Without having looked into it, I think Neal's suggestion might do the trick. 

Moreover, in the chunk above, I don't think {{tz = "UTC"}} is being used by 
{{as.Date()}}, but rather silently ignored.


was (Author: dragosmg):
Without having looked into it, I think Neal's suggestion might do the trick. 






[jira] [Commented] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569074#comment-17569074
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17132:
--

Without having looked into it, I think Neal's suggestion might do the trick. 






[jira] [Commented] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569073#comment-17569073
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17132:
--

What I mean by not specified in R: in the call above {{as.POSIXct()}} is called 
with {{tz = ""}}, which is the default value, and the difference we notice 
stems from there. In order to achieve the same integer value for the date, in 
Arrow we assert the time zone. 
Objects of class {{Date}} do not store a timezone attribute in R. In the case 
above, {{tz = "UTC"}} is not passed to {{as.POSIXct()}}. An object created 
with {{as.POSIXct(..., tz = "")}} gets printed with the local timezone, but 
this timezone is more of an artefact of the print method and isn't actually 
stored as an attribute:
{code:r}
as.POSIXct("1999-12-31")
#> [1] "1999-12-31 GMT"
a <- as.POSIXct("1999-12-31")
attributes(a)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] ""
{code}






[jira] [Resolved] (ARROW-16323) [Go] Implement Dictionary Scalars

2022-07-20 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-16323.
---
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13575
[https://github.com/apache/arrow/pull/13575]

> [Go] Implement Dictionary Scalars
> -
>
> Key: ARROW-16323
> URL: https://issues.apache.org/jira/browse/ARROW-16323
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569065#comment-17569065
 ] 

Neal Richardson commented on ARROW-17132:
-

Perhaps we can add a special R metadata field when we serialize to Arrow that 
says this should be a naive timestamp in R, and handle that in the Arrow-->R 
conversion?






[jira] [Commented] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569060#comment-17569060
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17132:
--

Yep, I worked on ARROW-14442. I think what we are now seeing is a side-effect 
of that, because we are asserting a timezone when none is specified in R. 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17147) [R] parse_date_time should support locale parameter

2022-07-20 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17147:
--

 Summary: [R] parse_date_time should support locale parameter
 Key: ARROW-17147
 URL: https://issues.apache.org/jira/browse/ARROW-17147
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Rok Mihevc


See [discussion 
here|https://github.com/apache/arrow/pull/13627#discussion_r924875872].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17146) [R] parse_date_time should support quiet = FALSE

2022-07-20 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17146:
--

 Summary: [R] parse_date_time should support quiet = FALSE
 Key: ARROW-17146
 URL: https://issues.apache.org/jira/browse/ARROW-17146
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Rok Mihevc


See [discussion 
here|https://github.com/apache/arrow/pull/13627#discussion_r924875872].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569033#comment-17569033
 ] 

Antoine Pitrou commented on ARROW-16000:


So to be clear, I'm suggesting something like this:
https://gist.github.com/pitrou/2c99e971c01d324fdf72b632ffacae7d


> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17132) [R] Timezone handling in round-trip of POSIXct

2022-07-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17132:

Summary: [R] Timezone handling in round-trip of POSIXct  (was: [R] Mutate 
in compare_dplyr_binding returns wrong type)

> [R] Timezone handling in round-trip of POSIXct
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17132) [R] Mutate in compare_dplyr_binding returns wrong type

2022-07-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569030#comment-17569030
 ] 

Neal Richardson commented on ARROW-17132:
-

Right, transmute drops the input columns, so that would work around this. Note 
that this isn't about {{mutate()}} but rather the R <–> Arrow conversion, 
and/or how R deals with timezones, timezone-naive data, or localized data.

 {code}
> expect_identical(as.data.frame(arrow_table(df)), df)
Error: as.data.frame(arrow_table(df)) (`actual`) not identical to `df` 
(`expected`).

`attr(actual$time, 'tzone')` is a character vector ('America/Los_Angeles')
`attr(expected$time, 'tzone')` is absent
{code}

I'm sure if we keep pulling on this, we'll end up back at some issue we've 
worked on before about how R treats timestamps with no timezone as local 
time but Arrow reads them as UTC, so we have to incorporate time zone 
information when converting from R to Arrow.

{code}
> attributes(as.data.frame(arrow_table(df))$time)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "America/Los_Angeles"

> attributes(df$time)
$class
[1] "POSIXct" "POSIXt" 
{code}


> [R] Mutate in compare_dplyr_binding returns wrong type
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569029#comment-17569029
 ] 

David Li commented on ARROW-17141:
--

Yeah, if the Python bindings convert names to indices that makes sense.

> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569028#comment-17569028
 ] 

Rok Mihevc commented on ARROW-17141:


Oh, that's neater than I remember.

Would it make sense to add a field path utility here to enable an alternative 
way of selecting? E.g.: field_path_to_indices("parent_field.child_field") -> 
\{parent_field_idx, child_field_idx}. It would be purely syntactic sugar.

> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17132) [R] Mutate in compare_dplyr_binding returns wrong type

2022-07-20 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569019#comment-17569019
 ] 

Rok Mihevc commented on ARROW-17132:


For the record I hit this when working on 
https://github.com/apache/arrow/pull/13440 and got around it by using 
transmute instead of mutate, so I'm not actively bothered by this.

> [R] Mutate in compare_dplyr_binding returns wrong type
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569018#comment-17569018
 ] 

Antoine Pitrou commented on ARROW-16000:


When using the CSV reader directly, no additional option is needed since you 
can pass whichever InputStream you want.
It's only with datasets that the user doesn't create the InputStream 
themselves, so some option must be made available to customize the stream 
wrapping strategy.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17086) [C++] Install java/dataset include file and fix debug build failed by compiler flag

2022-07-20 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17086.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13614
[https://github.com/apache/arrow/pull/13614]

> [C++] Install java/dataset include file and fix debug build failed by 
> compiler flag
> ---
>
> Key: ARROW-17086
> URL: https://issues.apache.org/jira/browse/ARROW-17086
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: Jin Chengcheng
>Assignee: Jin Chengcheng
>Priority: Major
>  Labels: easyfix, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Arrow 8 installs jni_util.h, but Arrow 9 does not, which causes the 
> [gluten|https://github.com/oap-project/gluten] project to fail when 
> upgraded to Arrow 9.
> Also, building with cmake --preset ninja-debug
> -DARROW_BUILD_SHARED=ON
> -DARROW_GANDIVA_JAVA=ON
> -DARROW_GANDIVA=ON
> fails with this error, which should be fixed:
> arrow/cpp/src/gandiva/jni/expression_registry_helper.cc:157:78: error: 
> implicit conversion loses integer precision: 'unsigned long' to 'int' 
> [-Werror,-Wshorten-64-to-32]
> gandiva_data_types.SerializeToArray(reinterpret_cast(buffer.get()), 
> size);



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17145) [C++] Compilation warnings on gcc in release mode

2022-07-20 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569016#comment-17569016
 ] 

Antoine Pitrou commented on ARROW-17145:


cc [~wesm]

> [C++] Compilation warnings on gcc in release mode
> -
>
> Key: ARROW-17145
> URL: https://issues.apache.org/jira/browse/ARROW-17145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> With gcc 10.3 I get this warning in release mode.
> {code}
> [168/321] Building CXX object 
> src/arrow/CMakeFiles/arrow_testing_objlib.dir/compute/exec/test_util.cc.o
> In file included from 
> /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.h:28,
>  from 
> /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.cc:18:
> /home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h: In member function 'R 
> arrow::internal::FnOnce::FnImpl::invoke(A&& ...) [with Fn = 
> arrow::Future<>::WrapResultyOnComplete::Callback::ThenOnComplete  
> arrow::AsyncGenerator
>  >)::, 
> arrow::Future<>::PassthruOnFailure  
> arrow::AsyncGenerator
>  >):: > > >; R = void; A = {const arrow::FutureImpl&}]':
> /home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h:177:21: warning: 
> '*((void*)(&)+8).arrow::compute::ExecBatch::length' may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>   177 | struct ARROW_EXPORT ExecBatch {
>   | ^
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17143) [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`

2022-07-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569015#comment-17569015
 ] 

Neal Richardson commented on ARROW-17143:
-

These tidyr functions aren't working on Arrow Tables, they're working on the R 
data.frame that read_json_arrow returns. I guess you could add something about 
this to the cookbook, but there's no real connection to anything in the package 
itself.

> [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`
> ---
>
> Key: ARROW-17143
> URL: https://issues.apache.org/jira/browse/ARROW-17143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.1
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Related to ARROW-8813 ARROW-12099
> The arrow package can convert json files to data frames very easily, but 
> {{tidyr::unnest_longer}} is needed for array expansion.
> Wonder if {{tidyr}} could be added to the recommended package and examples 
> like this could be included in the documentation and test cases.
> {code:r}
> tf <- tempfile()
> on.exit(unlink(tf))
> writeLines('
> { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
> { "hello": 3.25, "world": null }
> { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
>   ', tf)
> arrow::read_json_arrow(tf) |>
>   tidyr::unnest(foo, names_sep = ".") |>
>   tidyr::unnest_longer(foo.bar)
> #> # A tibble: 6 × 3
> #>   hello world foo.bar
> #>   
> #> 1  3.5  FALSE   1
> #> 2  3.5  FALSE   2
> #> 3  3.25 NA NA
> #> 4  0TRUE3
> #> 5  0TRUE4
> #> 6  0TRUE5
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17145) [C++] Compilation warnings on gcc in release mode

2022-07-20 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17145:
--

 Summary: [C++] Compilation warnings on gcc in release mode
 Key: ARROW-17145
 URL: https://issues.apache.org/jira/browse/ARROW-17145
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


With gcc 10.3 I get this warning in release mode.

{code}
[168/321] Building CXX object 
src/arrow/CMakeFiles/arrow_testing_objlib.dir/compute/exec/test_util.cc.o
In file included from 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.h:28,
 from 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/test_util.cc:18:
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h: In member function 'R 
arrow::internal::FnOnce::FnImpl::invoke(A&& ...) [with Fn = 
arrow::Future<>::WrapResultyOnComplete::Callback::ThenOnComplete
 >)::, 
arrow::Future<>::PassthruOnFailure
 >):: > > >; R = void; A = {const arrow::FutureImpl&}]':
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.h:177:21: warning: 
'*((void*)(&)+8).arrow::compute::ExecBatch::length' may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  177 | struct ARROW_EXPORT ExecBatch {
  | ^
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17132) [R] Mutate in compare_dplyr_binding returns wrong type

2022-07-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569011#comment-17569011
 ] 

Neal Richardson commented on ARROW-17132:
-

The assertion is failing on "time", your input variable, not the result of 
{{yday()}}. I don't recall exactly what all of the corner cases are around 
timezones in the R <–> Arrow conversion, but I know there are several; 
[~dragosmg] may know better. There may be a more stable way of generating your 
input data.

> [R] Mutate in compare_dplyr_binding returns wrong type
> --
>
> Key: ARROW-17132
> URL: https://issues.apache.org/jira/browse/ARROW-17132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Rok Mihevc
>Priority: Minor
>  Labels: test
>
> The following:
> {code:r}
> df <- tibble::tibble(
>   time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
> as.Date("2001-01-01", tz = "UTC"), by = "day"))
> )
> compare_dplyr_binding(
>   .input %>%
> mutate(x = yday(time)) %>%
> collect(),
>   df
> )
> {code}
> Fails with:
> {code:bash}
> Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
> `object` (`actual`) not equal to `expected` (`expected`).
> `attr(actual$time, 'tzone')` is a character vector ('UTC')
> `attr(expected$time, 'tzone')` is absent
> Backtrace:
>  1. arrow:::compare_dplyr_binding(...)
>   at test-dplyr-funcs-datetime.R:574:2
>  2. arrow:::expect_equal(via_batch, expected, ...)
>   at tests/testthat/helper-expectation.R:115:4
>  3. testthat::expect_equal(...)
>   at tests/testthat/helper-expectation.R:42:4
> {code}
> This also happens for qday and probably other functions where input is 
> temporal and output is numeric.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Joost Hoozemans (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569009#comment-17569009
 ] 

Joost Hoozemans commented on ARROW-16000:
-

Thanks everyone for the advice. What makes CsvFragmentScanOptions the preferred 
place over csv.ReadOptions? CsvFragmentScanOptions doesn't directly store any 
properties itself right now; it only carries a csv.ConvertOptions and a 
csv.ReadOptions. And compression and encoding sound like properties of a whole 
file, not a fragment (although I don't know if that is what Fragment means 
here).

Would it make sense, as a first attempt, for me to add a TransformInputStream 
to CsvFragmentScanOptions or ReadOptions? I could then create one in Python the 
same way read_csv does it (with MakeTransformInputStream, with a callback into 
en/decode functions in a Python library), and we can see if there is a 
performance problem. Later we could add functionality that creates a 
TransformInputStream in the C++ world with a callback to some external library, 
as Antoine suggested.

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16612) [R] Support inferring compression from filename for all readers/writers

2022-07-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16612:

Summary: [R] Support inferring compression from filename for all 
readers/writers  (was: [R] parquet files with compression extensions should use 
parquet writer for compression)

> [R] Support inferring compression from filename for all readers/writers
> ---
>
> Key: ARROW-16612
> URL: https://issues.apache.org/jira/browse/ARROW-16612
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Sam Albers
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now arrow will silently write a file with a .gz extension to 
> CompressedOutputStream rather than passing the compression option to the 
> parquet writer itself. The internal detect_compression() function detects the 
> extension and that is what passes it off incorrectly. However it only fails 
> at the read_parquet stage which could lead to confusion. 
> {code:java}
> library(arrow, warn.conflicts = FALSE)
> tf <- tempfile(fileext = ".parquet.gz")
> write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
> read_parquet(tf)
> #> Error: file must be a "RandomAccessFile"{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-07-20 Thread Joost Hoozemans (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joost Hoozemans reassigned ARROW-16000:
---

Assignee: Joost Hoozemans

> [C++][Dataset] Support Latin-1 encoding
> ---
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Joost Hoozemans
>Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17142:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kshiteej K
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Get `metadata`.
> r_metadata = pq.read_metadata(fname)
> # Equals on Metadata segfaults when passed None
> r_metadata.equals(metadata){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14855) [R] build_expr() should check that non-expression inputs have vec_size() == 1L

2022-07-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-14855:


Assignee: Dragoș Moldovan-Grünfeld

> [R] build_expr() should check that non-expression inputs have vec_size() == 1L
> --
>
> Key: ARROW-14855
> URL: https://issues.apache.org/jira/browse/ARROW-14855
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> What I’m trying to do is error to prevent code like this from working (since 
> row order isn’t guaranteed in Arrow but is in R): 
> {code:R}
> # remotes::install_github("apache/arrow/r#11690")
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> record_batch(a = c("something1", "something2")) %>% 
>   mutate(df_col = data.frame(a, b = c("other1", "other2")))
> #> InMemoryDataset (query)
> #> a: string
> #> df_col: struct> ({a=a, b=...})
> #> 
> #> See $.data for the source Arrow object
> tibble(a = c("something1", "something2")) %>% 
>   mutate(df_col = data.frame(a, b = c("other1", "other2"))) %>% 
>   arrow:::arrow_dplyr_query()
> #> InMemoryDataset (query)
> #> a: string
> #> df_col: struct
> #> 
> #> See $.data for the source Arrow object
> {code}
> This shows up elsewhere too with a confusing error: 
> {code:R}
> record_batch(a = 1:2) %>% mutate(a + 3:4)
> #> Error: NotImplemented: Function add_checked has no kernel matching input 
> types (array[int32], scalar[list])
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:340
>   call.function->DispatchBest()
> {code}
> I think we need slightly different rules than {{Scalar$create()}} uses when 
> interpreting user expressions, since we want to error rather than wrap values 
> with {{vctrs::vec_size() != 1}} in {{list()}} (thus changing the type 
> that the user specified). 
> Relevant section of {{build_expr()}}: 
> 
>  
> Relevant section of {{Scalar$create()}}: 
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12322) [R] Work around masking of data type functions

2022-07-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-12322:


Assignee: Dragoș Moldovan-Grünfeld

> [R] Work around masking of data type functions
> --
>
> Key: ARROW-12322
> URL: https://issues.apache.org/jira/browse/ARROW-12322
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Minor
>
> There are more than two dozen data type functions in the arrow package, and 
> they are named very generically, so they represent a large surface area for 
> potential masking problems, which are likely to occur in user environments, 
> not in our CI. If these masking errors do occur, they will probably give 
> frustratingly unhelpful error messages. This happened to me with 
> {{rlang::string()}}. The error was:
> {quote}Error in is_integerish(x) : argument "x" is missing, with no default
> {quote}
> This can be worked around with some non-standard eval magic.
> I implemented a working version of this in 
> [https://github.com/apache/arrow/pull/9952] but we removed it before merging 
> that PR because there were questions about whether there was a better way to 
> implement it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15011) [R] Can we (semi?) automatically document when a binding exists

2022-07-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-15011:


Assignee: Dragoș Moldovan-Grünfeld

> [R] Can we (semi?) automatically document when a binding exists
> ---
>
> Key: ARROW-15011
> URL: https://issues.apache.org/jira/browse/ARROW-15011
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> We don't want to (re)write the documentation for each binding that exists, 
> but could we use templates or other automated ways of documenting "This 
> binding should work just like X from package Y" when that's true, and then 
> have a place to put some of the exceptions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12093) [R] Throw helpful errors on bad object types in dplyr expressions

2022-07-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-12093:


Assignee: Dragoș Moldovan-Grünfeld

> [R] Throw helpful errors on bad object types in dplyr expressions
> -
>
> Key: ARROW-12093
> URL: https://issues.apache.org/jira/browse/ARROW-12093
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Ian Cook
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When users pass bad dots args to {{mutate()}} and other dplyr verbs (such 
> that they could not succeed even if the data were pulled into R first), they 
> should get immediate, informative errors.
> An example:
> {code:java}
> mtcars %>% Table$create() %>% mutate(c){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568991#comment-17568991
 ] 

David Li commented on ARROW-17141:
--

Presumably a binding to the {{struct_field}} kernel could do this

> [C++] Enable selecting nested fields in StructArray with field path
> ---
>
> Key: ARROW-17141
> URL: https://issues.apache.org/jira/browse/ARROW-17141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: kernel
>
> Currently selecting a nested field in a StructArray requires multiple selects 
> or flattening of schema. It would be more user friendly to provide a field 
> path e.g.: field_in_top_struct.field_in_substruct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16774) [C++] Create Filter Kernel on RLE data

2022-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16774:
---
Labels: pull-request-available  (was: )

> [C++] Create Filter Kernel on RLE data
> --
>
> Key: ARROW-16774
> URL: https://issues.apache.org/jira/browse/ARROW-16774
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16781) [C++] Complete RunLengthEncoded type

2022-07-20 Thread Tobias Zagorni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Zagorni reassigned ARROW-16781:
--

Assignee: Tobias Zagorni

> [C++] Complete RunLengthEncoded type
> 
>
> Key: ARROW-16781
> URL: https://issues.apache.org/jira/browse/ARROW-16781
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Tobias Zagorni
>Assignee: Tobias Zagorni
>Priority: Major
>
> We currently have a RunLengthEncoded DataType class that is good enough to 
> create RLE ArrayData and ArraySpan instances, and to dispatch kernels based 
> on it, with not much functionality beyond that. This task is to implement the 
> regular Arrow C++ functionality you would expect to work on all data types:
>  * Corresponding Array type
>  * Corresponding Array Builder
>  * type traits
>  * make_array() should work
>  * Validate() / ValidateFull() passes
>  * type tests pass
>  * ...?
> To me these all seem pretty entangled with each other, but if you find a way 
> to split this into multiple tasks, feel free to do so.
> The basic functionality is included in 
> [https://github.com/apache/arrow/pull/13330]. So this PR can be based upon 
> that branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17137) [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`

2022-07-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568984#comment-17568984
 ] 

Joris Van den Bossche commented on ARROW-17137:
---

[~SimonCW] thanks for the report! I can confirm the error on the latest master 
branch as well (on Linux).

> [Python] Converting data frame to Table with large nested column fails 
> `Invalid Struct child array has length smaller than expected`
> 
>
> Key: ARROW-17137
> URL: https://issues.apache.org/jira/browse/ARROW-17137
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Simon Weiß
>Priority: Major
>  Labels: python-conversion
>
> Hey, 
> I have a data frame for which one column is a nested struct array. Converting 
> it to a pyarrow.Table fails if the data frame gets too big. I could reproduce 
> the bug with a minimal example with anonymized data that is roughly similar 
> to mine. When I set, e.g., N_ROWS=500_000 or smaller, it works fine.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> N_ROWS = 800_000
> item_record = {
>     "someImportantAssets": [
>         {
>             "square": 
> "https://some.super.loong.link.com/withmany/lorem/upload/;
>             
> "ipsum/stilllooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
>             "/specificLink-i15348891"
>         }
>     ],
>     "id": "i15348891",
>     "title": "Some Long Item Title i15348891",
> }
> user_record = {
>     "userId": "faa4648-4964drf-64648fafa648-4648falj",
>     "data": [item_record for _ in range(24)],
> }
> df = pd.DataFrame([user_record for _ in range(N_ROWS)])
> table = pa.Table.from_pandas(df){code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "/.../scratch/experiment_pq.py", line 23, in 
>     table = pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
>   File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #1 invalid: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (13338407 < 13338408) {code}
> The length is always smaller than expected by 1.
>  
> h2. Expected behavior:
> Run without errors or fail with a better error message.
>  
> h2. System Info and Versions:
> Apple M1 Pro but also happened on amd64 Linux machine on AWS
>  
> {code:java}
> arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
> pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
> python                    3.9.7           h54d631c_3_cpython    conda-forge
> {code}
> I could also reproduce with
> {noformat}
>  pyarrow 8.0.0{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17137) [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`

2022-07-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17137:
--
Labels: python-conversion  (was: )

> [Python] Converting data frame to Table with large nested column fails 
> `Invalid Struct child array has length smaller than expected`
> 
>
> Key: ARROW-17137
> URL: https://issues.apache.org/jira/browse/ARROW-17137
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Simon Weiß
>Priority: Major
>  Labels: python-conversion
>
> Hey, 
> I have a data frame for which one column is a nested struct array. Converting 
> it to a pyarrow.Table fails if the data frame gets too big. I could reproduce 
> the bug with a minimal example with anonymized data that is roughly similar 
> to mine. When I set, e.g., N_ROWS=500_000 or smaller, it works fine.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> N_ROWS = 800_000
> item_record = {
>     "someImportantAssets": [
>         {
>             "square": 
> "https://some.super.loong.link.com/withmany/lorem/upload/;
>             
> "ipsum/stilllooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
>             "/specificLink-i15348891"
>         }
>     ],
>     "id": "i15348891",
>     "title": "Some Long Item Title i15348891",
> }
> user_record = {
>     "userId": "faa4648-4964drf-64648fafa648-4648falj",
>     "data": [item_record for _ in range(24)],
> }
> df = pd.DataFrame([user_record for _ in range(N_ROWS)])
> table = pa.Table.from_pandas(df){code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "/.../scratch/experiment_pq.py", line 23, in 
>     table = pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
>   File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array 
> invalid: Invalid: Struct child array #1 invalid: Invalid: List child array 
> invalid: Invalid: Struct child array #0 has length smaller than expected for 
> struct array (13338407 < 13338408) {code}
> The length is always smaller than expected by 1.
>  
> h2. Expected behavior:
> Run without errors or fail with a better error message.
>  
> h2. System Info and Versions:
> Apple M1 Pro but also happened on amd64 Linux machine on AWS
>  
> {code:java}
> arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
> pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
> python                    3.9.7           h54d631c_3_cpython    conda-forge
> {code}
> I could also reproduce with
> {noformat}
>  pyarrow 8.0.0{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask

2022-07-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568983#comment-17568983
 ] 

Joris Van den Bossche commented on ARROW-17134:
---

The replacement array isn't expected to have the same shape as the input/mask 
arrays (with replacements at the corresponding positions); it only holds the 
values that are actually placed in the new array (so len(replacements) == 
number of true values in the mask). 
So given that your {{arr2}} starts with two nulls, it is those two values that 
are put in the result. 

Compared to numpy, it thus has the same behaviour as {{setitem}} 
({{arr[mask] = replacements}}), and not as {{np.putmask}} (where values and 
replacements have the same shape) 

We should maybe consider raising an error if the {{replacements}} are too long? 

For the case where you want to use the corresponding (same-location) values of 
the replacements, I think one can use {{pc.if_else(mask, 
replacements, values)}}. Using your example:

{code}
In [13]: pc.if_else([False, False, False, True, True], arr2, arr1)
Out[13]: 

[
  1,
  0,
  1,
  0,
  1
]
{code}





> [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when 
> providing an array mask
> 
>
> Key: ARROW-17134
> URL: https://issues.apache.org/jira/browse/ARROW-17134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: arr1 = pa.array([1, 0, 1, None, None])
> In [3]: arr2 = pa.array([None, None, 1, 0, 1])
> In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], 
> arr2)
> Out[4]:
> 
> [
>   1,
>   0,
>   1,
>   null, # I would expect 0
>   null  # I would expect 1
> ]
> In [5]: pa.__version__
> Out[5]: '8.0.0'{code}
>  
> I have noticed this behavior occur with the integer, floating, bool, temporal 
> types
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17144) Adding sqrt Function

2022-07-20 Thread Sahaj Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahaj Gupta reassigned ARROW-17144:
---

Assignee: Sahaj Gupta

> Adding sqrt Function
> 
>
> Key: ARROW-17144
> URL: https://issues.apache.org/jira/browse/ARROW-17144
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Sahaj Gupta
>Assignee: Sahaj Gupta
>Priority: Major
>
> Adding Sqrt Function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17144) Adding sqrt Function

2022-07-20 Thread Sahaj Gupta (Jira)
Sahaj Gupta created ARROW-17144:
---

 Summary: Adding sqrt Function
 Key: ARROW-17144
 URL: https://issues.apache.org/jira/browse/ARROW-17144
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sahaj Gupta


Adding Sqrt Function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17125) Unable to install pyarrow on Debian 10 (i686)

2022-07-20 Thread Rustam Guliev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568981#comment-17568981
 ] 

Rustam Guliev commented on ARROW-17125:
---

So I talked to our IT department and they agreed to upgrade the machine to 
64-bit Debian 11. Hopefully that will solve my issue. 

> Unable to install pyarrow on Debian 10 (i686)
> -
>
> Key: ARROW-17125
> URL: https://issues.apache.org/jira/browse/ARROW-17125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 7.0.1, 8.0.1
> Environment: Debian GNU/Linux 10 (buster)
> Python 3.9.7
> pip 22.1.2 
> cmake 3.22.5
> $ lscpu
> Architecture:        i686
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       45 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  1
> Socket(s):           4
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               45
> Model name:          Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Stepping:            7
> CPU MHz:             1995.000
> BogoMIPS:            3990.00
> Hypervisor vendor:   VMware
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            20480K
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ss nx rdtscp lm constant_tsc 
> arch_perfmon xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 
> cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor 
> lahf_lm pti ssbd ibrs ibpb stibp tsc_adjust arat md_clear flush_l1d 
> arch_capabilities  
>Reporter: Rustam Guliev
>Priority: Major
>
> Hi,
> I am not able to install pyarrow on Debian 10. First, the installation (via 
> `pip` or `poetry install`) fails with the following:
>  
> {code:java}
>   EnvCommandError  Command 
> ['/home/rustam/.cache/pypoetry/virtualenvs/spectra-annotator-Vr_f9e53-py3.9/bin/pip',
>  'install', '--no-deps', 
> 'file:///home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz']
>  errored with the following return code 1, and output:
>   Processing 
> /home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz
>     Installing build dependencies: started
>     Installing build dependencies: finished with status 'done'
>     Getting requirements to build wheel: started
>     Getting requirements to build wheel: finished with status 'done'
>     Preparing metadata (pyproject.toml): started
>     Preparing metadata (pyproject.toml): finished with status 'done'
>   Building wheels for collected packages: pyarrow
>     Building wheel for pyarrow (pyproject.toml): started
>     Building wheel for pyarrow (pyproject.toml): finished with status 'error'
>     error: subprocess-exited-with-error    × Building wheel for pyarrow 
> (pyproject.toml) did not run successfully.
>     │ exit code: 1
>     ╰─> [261 lines of output]
>         running bdist_wheel
>         running build
>         running build_py
>         running egg_info
>         writing pyarrow.egg-info/PKG-INFO
>         writing dependency_links to pyarrow.egg-info/dependency_links.txt
>         writing entry points to pyarrow.egg-info/entry_points.txt
>         writing requirements to pyarrow.egg-info/requires.txt
>         writing top-level names to pyarrow.egg-info/top_level.txt
>         listing git files failed - pretending there aren't any
>         reading manifest file 'pyarrow.egg-info/SOURCES.txt'
>         reading manifest template 'MANIFEST.in'
>         warning: no files found matching '../LICENSE.txt'
>         warning: no files found matching '../NOTICE.txt'
>         warning: no previously-included files matching '*.so' found anywhere 
> in distribution
>         warning: no previously-included files matching '*.pyc' found anywhere 
> in distribution
>         warning: no previously-included files matching '*~' found anywhere in 
> distribution
>         warning: no previously-included files matching '#*' found anywhere in 
> distribution
>         warning: no previously-included files matching '.git*' found anywhere 
> in distribution
>         warning: no previously-included files matching '.DS_Store' found 
> anywhere in distribution
>         no previously-included directories found matching '.asv'
>         
> /tmp/pip-build-env-umvxn44o/overlay/lib/python3.9/site-packages/setuptools/command/build_py.py:153:
>  SetuptoolsDeprecationWarning:     Installing 'pyarrow.includes' as data is 
> deprecated, please list it in `packages`.
>             !!
>             

[jira] [Updated] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17142:
--
Component/s: Python

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kshiteej K
>Priority: Major
>
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Get `metadata` from the written file
> r_metadata = pq.read_metadata(fname)
> # Equals on Metadata segfaults when passed None
> r_metadata.equals(metadata) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17142:
--
Labels: good-first-issue  (was: )

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kshiteej K
>Priority: Major
>  Labels: good-first-issue
>
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Get `metadata` from the written file
> r_metadata = pq.read_metadata(fname)
> # Equals on Metadata segfaults when passed None
> r_metadata.equals(metadata) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17142:
--
Summary: [Python] Parquet FileMetadata.equals() method segfaults when 
passed None  (was: `equals` method on Parquet Metadata segfaults when passed 
`None)

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Kshiteej K
>Priority: Major
>
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Get `metadata` from the written file
> r_metadata = pq.read_metadata(fname)
> # Equals on Metadata segfaults when passed None
> r_metadata.equals(metadata) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17143) [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`

2022-07-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17143:
--
Description: 
Related to ARROW-8813 ARROW-12099

The arrow package can convert JSON files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed for array expansion.
Wonder if {{tidyr}} could be added to the recommended packages and examples like 
this could be included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}

  was:
Related to ARROW-8813

The arrow package can convert JSON files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed for array expansion.
Wonder if {{tidyr}} could be added to the recommended packages and examples like 
this could be included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}


> [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`
> ---
>
> Key: ARROW-17143
> URL: https://issues.apache.org/jira/browse/ARROW-17143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.1
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Related to ARROW-8813 ARROW-12099
> The arrow package can convert JSON files to data frames very easily, but 
> {{tidyr::unnest_longer}} is needed for array expansion.
> Wonder if {{tidyr}} could be added to the recommended packages and examples 
> like this could be included in the documentation and test cases.
> {code:r}
> tf <- tempfile()
> on.exit(unlink(tf))
> writeLines('
> { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
> { "hello": 3.25, "world": null }
> { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
>   ', tf)
> arrow::read_json_arrow(tf) |>
>   tidyr::unnest(foo, names_sep = ".") |>
>   tidyr::unnest_longer(foo.bar)
> #> # A tibble: 6 × 3
> #>   hello world foo.bar
> #>   <dbl> <lgl>   <dbl>
> #> 1  3.5  FALSE   1
> #> 2  3.5  FALSE   2
> #> 3  3.25 NA NA
> #> 4  0TRUE3
> #> 5  0TRUE4
> #> 6  0TRUE5
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17143) [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`

2022-07-20 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17143:
-

 Summary: [R] Add examples working with `tidyr::unnest`and 
`tidyr::unnest_longer`
 Key: ARROW-17143
 URL: https://issues.apache.org/jira/browse/ARROW-17143
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 8.0.1
Reporter: SHIMA Tatsuya


Related to ARROW-8813

The arrow package can convert JSON files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed for array expansion.
Wonder if {{tidyr}} could be added to the recommended packages and examples like 
this could be included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17142) `equals` method on Parquet Metadata segfaults when passed `None

2022-07-20 Thread Kshiteej K (Jira)
Kshiteej K created ARROW-17142:
--

 Summary: `equals` method on Parquet Metadata segfaults when passed 
`None
 Key: ARROW-17142
 URL: https://issues.apache.org/jira/browse/ARROW-17142
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Kshiteej K


 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})

# Here metadata is None
metadata = table.schema.metadata

fname = "data.parquet"
pq.write_table(table, fname)

# Get `metadata` from the written file
r_metadata = pq.read_metadata(fname)

# Equals on Metadata segfaults when passed None
r_metadata.equals(metadata) {code}
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17140) Adding Floor Function

2022-07-20 Thread Sahaj Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahaj Gupta closed ARROW-17140.
---
Resolution: Fixed

> Adding Floor Function
> -
>
> Key: ARROW-17140
> URL: https://issues.apache.org/jira/browse/ARROW-17140
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Sahaj Gupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Adding Floor Function



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17067) Implement Substring_Index

2022-07-20 Thread Sahaj Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahaj Gupta closed ARROW-17067.
---
Resolution: Done

> Implement Substring_Index
> -
>
> Key: ARROW-17067
> URL: https://issues.apache.org/jira/browse/ARROW-17067
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Sahaj Gupta
>Assignee: Sahaj Gupta
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Adding Substring_index Function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17141) [C++] Enable selecting nested fields in StructArray with field path

2022-07-20 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17141:
--

 Summary: [C++] Enable selecting nested fields in StructArray with 
field path
 Key: ARROW-17141
 URL: https://issues.apache.org/jira/browse/ARROW-17141
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


Currently selecting a nested field in a StructArray requires multiple selects 
or flattening of schema. It would be more user friendly to provide a field path 
e.g.: field_in_top_struct.field_in_substruct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

