[jira] [Created] (ARROW-15529) [C++] Add rows scanned to open telemetry / profiling

2022-02-02 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15529:
--

 Summary: [C++] Add rows scanned to open telemetry / profiling 
 Key: ARROW-15529
 URL: https://issues.apache.org/jira/browse/ARROW-15529
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jonathan Keane


It's not described at https://duckdb.org/dev/profiling, but this is included in 
DuckDB's profiling and was helpful when looking at their scans to see the 
benefits of predicate pushdown.
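
As a hypothetical illustration (file path made up), this is the kind of query 
where a rows-scanned number in the profiling / open telemetry output would make 
the benefit of pushdown visible:

{code:r}
library(arrow)
library(dplyr)

# Hypothetical dataset path; the filter below should be pushed down to the
# Parquet scan, so far fewer rows are scanned than the dataset contains.
ds <- open_dataset("path/to/parquet_dataset")

ds %>%
  filter(year == 2021) %>%
  select(month, value) %>%
  collect()
{code}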





[jira] [Resolved] (ARROW-15539) [Archery] Add ARROW_JEMALLOC to build options

2022-02-02 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15539.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12325
[https://github.com/apache/arrow/pull/12325]

> [Archery] Add ARROW_JEMALLOC to build options
> -
>
> Key: ARROW-15539
> URL: https://issues.apache.org/jira/browse/ARROW-15539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Elena Henderson
>Assignee: Elena Henderson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
>  
> {code:java}
> $ export ARROW_JEMALLOC=OFF
> $ archery benchmark run --repetitions 1 
> -- Building using CMake version: 3.21.3
> -- The C compiler identification is Clang 11.1.0
> -- The CXX compiler identification is Clang 11.1.0
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: 
> /Users/voltrondata/miniconda3/envs/arrow-commit/bin/arm64-apple-darwin20.0.0-clang
>  - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /Users/voltrondata/miniconda3/envs/arrow-commit/bin/arm64-apple-darwin20.0.0-clang++
>  - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Arrow version: 7.0.0 (full: '7.0.0-SNAPSHOT')
> -- Arrow SO version: 700 (full: 700.0.0)
> -- clang-tidy 12 not found
> -- clang-format 12 not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
> -- infer not found
> -- Found Python3: 
> /Users/voltrondata/miniconda3/envs/arrow-commit/bin/python3.8 (found version 
> "3.8.12") found components: Interpreter 
> -- Using ccache: /Users/voltrondata/miniconda3/envs/arrow-commit/bin/ccache
> -- Found cpplint executable at 
> /opt/homebrew/var/buildkite-agent/builds/test-mac-arm/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/arrow/cpp/build-support/cpplint.py
> -- System processor: arm64
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success
> -- Arrow build warning level: PRODUCTION
> Configured for RELEASE build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: RELEASE
> -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT
> -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT - Failed
> -- Using CONDA approach to find dependencies
> -- Using CONDA_PREFIX for ARROW_PACKAGE_PREFIX: 
> /Users/voltrondata/miniconda3/envs/arrow-commit
> -- Setting (unset) dependency *_ROOT variables: 
> /Users/voltrondata/miniconda3/envs/arrow-commit
> -- ARROW_ABSL_BUILD_VERSION: 20210324.2
> -- ARROW_ABSL_BUILD_SHA256_CHECKSUM: 
> 59b862f50e710277f8ede96f083a5bb8d7c9595376146838b9580be90374ee1f
> -- ARROW_AWSSDK_BUILD_VERSION: 1.8.133
> -- ARROW_AWSSDK_BUILD_SHA256_CHECKSUM: 
> d6c495bc06be5e21dac716571305d77437e7cfd62a2226b8fe48d9ab5785a8d6
> -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.12
> -- ARROW_AWS_CHECKSUMS_BUILD_SHA256_CHECKSUM: 
> 394723034b81cc7cd528401775bc7aca2b12c7471c92350c80a0e2fb9d2909fe
> -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.6.9
> -- ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM: 
> 928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717
> -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5
> -- ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM: 
> f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4
> -- ARROW_BOOST_BUILD_VERSION: 1.75.0
> -- ARROW_BOOST_BUILD_SHA256_CHECKSUM: 
> 267e04a7c0bfe85daf796dedc789c3a27a76707e1c968f0a2a87bb96331e2b61
> -- ARROW_BROTLI_BUILD_VERSION: v1.0.9
> -- ARROW_BROTLI_BUILD_SHA256_CHECKSUM: 
> f9e8d81d0405ba66d181529af42a3354f838c939095ff99930da6aa9cdf6fe46
> -- ARROW_BZIP2_BUILD_VERSION: 1.0.8
> -- ARROW_BZIP2_BUILD_SHA256_CHECKSUM: 
> ab5a03176ee106d3f0fa90e381da478ddae405918153cca248e682cd0c4a2269
> -- ARROW_CARES_BUILD_VERSION: 1.17.2
> -- ARROW_CARES_BUILD_SHA256_CHECKSUM: 
> 4803c844ce20ce510ef0eb83f8ea41fa24ecaae9d280c468c582d2bb25b3913d
> -- ARROW_CRC32C_BUILD_VERSION: 1.1.2
> -- ARROW_CRC32C_BUILD_SHA256_CHECKSUM: 
> ac07840513072b7fcebda6e821068aa04889018f24e10e46181068fb214d7e56
> -- ARROW_GBENCHMARK_BUILD_VERSION: v1.6.0
> -- ARROW_GBENCHMARK_BUILD_SHA256_CHECKSUM: 
> 1f71c72ce08d2c1310011ea6436b31e39ccab8c2db94186d26657d41747c85d6
> -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2
> -- ARROW_GFLAGS_BUILD_SHA256_CHECKSUM: 
> 34af2f15cf7367513b352bdcd2493ab14ce43692d2dcd9dfc499492966c64dcf
> -- ARROW_GLOG_BUILD_VERSION: v0.5.0
> -- ARROW_GLOG_BUILD_SHA256_CHECKSUM: 
> eede71f28371bf39aa69b45de23b329d37214016e2055269b3b5e7cf

[jira] [Resolved] (ARROW-15480) [R] Expand on schema/colnames mismatch error messages

2022-02-03 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15480.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12277
[https://github.com/apache/arrow/pull/12277]

> [R] Expand on schema/colnames mismatch error messages
> -
>
> Key: ARROW-15480
> URL: https://issues.apache.org/jira/browse/ARROW-15480
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In ARROW-14744 extra checks were added for when {{open_dataset()}} is used 
> and there are conflicts between column names coming from the schema vs. those 
> passed in explicitly. We should expand on the messaging and tests for the 
> different possible cases here.
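
As a rough, hypothetical sketch (paths, names, and types made up; this assumes 
the readr-style {{col_names}} argument is passed through for CSV datasets), one 
of the cases the expanded messages should cover:

{code:r}
library(arrow)

# An explicit schema plus col_names that disagree with it: this conflict
# should produce a clear, specific error message.
ds <- open_dataset(
  "data/csv_dir",
  format = "csv",
  schema = schema(x = int32(), y = utf8()),
  col_names = c("a", "b")
)
{code}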





[jira] [Resolved] (ARROW-14169) [R] altrep for factors

2022-02-03 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14169.

Resolution: Fixed

Issue resolved by pull request 11738
[https://github.com/apache/arrow/pull/11738]

> [R] altrep for factors
> --
>
> Key: ARROW-14169
> URL: https://issues.apache.org/jira/browse/ARROW-14169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job

2022-02-07 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15605:
--

 Summary: [CI] [R] Keep using old macos runners on our autobrew CI 
job
 Key: ARROW-15605
 URL: https://issues.apache.org/jira/browse/ARROW-15605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane








[jira] [Assigned] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job

2022-02-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15605:
--

Assignee: Jonathan Keane

> [CI] [R] Keep using old macos runners on our autobrew CI job
> 
>
> Key: ARROW-15605
> URL: https://issues.apache.org/jira/browse/ARROW-15605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>






[jira] [Resolved] (ARROW-14745) [R] Enable true duckdb streaming

2022-02-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14745.

Resolution: Fixed

Issue resolved by pull request 11730
[https://github.com/apache/arrow/pull/11730]

> [R] Enable true duckdb streaming
> 
>
> Key: ARROW-14745
> URL: https://issues.apache.org/jira/browse/ARROW-14745
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-15570) [CI][Nightly] Drop centos-8 R nightly job

2022-02-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15570.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12337
[https://github.com/apache/arrow/pull/12337]

> [CI][Nightly] Drop centos-8 R nightly job
> -
>
> Key: ARROW-15570
> URL: https://issues.apache.org/jira/browse/ARROW-15570
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It has started failing since CentOS 8 went EOL. Followup to ARROW-15038, 
> which removed all of the other CentOS 8 jobs.





[jira] [Created] (ARROW-15606) [CI] [R] Add brew release build

2022-02-07 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15606:
--

 Summary: [CI] [R] Add brew release build
 Key: ARROW-15606
 URL: https://issues.apache.org/jira/browse/ARROW-15606
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane
Assignee: Jonathan Keane








[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package

2022-02-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15606:
---
Summary: [CI] [R] Add brew build that exercises the R package  (was: [CI] 
[R] Add brew release build)

> [CI] [R] Add brew build that exercises the R package
> 
>
> Key: ARROW-15606
> URL: https://issues.apache.org/jira/browse/ARROW-15606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>






[jira] [Resolved] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job

2022-02-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15605.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12363
[https://github.com/apache/arrow/pull/12363]

> [CI] [R] Keep using old macos runners on our autobrew CI job
> 
>
> Key: ARROW-15605
> URL: https://issues.apache.org/jira/browse/ARROW-15605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-15616) [R] [CI] Fail the build if there is a documentation mismatch?

2022-02-08 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15616:
--

 Summary: [R] [CI] Fail the build if there is a documentation 
mismatch?
 Key: ARROW-15616
 URL: https://issues.apache.org/jira/browse/ARROW-15616
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane


We want to make sure that we are documenting as we go. One possible solution is 
to add a CI job that fails the build if {{roxygenize()}} writes anything new out 
(h/t Neal).
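
A minimal sketch of what such a job could run from the {{r/}} directory (the 
wiring here is an assumption, not a worked-out design):

{code:r}
# Regenerate the docs, then fail if roxygen2 changed anything on disk.
roxygen2::roxygenize(".")
status <- system("git diff --exit-code -- man NAMESPACE")
if (status != 0L) {
  stop("Documentation is out of date: run roxygen2::roxygenize() and commit the result.")
}
{code}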





[jira] [Resolved] (ARROW-15020) [R] Add bindings for new dataset writing options

2022-02-08 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15020.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12185
[https://github.com/apache/arrow/pull/12185]

> [R] Add bindings for new dataset writing options
> 
>
> Key: ARROW-15020
> URL: https://issues.apache.org/jira/browse/ARROW-15020
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Added a child task to do R bindings separately. 





[jira] [Commented] (ARROW-15654) TPC-H Data Generator Node

2022-02-10 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490418#comment-17490418
 ] 

Jonathan Keane commented on ARROW-15654:


ARROW-3998 also exists and has some discussion there — any reason to not use 
that to track this work?

> TPC-H Data Generator Node
> -
>
> Key: ARROW-15654
> URL: https://issues.apache.org/jira/browse/ARROW-15654
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>
> To aid with benchmarking and profiling the engine on TPC-H, I'd like to build 
> an arrow-native TPC-H data generator scan node, and then later implement exec 
> plans for each TPC-H query in a benchmark program.





[jira] [Comment Edited] (ARROW-15654) TPC-H Data Generator Node

2022-02-10 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490418#comment-17490418
 ] 

Jonathan Keane edited comment on ARROW-15654 at 2/10/22, 6:18 PM:
--

ARROW-3998 also exists and has some discussion there — any reason to not use 
that to track this work? I'm happy to do the jira work to make that happen if 
we want to track it there, of course!


was (Author: jonkeane):
ARROW-3998 also exists and has some discussion there — any reason to not use 
that to track this work?

> TPC-H Data Generator Node
> -
>
> Key: ARROW-15654
> URL: https://issues.apache.org/jira/browse/ARROW-15654
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Sasha Krassovsky
>Assignee: Sasha Krassovsky
>Priority: Major
>
> To aid with benchmarking and profiling the engine on TPC-H, I'd like to build 
> an arrow-native TPC-H data generator scan node, and then later implement exec 
> plans for each TPC-H query in a benchmark program.





[jira] [Created] (ARROW-15655) [R] Find a way to make the size of our vignettes smaller

2022-02-10 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15655:
--

 Summary: [R] Find a way to make the size of our vignettes smaller 
 Key: ARROW-15655
 URL: https://issues.apache.org/jira/browse/ARROW-15655
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane


Or move them such that they are on our website, but not shipped with the source 
package.
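
One possible sketch of the second option (an assumption, not a decision): keep 
the vignettes as pkgdown articles but exclude them from the source tarball.

{code:r}
# Adds ^vignettes$ to .Rbuildignore so the directory is dropped from the
# built source package; pkgdown can still render the articles for the site.
usethis::use_build_ignore("vignettes")
{code}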





[jira] [Updated] (ARROW-15656) [R] Valgrind error with C-data interface

2022-02-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Description: 
This is currently failing on our valgrind nightly:

{code}
==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

though likely was from 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0

  was:
This is currently failing on our valgrind nightly:

{code}

==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

though likely was from 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0


> [R] Valgrind error with C-data interface
> 
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jonathan Keane
>Priority: Major
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: 

[jira] [Created] (ARROW-15656) [R] Valgrind error with C-data interface

2022-02-10 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15656:
--

 Summary: [R] Valgrind error with C-data interface
 Key: ARROW-15656
 URL: https://issues.apache.org/jira/browse/ARROW-15656
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jonathan Keane


This is currently failing on our valgrind nightly:

{code}

==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

though likely was from 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0





[jira] [Updated] (ARROW-15656) [R] Valgrind error with C-data interface

2022-02-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Description: 
This is currently failing on our valgrind nightly:

{code}
==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

Though it could be from: 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 
which added some code to make a source node from the C-Data interface.

However, the first call looks like it might be the line 
https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
 

  was:
This is currently failing on our valgrind nightly:

{code}
==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

though likely was from 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0


> [R] Valgrind error with C-data interface
> 
>
> 

[jira] [Updated] (ARROW-15656) [C++] [R] Valgrind error with C-data interface

2022-02-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Summary: [C++] [R] Valgrind error with C-data interface  (was: [R] Valgrind 
error with C-data interface)

> [C++] [R] Valgrind error with C-data interface
> --
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jonathan Keane
>Priority: Major
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: R_execClosure (eval.c:1918)
> ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
> ==10301==  Uninitialised value was created by a heap allocation
> ==10301==at 0x483E0F0: memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0x483E212: posix_memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0xF4756DF: arrow::(anonymous 
> namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
> (memory_pool.cc:365)
> ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) 
> (memory_pool.cc:557)
> ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
> ==10301==by 0xF041EC2: arrow::Status 
> GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1} const&) (memorypool.cpp:46)
> ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
> (memorypool.cpp:28)
> ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) 
> (memory_pool.cc:921)
> ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
> (memory_pool.cc:945)
> ==10301==by 0xF478A74: ResizePoolBuffer, 
> std::unique_ptr > (memory_pool.cc:984)
> ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
> (memory_pool.cc:992)
> ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
> (buffer.cc:174)
> ==10301==by 0xF38CC77: arrow::(anonymous 
> namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > 
> const&, arrow::MemoryPool*, std::shared_ptr*) 
> (concatenate.cc:81)
> ==10301== 
>   test-dataset.R:852:3 [success]
> {code}
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> It surfaced with 
> https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0
> Though it could be from: 
> https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0
>  which added some code to make a source node from the C-Data interface.
> However, the first call looks like it might be the line 
> https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
>  





[jira] [Updated] (ARROW-15656) [C++] [R] Valgrind error with C-data interface

2022-02-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Component/s: C++
 R

> [C++] [R] Valgrind error with C-data interface
> --
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: R_execClosure (eval.c:1918)
> ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
> ==10301==  Uninitialised value was created by a heap allocation
> ==10301==at 0x483E0F0: memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0x483E212: posix_memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0xF4756DF: arrow::(anonymous 
> namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
> (memory_pool.cc:365)
> ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) 
> (memory_pool.cc:557)
> ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
> ==10301==by 0xF041EC2: arrow::Status 
> GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1} const&) (memorypool.cpp:46)
> ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
> (memorypool.cpp:28)
> ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) 
> (memory_pool.cc:921)
> ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
> (memory_pool.cc:945)
> ==10301==by 0xF478A74: ResizePoolBuffer, 
> std::unique_ptr > (memory_pool.cc:984)
> ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
> (memory_pool.cc:992)
> ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
> (buffer.cc:174)
> ==10301==by 0xF38CC77: arrow::(anonymous 
> namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > 
> const&, arrow::MemoryPool*, std::shared_ptr*) 
> (concatenate.cc:81)
> ==10301== 
>   test-dataset.R:852:3 [success]
> {code}
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> It surfaced with 
> https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0
> Though it could be from: 
> https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0
>  which added some code to make a source node from the C-Data interface.
> However, the first call looks like it might be the line 
> https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
>  





[jira] [Commented] (ARROW-15299) [R] investigate {remotes} dependencies "soft" vs TRUE

2022-02-10 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490581#comment-17490581
 ] 

Jonathan Keane commented on ARROW-15299:


Have we upstreamed these conclusions to the issue on {remotes}?

> [R] investigate {remotes} dependencies "soft" vs TRUE 
> --
>
> Key: ARROW-15299
> URL: https://issues.apache.org/jira/browse/ARROW-15299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 8.0.0
>
>
> Although the {{remotes::install_deps()}} docs state {{dependencies == TRUE}} 
> is equivalent to {{{}dependencies == "soft"{}}}, I suspect {{"soft"}} is a 
> bit more recursive than {{{}TRUE{}}}, resulting in the installation of many 
> more packages.
> {quote}TRUE is shorthand for "Depends", "Imports", "LinkingTo" and 
> "Suggests". NA is shorthand for "Depends", "Imports" and "LinkingTo" and is 
> the default. FALSE is shorthand for no dependencies (i.e. just check this 
> package, not its dependencies).
> The value "soft" means the same as TRUE, "hard" means the same as NA.
> {quote}
> I noticed, when using {{dependencies = "soft"}}, that my session was being 
> clogged up by package installations lasting well over 40 minutes.
> It would be good to time box this to a couple of hours. 
> The direction in which I would go would be to understand if there is any 
> difference in the size of the dependency trees + come up with a minimal 
> reproducible example. 
> Edit (12 January, 2021): the output could be a table comparing  
> {code:r}
> remotes::install_deps(dependencies = TRUE)
> remotes::install_deps(dependencies = "hard")
> remotes::install_deps(dependencies = c("hard", "Config..."))
> remotes::install_deps(dependencies = "soft")
> remotes::install_deps(dependencies = c("soft", "Config..."))
> {code}





[jira] [Created] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction

2022-02-11 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15664:
--

 Summary: [C++] parquet reader Segfaults with illegal SIMD 
instruction 
 Key: ARROW-15664
 URL: https://issues.apache.org/jira/browse/ARROW-15664
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 7.0.0
Reporter: Jonathan Keane
 Fix For: 7.0.1, 8.0.0


When compiling with {{-Os}} (or with build type {{MinSizeRel}}) and running the 
parquet tests (in R at least, though I imagine pyarrow and C++ will have the 
same issues!), we get a segfault with an illegal opcode when trying to read 
parquet files on systems that don't have BMI2 available. (It turns out the 
GitHub runners for macOS don't have BMI2, so this is easily testable there!)

Somehow the optimization, combined with the way our runtime detection code 
works, causes the runtime detection we normally use for this to fail (though it 
works just fine with {{-O2}}, {{-O3}}, etc.).

When diagnosing this, I created a branch + PR that runs our R tests after 
installing from brew, which can reliably reproduce this: 
https://github.com/apache/arrow/pull/12364. Other test suites that exercise 
parquet reading would probably have the same issue (as would C++ tests built 
with {{-Os}}).

Here's a coredump:
{code}
2491 Thread_829819
+ 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
+   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
+ 2491 void* 
std::__1::__thread_proxy >, 
arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
arrow.so) + 380  [0x109203749]
+   2491 arrow::internal::FnOnce::operator()() &&  (in arrow.so) + 
26  [0x109201f30]
+ 2491 arrow::internal::FnOnce::FnImpl >&, 
parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4&, unsigned long&, 
std::__1::shared_ptr > >::invoke()  (in 
arrow.so) + 98  [0x108f125c2]
+   2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4::operator()(unsigned long, 
std::__1::shared_ptr) const  (in arrow.so) + 
47  [0x108f11ed5]
+ 2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector > const&, parquet::arrow::ColumnReader*, 
std::__1::shared_ptr*)  (in arrow.so) + 273  [0x108f0c037]
+   2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
std::__1::shared_ptr*)  (in arrow.so) + 39  [0x108f0733b]
+ 2491 parquet::arrow::(anonymous 
namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  [0x108f0794b]
+   2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader 
>::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
+ 2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader 
>::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
+   2491 void 
parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long 
long, parquet::internal::LevelInfo, 
parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
[0x108fc2a5a]
+ 2491 long long 
parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long 
long, long long, parquet::internal::LevelInfo, 
arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
+   2491 ???  (in )  [0x61354518]
+ 2491 _sigtramp  (in libsystem_platform.dylib) + 
29  [0x7ff801c57e2d]
+   2491 sigactionSegv  (in libR.dylib) + 649  
[0x1042598c9]  main.c:625
+ 2491 Rstd_ReadConsole  (in libR.dylib) + 2042 
 [0x10435160a]  sys-std.c:1044
+   2491 R_SelectEx  (in libR.dylib) + 308  
[0x104350854]  sys-std.c:178
+ 2491 __select  (in 
libsystem_kernel.dylib) + 10  [0x7ff801c0de4a]
{code}

And then a disassembly (where you can see a SHLX that shouldn't be there):

{code}
Dump of assembler code from 0x13ac6db00 to 0x13ac6db99ff:
 ...
--Type  for more, q to quit, c to continue without paging--
   0x00013ac6db82:  mov    $0x8,%ecx
   0x00013ac6db87:  sub    %rax,%rcx
   0x00013ac6db8a:  lea    0xf1520b(%rip),%rdi        # 0x13bb82d9c
   0x00013ac6db91:  movzbl (%rcx,%rdi,1),%edi
   0x00013ac6db95:  mov    %esi,%ebx
   0x00013ac6db97:  and    %edi,%ebx
=> 0x00013ac6db99:  shlx   %rax,%rbx,%rax
   0x00013ac6db9e:  or     0x18(%r15),%al
   0x00013ac6dba2:  mov    %al,0x18(%r15)
   0x00013ac6dba6:  cmp    %rdx,%rcx
   0x00013ac6dba9:  jg     0x13ac6dbf5
   0x00013ac6dbab:  mov    %al,(%

[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction

2022-02-11 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490940#comment-17490940
 ] 

Jonathan Keane commented on ARROW-15664:


If you want to build with {{MinSizeRel}}, you'll need to add
{code}
elseif("${CMAKE_BUILD_TYPE}" STREQUAL "MINSIZEREL")
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${C_FLAGS_MINSIZEREL}")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_MINSIZEREL}")
{code}

to 
https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/cmake_modules/SetupCxxFlags.cmake#L633-L647
for that to work.

> [C++] parquet reader Segfaults with illegal SIMD instruction 
> -
>
> Key: ARROW-15664
> URL: https://issues.apache.org/jira/browse/ARROW-15664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.0
>Reporter: Jonathan Keane
>Priority: Critical
> Fix For: 7.0.1, 8.0.0
>
>
> When compiling with {{-Os}} (or with build type {{MinSizeRel}}) and running the 
> parquet tests (in R at least, though I imagine pyarrow and C++ will have the 
> same issues!), we get a segfault with an illegal opcode when trying to read 
> parquet files on systems that don't have BMI2 available. (It turns out the 
> GitHub runners for macOS don't have BMI2, so this is easily testable there!)
> Somehow the optimization, combined with the way our runtime detection code 
> works, causes the runtime detection we normally use for this to fail (though 
> it works just fine with {{-O2}}, {{-O3}}, etc.).
> When diagnosing this, I created a branch + PR that runs our R tests after 
> installing from brew, which can reliably reproduce this: 
> https://github.com/apache/arrow/pull/12364. Other test suites that exercise 
> parquet reading would probably have the same issue (as would C++ tests built 
> with {{-Os}}).
> Here's a coredump:
> {code}
> 2491 Thread_829819
> + 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
> +   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
> + 2491 void* 
> std::__1::__thread_proxy  std::__1::default_delete >, 
> arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
> arrow.so) + 380  [0x109203749]
> +   2491 arrow::internal::FnOnce::operator()() &&  (in arrow.so) 
> + 26  [0x109201f30]
> + 2491 arrow::internal::FnOnce ()>::FnImpl arrow::Future >&, 
> parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4&, unsigned long&, 
> std::__1::shared_ptr > >::invoke()  (in 
> arrow.so) + 98  [0x108f125c2]
> +   2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4::operator()(unsigned long, 
> std::__1::shared_ptr) const  (in arrow.so) 
> + 47  [0x108f11ed5]
> + 2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector std::__1::allocator > const&, parquet::arrow::ColumnReader*, 
> std::__1::shared_ptr*)  (in arrow.so) + 273  
> [0x108f0c037]
> +   2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
> std::__1::shared_ptr*)  (in arrow.so) + 39  [0x108f0733b]
> + 2491 parquet::arrow::(anonymous 
> namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  
> [0x108f0794b]
> +   2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
> + 2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
> +   2491 void 
> parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long 
> long, parquet::internal::LevelInfo, 
> parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
> [0x108fc2a5a]
> + 2491 long long 
> parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long 
> long, long long, parquet::internal::LevelInfo, 
> arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
> +   2491 ???  (in )  [0x61354518]
> + 2491 _sigtramp  (in libsystem_platform.dylib) + 
> 29  [0x7ff801c57e2d]
> +   2491 sigactionSegv  (in libR.dylib) + 649  
> [0x1042598c9]  main.c:625
> + 2491 Rstd_ReadConsole  (in libR.dylib) + 
> 2042  [0x1043516

[jira] [Commented] (ARROW-15299) [R] investigate {remotes} dependencies "soft" vs TRUE

2022-02-11 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490985#comment-17490985
 ] 

Jonathan Keane commented on ARROW-15299:


What you've got here is great. It's a clear explanation of how {{c('soft', 
'Config/Needs/website')}} doesn't do what we need, how {{c('hard', 
'Config/Needs/website')}} does, and how {{dependencies = TRUE}} is not the same 
as {{dependencies = "hard"}} (which could have a clearer message in the docs).
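
For the record, a minimal sketch of how the sizes can be compared without 
installing anything (run from the package source directory; assumes 
{{remotes::dev_package_deps()}} accepts the same {{dependencies}} values as 
{{install_deps()}}):

{code:r}
# Resolve, but do not install, the dependency set for each value and compare.
deps_true <- remotes::dev_package_deps(".", dependencies = TRUE)
deps_soft <- remotes::dev_package_deps(".", dependencies = "soft")
deps_hard <- remotes::dev_package_deps(".", dependencies = "hard")
c(true = nrow(deps_true), soft = nrow(deps_soft), hard = nrow(deps_hard))
{code}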



> [R] investigate {remotes} dependencies "soft" vs TRUE 
> --
>
> Key: ARROW-15299
> URL: https://issues.apache.org/jira/browse/ARROW-15299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 8.0.0
>
>
> Although the {{remotes::install_deps()}} docs state {{dependencies == TRUE}} 
> is equivalent to {{{}dependencies == "soft"{}}}, I suspect {{"soft"}} is a 
> bit more recursive than {{{}TRUE{}}}, resulting in the installation of many 
> more packages.
> {quote}TRUE is shorthand for "Depends", "Imports", "LinkingTo" and 
> "Suggests". NA is shorthand for "Depends", "Imports" and "LinkingTo" and is 
> the default. FALSE is shorthand for no dependencies (i.e. just check this 
> package, not its dependencies).
> The value "soft" means the same as TRUE, "hard" means the same as NA.
> {quote}
> I noticed, when using {{dependencies = "soft"}}, that my session was being 
> clogged up by package installations lasting well over 40 minutes.
> It would be good to time box this to a couple of hours. 
> The direction in which I would go would be to understand if there is any 
> difference in the size of the dependency trees + come up with a minimal 
> reproducible example. 
> Edit (12 January, 2021): the output could be a table comparing  
> {code:r}
> remotes::install_deps(dependencies = TRUE)
> remotes::install_deps(dependencies = "hard")
> remotes::install_deps(dependencies = c("hard", "Config..."))
> remotes::install_deps(dependencies = "soft")
> remotes::install_deps(dependencies = c("soft", "Config..."))
> {code}





[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction

2022-02-11 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491139#comment-17491139
 ] 

Jonathan Keane commented on ARROW-15664:


> Can it be reproduced without Homebrew?

Yes, if you build Arrow with `-DCMAKE_BUILD_TYPE=MinSizeRel` (and possibly even 
just `-Os`) you can reproduce the segfault.

> [C++] parquet reader Segfaults with illegal SIMD instruction 
> -
>
> Key: ARROW-15664
> URL: https://issues.apache.org/jira/browse/ARROW-15664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.0
>Reporter: Jonathan Keane
>Priority: Critical
> Fix For: 7.0.1, 8.0.0
>
>
> When compiling with {{-Os}} (or with build type {{MinSizeRel}}) and running the 
> parquet tests (in R at least, though I imagine pyarrow and C++ will have the 
> same issues!), we get a segfault with an illegal opcode when trying to read 
> parquet files on systems that don't have BMI2 available. (It turns out the 
> GitHub runners for macOS don't have BMI2, so this is easily testable there!)
> Somehow the optimization, combined with the way our runtime detection code 
> works, causes the runtime detection we normally use for this to fail (though 
> it works just fine with {{-O2}}, {{-O3}}, etc.).
> When diagnosing this, I created a branch + PR that runs our R tests after 
> installing from brew, which can reliably reproduce this: 
> https://github.com/apache/arrow/pull/12364. Other test suites that exercise 
> parquet reading would probably have the same issue (as would C++ tests built 
> with {{-Os}}).
> Here's a coredump:
> {code}
> 2491 Thread_829819
> + 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
> +   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
> + 2491 void* 
> std::__1::__thread_proxy  std::__1::default_delete >, 
> arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
> arrow.so) + 380  [0x109203749]
> +   2491 arrow::internal::FnOnce::operator()() &&  (in arrow.so) 
> + 26  [0x109201f30]
> + 2491 arrow::internal::FnOnce ()>::FnImpl arrow::Future >&, 
> parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4&, unsigned long&, 
> std::__1::shared_ptr > >::invoke()  (in 
> arrow.so) + 98  [0x108f125c2]
> +   2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4::operator()(unsigned long, 
> std::__1::shared_ptr) const  (in arrow.so) 
> + 47  [0x108f11ed5]
> + 2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector std::__1::allocator > const&, parquet::arrow::ColumnReader*, 
> std::__1::shared_ptr*)  (in arrow.so) + 273  
> [0x108f0c037]
> +   2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
> std::__1::shared_ptr*)  (in arrow.so) + 39  [0x108f0733b]
> + 2491 parquet::arrow::(anonymous 
> namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  
> [0x108f0794b]
> +   2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
> + 2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
> +   2491 void 
> parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long 
> long, parquet::internal::LevelInfo, 
> parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
> [0x108fc2a5a]
> + 2491 long long 
> parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long 
> long, long long, parquet::internal::LevelInfo, 
> arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
> +   2491 ???  (in )  [0x61354518]
> + 2491 _sigtramp  (in libsystem_platform.dylib) + 
> 29  [0x7ff801c57e2d]
> +   2491 sigactionSegv  (in libR.dylib) + 649  
> [0x1042598c9]  main.c:625
> + 2491 Rstd_ReadConsole  (in libR.dylib) + 
> 2042  [0x10435160a]  sys-std.c:1044
> +   2491 R_SelectEx  (in libR.dylib) + 308  
> [0x104350854]  sys-std.c:178
> + 2491 __select  (in 
> libsystem_kernel.dylib) + 10  [0x7ff801c0de4a]
> {code}
> And then a disassembly (where you can see

[jira] [Created] (ARROW-15673) [R] Error gracefully if DuckDB isn't installed

2022-02-13 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15673:
--

 Summary: [R] Error gracefully if DuckDB isn't installed
 Key: ARROW-15673
 URL: https://issues.apache.org/jira/browse/ARROW-15673
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane
Assignee: Dragoș Moldovan-Grünfeld


Right now, the function {{to_duckdb()}} doesn't check to confirm that DuckDB is 
available. The error message isn't the worst (it'll mention {{duckdb::...}} not 
being found), but it would be nicer to specifically tell folks they need to 
install the duckdb package.
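
Something along these lines (a sketch only, not the actual implementation) at 
the top of {{to_duckdb()}} would do it:

{code:r}
# Fail early with an actionable message if the duckdb package is missing.
if (!requireNamespace("duckdb", quietly = TRUE)) {
  stop(
    "to_duckdb() requires the duckdb package. ",
    "Install it with install.packages(\"duckdb\").",
    call. = FALSE
  )
}
{code}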





[jira] [Resolved] (ARROW-15606) [CI] [R] Add brew build that exercises the R package

2022-02-14 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15606.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12364
[https://github.com/apache/arrow/pull/12364]

> [CI] [R] Add brew build that exercises the R package
> 
>
> Key: ARROW-15606
> URL: https://issues.apache.org/jira/browse/ARROW-15606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-15678) [C++][CI] a crossbow job with MinSizeRel enabled

2022-02-14 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15678:
--

 Summary: [C++][CI] a crossbow job with MinSizeRel enabled
 Key: ARROW-15678
 URL: https://issues.apache.org/jira/browse/ARROW-15678
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Jonathan Keane
Assignee: Jonathan Keane








[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction

2022-02-14 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492197#comment-17492197
 ] 

Jonathan Keane commented on ARROW-15664:


https://github.com/apache/arrow/pull/12422 has a crossbow build that exercises 
{{-DCMAKE_BUILD_TYPE=MinSizeRel}} outside of brew.

> [C++] parquet reader Segfaults with illegal SIMD instruction 
> -
>
> Key: ARROW-15664
> URL: https://issues.apache.org/jira/browse/ARROW-15664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 7.0.0
>Reporter: Jonathan Keane
>Priority: Critical
> Fix For: 7.0.1, 8.0.0
>
>
> When compiling with {{-Os}} (or with build type {{MinSizeRel}}) and running the 
> parquet tests (in R at least, though I imagine pyarrow and C++ will have the 
> same issues!), we get a segfault with an illegal opcode when trying to read 
> parquet files on systems that don't have BMI2 available. (It turns out the 
> GitHub runners for macOS don't have BMI2, so this is easily testable there!)
> Somehow the optimization, combined with the way our runtime detection code 
> works, causes the runtime detection we normally use for this to fail (though 
> it works just fine with {{-O2}}, {{-O3}}, etc.).
> When diagnosing this, I created a branch + PR that runs our R tests after 
> installing from brew, which can reliably reproduce this: 
> https://github.com/apache/arrow/pull/12364. Other test suites that exercise 
> parquet reading would probably have the same issue (as would C++ tests built 
> with {{-Os}}).
> Here's a coredump:
> {code}
> 2491 Thread_829819
> + 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
> +   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
> + 2491 void* 
> std::__1::__thread_proxy  std::__1::default_delete >, 
> arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
> arrow.so) + 380  [0x109203749]
> +   2491 arrow::internal::FnOnce::operator()() &&  (in arrow.so) 
> + 26  [0x109201f30]
> + 2491 arrow::internal::FnOnce ()>::FnImpl arrow::Future >&, 
> parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4&, unsigned long&, 
> std::__1::shared_ptr > >::invoke()  (in 
> arrow.so) + 98  [0x108f125c2]
> +   2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4::operator()(unsigned long, 
> std::__1::shared_ptr) const  (in arrow.so) 
> + 47  [0x108f11ed5]
> + 2491 parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector std::__1::allocator > const&, parquet::arrow::ColumnReader*, 
> std::__1::shared_ptr*)  (in arrow.so) + 273  
> [0x108f0c037]
> +   2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
> std::__1::shared_ptr*)  (in arrow.so) + 39  [0x108f0733b]
> + 2491 parquet::arrow::(anonymous 
> namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  
> [0x108f0794b]
> +   2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
> + 2491 parquet::internal::(anonymous 
> namespace)::TypedRecordReader 
> >::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
> +   2491 void 
> parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long 
> long, parquet::internal::LevelInfo, 
> parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
> [0x108fc2a5a]
> + 2491 long long 
> parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long 
> long, long long, parquet::internal::LevelInfo, 
> arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
> +   2491 ???  (in )  [0x61354518]
> + 2491 _sigtramp  (in libsystem_platform.dylib) + 
> 29  [0x7ff801c57e2d]
> +   2491 sigactionSegv  (in libR.dylib) + 649  
> [0x1042598c9]  main.c:625
> + 2491 Rstd_ReadConsole  (in libR.dylib) + 
> 2042  [0x10435160a]  sys-std.c:1044
> +   2491 R_SelectEx  (in libR.dylib) + 308  
> [0x104350854]  sys-std.c:178
> + 2491 __select  (in 
> libsystem_kernel.dylib) + 10  [0x7ff801c0de4a]
> {code}
> And then a disassembly (where you can see a SHLX that shouldn't be there):

[jira] [Resolved] (ARROW-15013) [R] Expose concatenate at the R level

2022-02-15 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15013.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12324
[https://github.com/apache/arrow/pull/12324]

> [R] Expose concatenate at the R level
> -
>
> Key: ARROW-15013
> URL: https://issues.apache.org/jira/browse/ARROW-15013
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Currently {{arrow::Concatenate()}} is not exposed at the R level. I imagine 
> the preferred way to deal with this from a user perspective is to use 
> {{ChunkedArray$create()}} for this; however, from a developer perspective 
> it's very difficult to replicate this functionality. For example, another 
> package using the Arrow C API might want to use the arrow R package to 
> concatenate record batches with complex nested types instead of implementing 
> it themselves.
> Example usage in C++: 
> https://github.com/apache/arrow/blob/9cf4275a19c994879172e5d3b03ade9a96a10721/r/src/r_to_arrow.cpp#L1215
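
For context, here is the user-level workaround mentioned above next to what the 
ticket asks for; this is a sketch, and the commented-out name for the eventual 
binding is an assumption rather than the merged API:

{code:r}
library(arrow)

a <- Array$create(1:3)
b <- Array$create(4:6)

# Today: ChunkedArray$create() combines the arrays without exposing Concatenate
ca <- ChunkedArray$create(a, b)

# What the ticket asks for: a direct R binding to arrow::Concatenate() that
# returns a single contiguous Array (name assumed for illustration)
# concatenated <- concat_arrays(a, b)
{code}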



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15701) [C++][R] Should month allow integer inputs?

2022-02-16 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493283#comment-17493283
 ] 

Jonathan Keane commented on ARROW-15701:


This should also be accomplishable in the R bindings if we want to do it there 
quickly to unblock things before we decide on (or wait for) the C++ 
implementation.

Basically, somewhere in 
https://github.com/apache/arrow/blob/1b9e76c6b07d557249a949c7c98d00997513d5cc/r/R/dplyr-funcs-datetime.R#L108-L119
 we would check the input type and either cast or follow a slightly different 
path, returning an expression that creates the thing we need.
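
A sketch of the shape that check could take (this is not the merged 
implementation; build_expr() is the internal helper the bindings use, and the 
integer handling below is simplified):

{code:r}
# Sketch only: special-case integer input in the month() binding instead of
# calling the "month" compute function, which only has datetime kernels.
month_binding <- function(x) {
  if (is.numeric(x)) {
    # plain integer input is already a month number; mirror lubridate's check
    stopifnot(all(x >= 1 & x <= 12, na.rm = TRUE))
    x
  } else {
    build_expr("month", x)
  }
}
{code}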

> [C++][R] Should month allow integer inputs?
> ---
>
> Key: ARROW-15701
> URL: https://issues.apache.org/jira/browse/ARROW-15701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> In R, more specifically in {{{}lubridate{}}}, {{month()}} can be used both to 
> get and set the corresponding component of a date. This means {{month()}} 
> accepts integer inputs.
> {code:r}
> suppressPackageStartupMessages(library(lubridate)) 
> month(1:12)
> #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
> month(1:13)
> #> Error in month.numeric(1:13): Values are not in 1:12
> {code}
> Solving this would allow us to implement bindings such as `semester()` in a 
> manner closer to {{{}lubridate{}}}.
> {code:r}
> suppressPackageStartupMessages(library(dplyr))
> suppressPackageStartupMessages(library(lubridate))
> test_df <- tibble(
>   month_as_int = c(1:12, NA),
>   month_as_char_pad = ifelse(month_as_int < 10, paste0("0", month_as_int), 
> month_as_int),
>   dates = as.Date(paste0("2021-", month_as_char_pad, "-15"))
> )
> test_df %>%
>   mutate(
> sem_date = semester(dates),
> sem_month_as_int = semester(month_as_int))
> #> # A tibble: 13 × 5
> #>    month_as_int month_as_char_pad dates      sem_date sem_month_as_int
> #>  1            1 01                2021-01-15        1                1
> #>  2            2 02                2021-02-15        1                1
> #>  3            3 03                2021-03-15        1                1
> #>  4            4 04                2021-04-15        1                1
> #>  5            5 05                2021-05-15        1                1
> #>  6            6 06                2021-06-15        1                1
> #>  7            7 07                2021-07-15        2                2
> #>  8            8 08                2021-08-15        2                2
> #>  9            9 09                2021-09-15        2                2
> #> 10           10 10                2021-10-15        2                2
> #> 11           11 11                2021-11-15        2                2
> #> 12           12 12                2021-12-15        2                2
> #> 13           NA NA                NA               NA               NA
> {code}
> Currently attempts to use {{month()}} with integer inputs errors with:
> {code:r}
> Function 'month' has no kernel matching input types (array[int32])
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch

2022-02-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15468.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12366
[https://github.com/apache/arrow/pull/12366]

> [R] [CI] A crossbow job that tests against DuckDB's dev branch
> --
>
> Key: ARROW-15468
> URL: https://issues.apache.org/jira/browse/ARROW-15468
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> It would be good to test against DuckDB's dev branch to warn us if there are 
> impending changes that break something.
> While we're doing this, we should clean up the existing setup. Currently some 
> of our jobs already do this: 
> https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51
>  
> We should clean this up so that _generally_ builds use the released DuckDB, 
> but we can toggle dev DuckDB (and run a separate build that uses the dev 
> DuckDB optionally)
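
One way the toggle could look on the R side (the environment variable name and 
the dev-package install path are assumptions for illustration, not the merged 
CI change):

{code:r}
# Sketch: install released duckdb by default, and the dev R package only when
# a flag is set for the dedicated dev-DuckDB crossbow job.
if (identical(toupper(Sys.getenv("ARROW_DUCKDB_DEV")), "TRUE")) {
  remotes::install_github("duckdb/duckdb", subdir = "tools/rpkg", build = FALSE)
} else {
  install.packages("duckdb")
}
{code}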



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer

2022-02-16 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15708:
--

 Summary: [R] [CI] skip snappy encoded parquets on clang sanitizer
 Key: ARROW-15708
 URL: https://issues.apache.org/jira/browse/ARROW-15708
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane
Assignee: Jonathan Keane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer

2022-02-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15708:
---
Description: 
Showing up in our nightlies with:

{code}

#17 0x7f5004603315 in 
arrow::Future, 
std::__1::allocator > > > 
arrow::internal::OptionalParallelForAsync, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4&, 
std::__1::shared_ptr, 
std::__1::shared_ptr >(bool, 
std::__1::vector, 
std::__1::allocator > >, 
parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) 
/arrow/cpp/src/arrow/util/parallel.h:95:7
#18 0x7f50046026d9 in parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > 
const&, std::__1::vector > const&, 
arrow::internal::Executor*) /arrow/cpp/src/parquet/arrow/reader.cc:1198:10
#19 0x7f5004601994 in 
parquet::arrow::RowGroupGenerator::ReadOneRowGroup(arrow::internal::Executor*, 
std::__1::shared_ptr, 
int, std::__1::vector > const&) 
/arrow/cpp/src/parquet/arrow/reader.cc:1090:18
#20 0x7f5004658330 in 
parquet::arrow::RowGroupGenerator::operator()()::'lambda'()::operator()() const 
/arrow/cpp/src/parquet/arrow/reader.cc:1064:14
#21 0x7f500465806f in 
std::__1::enable_if
 > ()> > >::value, void>::type 
arrow::detail::ContinueFuture::operator()
 > ()> >, 
arrow::Future
 > ()> > 
>(arrow::Future
 > ()> >, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&) const 
/arrow/cpp/src/arrow/util/future.h:177:9
#22 0x7f5004657c03 in void 
arrow::detail::ContinueFuture::IgnoringArgsIf
 > ()> >, arrow::internal::Empty const&>(std::__1::integral_constant, 
arrow::Future
 > ()> >&&, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&, 
arrow::internal::Empty const&) const /arrow/cpp/src/arrow/util/future.h:186:5
#23 0x7f500465797a in 
arrow::Future::ThenOnComplete::PassthruOnFailure
 >::operator()(arrow::Result const&) && 
/arrow/cpp/src/arrow/util/future.h:599:25
#24 0x7f5005c95bfe in arrow::internal::FnOnce::operator()(arrow::FutureImpl const&) && 
/arrow/cpp/src/arrow/util/functional.h:140:17
#25 0x7f5005c948f5 in 
arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr
 const&, arrow::FutureImpl::CallbackRecord&&, bool) 
/arrow/cpp/src/arrow/util/future.cc:298:7
#26 0x7f5005c94017 in 
arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) 
/arrow/cpp/src/arrow/util/future.cc:327:7
#27 0x7f50042731fe in 
arrow::Future::DoMarkFinished(arrow::Result)
 /arrow/cpp/src/arrow/util/future.h:712:14
#28 0x7f5004272df8 in void 
arrow::Future::MarkFinished(arrow::Status) /arrow/cpp/src/arrow/util/future.h:463:12
#29 0x7f500465244b in arrow::Future 
arrow::internal::Executor::DoTransfer, 
arrow::Status>(arrow::Future, 
bool)::'lambda'(arrow::Status const&)::operator()(arrow::Status const&) 
/arrow/cpp/src/arrow/util/thread_pool.h:217:21
#30 0x7f500465244b in 
arrow::Future::WrapStatusyOnComplete::Callback
 arrow::internal::Executor::DoTransfer, 
arrow::Status>(arrow::Future, 
bool)::'lambda'(arrow::Status const&)>::operator()(arrow::FutureImpl const&) && 
/arrow/cpp/src/arrow/util/future.h:509:9
#31 0x7f5005c95bfe in arrow::internal::FnOnce::operator()(arrow::FutureImpl const&) && 
/arrow/cpp/src/arrow/util/functional.h:140:17
#32 0x7f5005d6f147 in arrow::internal::FnOnce::operator()() && 
/arrow/cpp/src/arrow/util/functional.h:140:17
#33 0x7f5005d6dc65 in 
arrow::internal::WorkerLoop(std::__1::shared_ptr,
 std::__1::__list_iterator) 
/arrow/cpp/src/arrow/util/thread_pool.cc:178:11
#34 0x7f5005d6d67b in 
arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3::operator()() 
const /arrow/cpp/src/arrow/util/thread_pool.cc:349:7
#35 0x7f5005d6d67b in 
decltype(std::__1::forward(fp)())
 
std::__1::__invoke(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&)
 /usr/bin/../include/c++/v1/type_traits:3899:1
#36 0x7f5005d6cf9c in void* 
std::__1::__thread_proxy >, 
arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) 
/usr/bin/../include/c++/v1/thread:291:5
#37 0x7f50182343f8 in start_thread (/lib64/libpthread.so.0+0x93f8)
#38 0x7f50180774c2 in clone (/lib64/libc.so.6+0x1014c2)
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19810&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=4096
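
A sketch of the skip this would add on the R test side (the helper name and the 
environment variable used to detect the sanitizer build are assumptions):

{code:r}
# Hypothetical test helper: skip snappy-compressed parquet tests when running
# under the clang sanitizer nightly, where the trace above shows up.
skip_if_sanitizer <- function() {
  if (nzchar(Sys.getenv("ARROW_R_WITH_SANITIZER"))) {
    testthat::skip("snappy-compressed parquet reads are skipped under the sanitizer")
  }
}

# usage sketch inside a test:
# test_that("reading a snappy-compressed parquet file", {
#   skip_if_sanitizer()
#   ...
# })
{code}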

> [R] [CI] skip snappy encoded parquets on clang sanitizer
> 
>
> Key: ARROW-15708
> URL: https://issues.apache.org/jira/browse/ARROW-15708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>   

[jira] [Commented] (ARROW-15726) [R] Support push-down projection/filtering in datasets / dplyr

2022-02-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494767#comment-17494767
 ] 

Jonathan Keane commented on ARROW-15726:


This might be coincidence (though I suspect not...). Our dataset benchmarks are 
suddenly failing (with at least some of the failures caused by attempting to 
read data that shouldn't be read [1])

Elena has helped narrow down the range of possible commits that this could have 
happened in:

The first commit this might have happened in is 
https://github.com/apache/arrow/commit/a935c81b595d24179e115d64cda944efa93aa0e0
and it for sure happens in 
https://github.com/apache/arrow/commit/afaa92e7e4289d6e4f302cc91810368794e8092b, 
so the change landed in that commit or earlier.

Here's an example buildkite log: 
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/145#ebd7ea7a-3fad-4865-9e73-49200d89ddd6/6-3230

[1] this is a bit in the weeds, so bear with me: The dataset we use for these 
benchmarks includes data in an early year that causes a schema failure 
{{Unsupported cast from string to null using function cast_null}}. The 
benchmarks that we wrote cleverly avoid selecting anything from this first year 
(so if filter pushdown is working, we don't get the error). It _has_ been 
working (for a while now, even with exec nodes), but about three days ago that 
stopped working and the benchmarks started failing
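
For context, the benchmark queries are shaped roughly like this (the dataset 
and column names are illustrative, not the actual benchmark code): when the 
filter is pushed down to the scan, the problematic early partition is never 
read, so the cast error never surfaces.

{code:r}
library(arrow)
library(dplyr)

# Illustrative only: a partitioned dataset where one early partition has a
# column that can't be cast, and a query whose filter excludes that partition.
ds <- open_dataset("path/to/dataset", partitioning = "year")

ds %>%
  filter(year >= 1990) %>%       # should be pushed down to the scan
  select(passenger_count) %>%    # should limit the columns read
  collect()
{code}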

> [R] Support push-down projection/filtering in datasets / dplyr
> --
>
> Key: ARROW-15726
> URL: https://issues.apache.org/jira/browse/ARROW-15726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Weston Pace
>Priority: Major
>
> The following query should read a single column from the target parquet file.
> {noformat}
> open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) 
> %>% collect()
> {noformat}
> Furthermore, it should apply a pushdown filter to the source node allowing 
> parquet row groups to potentially filter out target data.
> At the moment it creates the following exec plan:
> {noformat}
> 3:SinkNode{}
>   2:ProjectNode{projection=[l_tax]}
> 1:FilterNode{filter=(l_tax < 0.01)}
>   0:SourceNode{}
> {noformat}
> There is no projection or filter in the source node.  As a result we end up 
> reading much more data from disk (the entire file) than we need to (at most a 
> single column).
> This _could_ be fixed via heuristics in the dplyr code.  However, it may 
> quickly get complex (for example, the project comes after the filter, so you 
> need to make sure you push down a projection that includes both the columns 
> accessed by the filter and the columns accessed by the projection OR can you 
> push down the projection through a join [yes you can], how do you know which 
> columns to apply to which source node).
> A more complete fix would be to call into some kind of 3rd party optimizer 
> (e.g. calcite) after the plan has been created by dplyr but before it is 
> passed to the execution engine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15731) [C++] Enable joins when data contains a list column

2022-02-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15731:
---
Summary: [C++] Enable joins when data contains a list column  (was:  Enable 
joins when data contains a list column)

> [C++] Enable joins when data contains a list column
> ---
>
> Key: ARROW-15731
> URL: https://issues.apache.org/jira/browse/ARROW-15731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Stephanie Hazlitt
>Priority: Major
>
> Currently, Arrow joins error when the data contains a list column, even when 
> the list column is not a join key:
> ``` r
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
>                    jedi = c(FALSE, TRUE))
> arrow_table(starwars) %>%
>   left_join(jedi) %>%
>   collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Data type list is not supported in join non-key 
> field
> ```
> The ability to join would be a useful enhancement for workflows with tabular 
> data where list columns can be common, and for geospatial workflows where 
> geometry columns are stored as `list` or `fixed_size_list` (thanks 
> [~paleolimbot] for mentioning that use case).
> Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15731) [C++] Enable joins when data contains a list column

2022-02-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15731:
---
Description: 
Currently, Arrow joins error when the data contains a list column, even when 
the list column is not a join key:



{code}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list is not supported in join non-key 
field
{code}

The ability to join would be a useful enhancement for workflows with tabular 
data where list columns can be common, and for geospatial workflows where 
geometry columns are stored as `list` or `fixed_size_list` (thanks 
[~paleolimbot] for mentioning that use case).

Related discussion here: ARROW-14519

 

  was:
Currently Arrow joins with data that contain a list column errors, even when 
the list column is not a join key:



``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list is not supported in join non-key 
field
```

The ability to join would be a useful enhancement for workflows with tabular 
data where list columns can be common, and for geospatial workflows where 
geometry columns are stored as `list` or `fixed_size_list` (thanks 
[~paleolimbot] for mentioning that use case).

Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519

 


> [C++] Enable joins when data contains a list column
> ---
>
> Key: ARROW-15731
> URL: https://issues.apache.org/jira/browse/ARROW-15731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Stephanie Hazlitt
>Priority: Major
>
> Currently, Arrow joins error when the data contains a list column, even when 
> the list column is not a join key:
> {code}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
>                    jedi = c(FALSE, TRUE))
> arrow_table(starwars) %>%
>   left_join(jedi) %>%
>   collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Data type list is not supported in join non-key 
> field
> {code}
> The ability to join would be a useful enhancement for workflows with tabular 
> data where list columns can be common, and for geospatial workflows where 
> geometry columns are stored as `list` or `fixed_size_list` (thanks 
> [~paleolimbot] for mentioning that use case).
> Related discussion here: ARROW-14519
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer

2022-02-19 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15708.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12443
[https://github.com/apache/arrow/pull/12443]

> [R] [CI] skip snappy encoded parquets on clang sanitizer
> 
>
> Key: ARROW-15708
> URL: https://issues.apache.org/jira/browse/ARROW-15708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Showing up in our nightlies with:
> {code}
> #17 0x7f5004603315 in 
> arrow::Future, 
> std::__1::allocator > > > 
> arrow::internal::OptionalParallelForAsync namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4&, 
> std::__1::shared_ptr, 
> std::__1::shared_ptr >(bool, 
> std::__1::vector, 
> std::__1::allocator > 
> >, parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) 
> /arrow/cpp/src/arrow/util/parallel.h:95:7
> #18 0x7f50046026d9 in parquet::arrow::(anonymous 
> namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr  namespace)::FileReaderImpl>, std::__1::vector 
> > const&, std::__1::vector > const&, 
> arrow::internal::Executor*) /arrow/cpp/src/parquet/arrow/reader.cc:1198:10
> #19 0x7f5004601994 in 
> parquet::arrow::RowGroupGenerator::ReadOneRowGroup(arrow::internal::Executor*,
>  std::__1::shared_ptr, 
> int, std::__1::vector > const&) 
> /arrow/cpp/src/parquet/arrow/reader.cc:1090:18
> #20 0x7f5004658330 in 
> parquet::arrow::RowGroupGenerator::operator()()::'lambda'()::operator()() 
> const /arrow/cpp/src/parquet/arrow/reader.cc:1064:14
> #21 0x7f500465806f in 
> std::__1::enable_if
>  > ()> > >::value, void>::type 
> arrow::detail::ContinueFuture::operator()  
> arrow::Future
>  > ()> >, 
> arrow::Future
>  > ()> > 
> >(arrow::Future
>  > ()> >, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&) 
> const /arrow/cpp/src/arrow/util/future.h:177:9
> #22 0x7f5004657c03 in void 
> arrow::detail::ContinueFuture::IgnoringArgsIf  
> arrow::Future
>  > ()> >, arrow::internal::Empty const&>(std::__1::integral_constant true>, 
> arrow::Future
>  > ()> >&&, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&, 
> arrow::internal::Empty const&) const /arrow/cpp/src/arrow/util/future.h:186:5
> #23 0x7f500465797a in 
> arrow::Future::ThenOnComplete  
> arrow::Future::PassthruOnFailure
>  >::operator()(arrow::Result const&) && 
> /arrow/cpp/src/arrow/util/future.h:599:25
> #24 0x7f5005c95bfe in arrow::internal::FnOnce const&)>::operator()(arrow::FutureImpl const&) && 
> /arrow/cpp/src/arrow/util/functional.h:140:17
> #25 0x7f5005c948f5 in 
> arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr
>  const&, arrow::FutureImpl::CallbackRecord&&, bool) 
> /arrow/cpp/src/arrow/util/future.cc:298:7
> #26 0x7f5005c94017 in 
> arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) 
> /arrow/cpp/src/arrow/util/future.cc:327:7
> #27 0x7f50042731fe in 
> arrow::Future::DoMarkFinished(arrow::Result)
>  /arrow/cpp/src/arrow/util/future.h:712:14
> #28 0x7f5004272df8 in void 
> arrow::Future::MarkFinished void>(arrow::Status) /arrow/cpp/src/arrow/util/future.h:463:12
> #29 0x7f500465244b in arrow::Future 
> arrow::internal::Executor::DoTransfer arrow::Future, 
> arrow::Status>(arrow::Future, 
> bool)::'lambda'(arrow::Status const&)::operator()(arrow::Status const&) 
> /arrow/cpp/src/arrow/util/thread_pool.h:217:21
> #30 0x7f500465244b in 
> arrow::Future::WrapStatusyOnComplete::Callback
>  arrow::internal::Executor::DoTransfer arrow::Future, 
> arrow::Status>(arrow::Future, 
> bool)::'lambda'(arrow::Status const&)>::operator()(arrow::FutureImpl const&) 
> && /arrow/cpp/src/arrow/util/future.h:509:9
> #31 0x7f5005c95bfe in arrow::internal::FnOnce const&)>::operator()(arrow::FutureImpl const&) && 
> /arrow/cpp/src/arrow/util/functional.h:140:17
> #32 0x7f5005d6f147 in arrow::internal::FnOnce::operator()() && 
> /arrow/cpp/src/arrow/util/functional.h:140:17
> #33 0x7f5005d6dc65 in 
> arrow::internal::WorkerLoop(std::__1::shared_ptr,
>  std::__1::__list_iterator) 
> /arrow/cpp/src/arrow/util/thread_pool.cc:178:11
> #34 0x7f5005d6d67b in 
> arrow::internal::Thread

[jira] [Resolved] (ARROW-14817) [R] Implement bindings for lubridate::tz

2022-02-22 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14817.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12357
[https://github.com/apache/arrow/pull/12357]

> [R] Implement bindings for lubridate::tz
> 
>
> Key: ARROW-14817
> URL: https://issues.apache.org/jira/browse/ARROW-14817
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> This can be achieved via strftime



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14815) [R] Implement bindings for lubridate::semester

2022-02-22 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14815.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12429
[https://github.com/apache/arrow/pull/12429]

> [R] Implement bindings for lubridate::semester
> --
>
> Key: ARROW-14815
> URL: https://issues.apache.org/jira/browse/ARROW-14815
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14826) [R] Implement bindings for lubridate::dst()

2022-02-23 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14826.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12431
[https://github.com/apache/arrow/pull/12431]

> [R] Implement bindings for lubridate::dst()
> ---
>
> Key: ARROW-14826
> URL: https://issues.apache.org/jira/browse/ARROW-14826
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15673) [R] Error gracefully if DuckDB isn't installed

2022-02-23 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15673.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12486
[https://github.com/apache/arrow/pull/12486]

> [R] Error gracefully if DuckDB isn't installed
> --
>
> Key: ARROW-15673
> URL: https://issues.apache.org/jira/browse/ARROW-15673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Right now, the function {{to_duckdb()}} doesn't check to confirm that DuckDB 
> is available. The error message isn't the worst (it'll mention 
> {{duckdb::...}} not being found), but it would be nicer to specifically tell 
> folks they need to install the duckdb package.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15697) [R] Add logo and meta tags to pkgdown site

2022-02-24 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15697.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12439
[https://github.com/apache/arrow/pull/12439]

> [R] Add logo and meta tags to pkgdown site
> --
>
> Key: ARROW-15697
> URL: https://issues.apache.org/jira/browse/ARROW-15697
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Danielle Navarro
>Assignee: Danielle Navarro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The pkgdown site currently doesn't use the Arrow logo and doesn't have nice 
> social media preview images



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498144#comment-17498144
 ] 

Jonathan Keane commented on ARROW-15785:


Do the Python [1] and R [2] benchmarks for single file reads do this?

Oddly(?) The python benchmarks do show a jump around January:
https://conbench.ursa.dev/benchmarks/8c5cc1a939d8485eb6c42af83f82c8c0/
https://conbench.ursa.dev/benchmarks/1b8d2dae6f664fd19579071a7cf7766b/

But the corresponding R ones do not: 
https://conbench.ursa.dev/benchmarks/ca493bf17af84ae5babd97f385b69afc/

[1] 
https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/file_benchmark.py
[2] https://github.com/ursacomputing/arrowbench/blob/main/R/bm-read-file.R

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146
 ] 

Jonathan Keane commented on ARROW-15785:


I think this is the PR that introduced the regression (though I might be 
totally off or it's a different one...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly when there are 
regressions of this magnitude. That 5% threshold is supposed to indicate that 
there's an issue, but we might have it set too low such that there's alarm 
fatigue, and/or we should alert louder when there are this many high-change 
benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and 
we alert at -5) 

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146
 ] 

Jonathan Keane edited comment on ARROW-15785 at 2/25/22, 2:16 PM:
--

I think this is the PR that introduced the regression (though I might be 
totally off or it's a different regression...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly when there are 
regressions of this magnitude. That 5% threshold is supposed to indicate that 
there's an issue, but we might have it set too low such that there's alarm 
fatigue, and/or we should alert louder when there are this many high-change 
benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and 
we alert at -5) 


was (Author: jonkeane):
I think this is the PR that introduced the regression (though I might be 
totally off or it's a different one...) 

https://github.com/apache/arrow/pull/11991#issuecomment-1009216946

And the conbench run: 
https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/

We should probably have the conbench bot alert more loudly that there are 
regressions of this magnitude. That 5% there is supposed to indicate that 
there's an issue, but we might have that set too low such that there's alarm 
fatigue or|and we should alert louder when there are this many high-change 
benchmarks (e.g. the file-read benchmark z-scores range from a -76 to -759, and 
we alert at -5) 

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads

2022-02-26 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498464#comment-17498464
 ] 

Jonathan Keane commented on ARROW-15785:


I've also raised https://github.com/conbench/conbench/issues/307 since this 
should have been alerted a bit more loudly IMO

> [Benchmarks] Add conbench benchmark for single-file parquet reads
> -
>
> Key: ARROW-15785
> URL: https://issues.apache.org/jira/browse/ARROW-15785
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Benchmarking
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Release 7.0.0 introduced a regression in parquet single file reads.  We 
> should add a macro-level benchmark that does single-file reads to help us 
> detect this in the future.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-13616) [R] Cheat Sheet Structure

2022-03-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-13616.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12445
[https://github.com/apache/arrow/pull/12445]

> [R] Cheat Sheet Structure
> -
>
> Key: ARROW-13616
> URL: https://issues.apache.org/jira/browse/ARROW-13616
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 5.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Hi
> I've created a folder on Google Drive that contains:
>  * SVG (Inkscape) drafts for the cheat sheet
>  * Arrow hex icon (SVG)
>  * *A document with the proposed text, please feel free to comment here*
> Link: 
> [https://drive.google.com/drive/folders/1YEJdPuhLCwkl8r3hSBxbnP1hYq04fW13?usp=sharing]
> Please open it and I'll give access to anyone who wants.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15599) [R] Convert a column as a sub-second timestamp from CSV file with the `T` col type option

2022-03-02 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15599.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12474
[https://github.com/apache/arrow/pull/12474]

> [R] Convert a column as a sub-second timestamp from CSV file with the `T` col 
> type option
> -
>
> Key: ARROW-15599
> URL: https://issues.apache.org/jira/browse/ARROW-15599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 6.0.1
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I tried to read the csv column type as timestamp, but I could only get it to 
> work well when `col_types` was not specified.
> I'm sorry if I missed something and this is the expected behavior. (It would 
> be great if you could add an example with `col_types` in the documentation.)
> {code:r}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> t_string <- tibble::tibble(
>   x = "2018-10-07 19:04:05.005"
> )
> write_csv_arrow(t_string, "tmp.csv")
> read_csv_arrow(
>   "tmp.csv",
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x 
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "?",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x 
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: 
> invalid value '2018-10-07 19:04:05.005'
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   as_data_frame = FALSE,
>   skip = 1,
>   timestamp_parsers = "%Y-%m-%d %H:%M:%S"
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: 
> invalid value '2018-10-07 19:04:05.005'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-03-02 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15678:
--

Assignee: (was: Jonathan Keane)

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-03-02 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500305#comment-17500305
 ] 

Jonathan Keane commented on ARROW-15678:


The linked pull request has the start of this, but there's still an 
unidentified segfault in one of the tests 

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15798) [R][C++] Discussion: Plans for date casting from int to support an origin option?

2022-03-03 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500810#comment-17500810
 ] 

Jonathan Keane commented on ARROW-15798:


In response to what Dragoș posted there, I do wonder about date64 being turned 
into POSIXct when it comes back to R. That seems a bit off. There is some 
precision lost going from an integer of milliseconds to a float with fractional 
days (due to float imprecision) if we did date64 -> Date backed by a float, but 
that at least has logical type consistency. Thoughts?

> [R][C++] Discussion: Plans for date casting from int to support an origin 
> option?
> -
>
> Key: ARROW-15798
> URL: https://issues.apache.org/jira/browse/ARROW-15798
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> 2 questions:
> * plans to support an origin option for int -> date32 casting?
> * plans to support double -> date32 casting? 
> ===
> Currently the casting from integer to date works, but assumes epoch 
> (1970-01-01) as the origin. 
> {code:r}
> > a <- Array$create(32L)
> > a$cast(date32())
> Array
> 
> [
>   1970-02-02
> ]
> {code}
> Would it make sense to have an {{origin}} option that would allow the user to 
> fine tune the casting? For example, in R the {{base::as.Date()}} function has 
> such an argument
> {code:r}
> > as.Date(32, origin = "1970-01-02")
> [1] "1970-02-03"
> {code}
> We have a potential workaround in R (once we support date & duration 
> arithmetic), but I was wondering if there might be more general interest in 
> this. 
> A secondary aspect (as my R example shows): R supports casting to date not 
> only from integers but also from doubles. Would there be interest in that? If 
> need be, I can split this into several tickets.  
> Are there any plans in either of these 2 directions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14808) [R] Implement bindings for lubridate::date

2022-03-03 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14808.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12433
[https://github.com/apache/arrow/pull/12433]

> [R] Implement bindings for lubridate::date
> --
>
> Key: ARROW-14808
> URL: https://issues.apache.org/jira/browse/ARROW-14808
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 13h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15838) [C++] Key column behavior in joins

2022-03-03 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15838:
--

 Summary: [C++] Key column behavior in joins
 Key: ARROW-15838
 URL: https://issues.apache.org/jira/browse/ARROW-15838
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jonathan Keane


By default, dplyr (and possibly pandas too?) coalesces the key column for full 
joins so that it holds the (non-null) values from both key columns:

{code}
> left <- tibble::tibble(
  key = c(1, 2),
  A = c(0, 1),  
)  
left_tab <- Table$create(left)
> right <- tibble::tibble(
  key = c(2, 3),
  B = c(0, 1),
)  
right_tab <- Table$create(right)

> left %>% full_join(right) 
Joining, by = "key"
# A tibble: 3 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     1     0    NA
2     2     1     0
3     3    NA     1

> left_tab %>% full_join(right_tab) %>% collect()
# A tibble: 3 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2     1     0    NA
3    NA    NA     1
{code}


And for a right join, we would expect the key from the right table to be in the 
result, but we get the key from the left instead:

{code}
> left <- tibble::tibble(
  key = c(1, 2),
  A = c(0, 1),  
)  
left_tab <- Table$create(left)
> right <- tibble::tibble(
  key = c(2, 3),
  B = c(0, 1),
)  
right_tab <- Table$create(right)
> left %>% right_join(right)
Joining, by = "key"
# A tibble: 2 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2     3    NA     1
> left_tab %>% right_join(right_tab) %>% collect()
# A tibble: 2 × 3
    key     A     B
  <dbl> <dbl> <dbl>
1     2     1     0
2    NA    NA     1
{code}

Additionally, we should be able to keep both key columns with an option (cf 
https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
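
For reference, dplyr already exposes this through the {{keep}} argument of its 
join functions, which is the behavior such an option would need to match (a 
quick illustration using the tables above):

{code:r}
# dplyr retains both key columns (as key.x / key.y) when keep = TRUE instead
# of coalescing them into a single key column.
left %>%
  full_join(right, by = "key", keep = TRUE)
# returns columns key.x, A, key.y, B; unmatched rows get NA in the other key
{code}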
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15743) [R] `skip` not connected up to `skip_rows` on open_dataset despite error messages indicating otherwise

2022-03-04 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15743.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12523
[https://github.com/apache/arrow/pull/12523]

> [R] `skip` not connected up to `skip_rows` on open_dataset despite error 
> messages indicating otherwise
> --
>
> Key: ARROW-15743
> URL: https://issues.apache.org/jira/browse/ARROW-15743
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If I open a dataset of CSVs with a schema, the error message tells me to 
> supply {{`skip = 1`}} if my data contains a header row (to prevent it being 
> read in as data), but only {{skip_rows = 1}} actually works.
> {code:r}
> library(arrow)
> library(dplyr)
> td <- tempfile()
> dir.create(td)
> write_dataset(mtcars, td, format = "csv")
> schema <- schema(mpg = float64(), cyl = float64(), disp = float64(), hp = 
> float64(), 
> drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
> am = float64(), gear = float64(), carb = float64())
> open_dataset(td, format = "csv", schema = schema) %>%
>   collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Could not open CSV input source 
> '/tmp/RtmppZbpeF/file6cec135ed29c/part-0.csv': Invalid: In CSV column #0: Row 
> #1: CSV conversion error to double: invalid value 'mpg'
> #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, 
> size, quoted, &value)
> #> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
> #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  
> parser.VisitColumn(col_index, visit)
> #> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:463  
> arrow::internal::UnwrapOrRaise(maybe_decoded_arrays)
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:445  
> iterator_.Next()
> #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext(&batch)
> #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll(&batches)
> #> ℹ If you have supplied a schema and your data contains a header row, you 
> should supply the argument `skip = 1` to prevent the header being read in as 
> data.
> open_dataset(td, format = "csv", schema = schema, skip = 1) %>%
>   collect()
> #> Error: The following option is supported in "read_delim_arrow" functions 
> but not yet supported here: "skip"
> open_dataset(td, format = "csv", schema = schema, skip_rows = 1) %>%
>   collect()
> #> # A tibble: 32 × 11
> #>  mpg   cyl  disphp  dratwt  qsecvsam  gear  carb
> #>  
> #>  1  21   6  160110  3.9   2.62  16.5 0 1 4 4
> #>  2  21   6  160110  3.9   2.88  17.0 0 1 4 4
> #>  3  22.8 4  108 93  3.85  2.32  18.6 1 1 4 1
> #>  4  21.4 6  258110  3.08  3.22  19.4 1 0 3 1
> #>  5  18.7 8  360175  3.15  3.44  17.0 0 0 3 2
> #>  6  18.1 6  225105  2.76  3.46  20.2 1 0 3 1
> #>  7  14.3 8  360245  3.21  3.57  15.8 0 0 3 4
> #>  8  24.4 4  147.62  3.69  3.19  20   1 0 4 2
> #>  9  22.8 4  141.95  3.92  3.15  22.9 1 0 4 2
> #> 10  19.2 6  168.   123  3.92  3.44  18.3 1 0 4 4
> #> # … with 22 more rows
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15656) [C++] [R] Valgrind error with C-data interface

2022-03-07 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502323#comment-17502323
 ] 

Jonathan Keane commented on ARROW-15656:


cc [~apitrou]

> [C++] [R] Valgrind error with C-data interface
> --
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: R_execClosure (eval.c:1918)
> ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
> ==10301==  Uninitialised value was created by a heap allocation
> ==10301==at 0x483E0F0: memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0x483E212: posix_memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0xF4756DF: arrow::(anonymous 
> namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
> (memory_pool.cc:365)
> ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) 
> (memory_pool.cc:557)
> ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
> ==10301==by 0xF041EC2: arrow::Status 
> GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1} const&) (memorypool.cpp:46)
> ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
> (memorypool.cpp:28)
> ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) 
> (memory_pool.cc:921)
> ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
> (memory_pool.cc:945)
> ==10301==by 0xF478A74: ResizePoolBuffer, 
> std::unique_ptr > (memory_pool.cc:984)
> ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
> (memory_pool.cc:992)
> ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
> (buffer.cc:174)
> ==10301==by 0xF38CC77: arrow::(anonymous 
> namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > 
> const&, arrow::MemoryPool*, std::shared_ptr*) 
> (concatenate.cc:81)
> ==10301== 
>   test-dataset.R:852:3 [success]
> {code}
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> It surfaced with 
> https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0
> Though it could be from: 
> https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0
>  which added some code to make a source node from the C-Data interface.
> However, the first call looks like it might be the line 
> https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15775) [R] Clean up as.* methods to use build_expr()

2022-03-08 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15775.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12563
[https://github.com/apache/arrow/pull/12563]

> [R] Clean up as.* methods to use build_expr()
> -
>
> Key: ARROW-15775
> URL: https://issues.apache.org/jira/browse/ARROW-15775
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Originally raised as part of [PR 
> #12433|https://github.com/apache/arrow/pull/12433].
> {quote}
> This implementation made me think of the various as.* methods we have defined 
> [1] (since this is similar to as.Date()), which all use a simpler setup to 
> create a cast operation. However, I noticed that those are using 
> Expression$create(...) rather than the build_expr(...) helper [2]. 
> build_expr(...) should handle the wrapping of R objects into Scalars (...)
> (...) We should also open a jira, if one doesn't exist, to clean up those 
> as.* methods to use build_expr()
> {quote}
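
A minimal sketch of the difference being discussed, for context. This is 
hypothetical and simplified: {{build_expr()}} and {{Expression$create()}} are 
the internal helpers named above, but the exact cast signature and option 
handling here are assumptions, not the shipped bindings.

{code:r}
# Hypothetical, simplified sketch only -- the real as.* bindings live in the
# arrow R package and may differ in detail.

# Current style: the caller is responsible for `x` already being an Expression.
as_int64_via_expression <- function(x) {
  Expression$create("cast", x, options = list(to_type = int64()))
}

# Proposed style: build_expr() wraps bare R values into Scalars before building
# the call, so the binding behaves the same for Expressions and plain R values.
as_int64_via_build_expr <- function(x) {
  build_expr("cast", x, options = list(to_type = int64()))
}
{code}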



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15701) [R] month() should allow integer inputs

2022-03-09 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15701.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12482
[https://github.com/apache/arrow/pull/12482]

> [R] month() should allow integer inputs
> ---
>
> Key: ARROW-15701
> URL: https://issues.apache.org/jira/browse/ARROW-15701
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> *Conclusion*: we will implement this in the R bindings - month will allow 
> integer input: {{month(int)}} will return {{int}} as long as {{int}} is 
> between 1 and 12.
> ==
> In R, more specifically in {{{}lubridate{}}}, {{month()}} can be used both to 
> get and set the corresponding component of a date. This means {{month()}} 
> accepts integer inputs.
> {code:r}
> suppressPackageStartupMessages(library(lubridate)) 
> month(1:12)
> #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
> month(1:13)
> #> Error in month.numeric(1:13): Values are not in 1:12
> {code}
> Solving this would allow us to implement bindings such as `semester()` in a 
> manner closer to {{{}lubridate{}}}.
> {code:r}
> suppressPackageStartupMessages(library(dplyr))
> suppressPackageStartupMessages(library(lubridate))
> test_df <- tibble(
>   month_as_int = c(1:12, NA),
>   month_as_char_pad = ifelse(month_as_int < 10, paste0("0", month_as_int), 
> month_as_int),
>   dates = as.Date(paste0("2021-", month_as_char_pad, "-15"))
> )
> test_df %>%
>   mutate(
> sem_date = semester(dates),
> sem_month_as_int = semester(month_as_int))
> #> # A tibble: 13 × 5
> #>    month_as_int month_as_char_pad dates      sem_date sem_month_as_int
> #>           <int> <chr>             <date>        <dbl>            <dbl>
> #>  1            1 01                2021-01-15        1                1
> #>  2            2 02                2021-02-15        1                1
> #>  3            3 03                2021-03-15        1                1
> #>  4            4 04                2021-04-15        1                1
> #>  5            5 05                2021-05-15        1                1
> #>  6            6 06                2021-06-15        1                1
> #>  7            7 07                2021-07-15        2                2
> #>  8            8 08                2021-08-15        2                2
> #>  9            9 09                2021-09-15        2                2
> #> 10           10 10                2021-10-15        2                2
> #> 11           11 11                2021-11-15        2                2
> #> 12           12 12                2021-12-15        2                2
> #> 13           NA <NA>              NA               NA               NA
> {code}
> Currently attempts to use {{month()}} with integer inputs errors with:
> {code:r}
> Function 'month' has no kernel matching input types (array[int32])
> {code}
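
Illustrative usage of the behaviour settled on in the conclusion above (a 
sketch against the 8.0.0 behaviour, not a test taken from the codebase):

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

# month() on an integer column should pass values through unchanged as long as
# they fall in 1:12; values outside that range should error.
arrow_table(m = 1:12) %>%
  mutate(month_again = month(m)) %>%
  collect()
{code}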



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15168) [R] Add S3 generics to create main Arrow objects

2022-03-10 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504651#comment-17504651
 ] 

Jonathan Keane commented on ARROW-15168:


This sounds good to me. We do have a few of these helpers (though they aren't 
generics...) like {{arrow_table}}. I'm fine with transitioning all of those to 
{{as_...}} versions of themselves, or we could drop the {{as_}} and repurpose 
them (AFAIK {{arrow_table}} is literally an alias for {{Table$create}} right 
now.)

> [R] Add S3 generics to create main Arrow objects
> 
>
> Key: ARROW-15168
> URL: https://issues.apache.org/jira/browse/ARROW-15168
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>
> Right now we create Tables, RecordBatches, ChunkedArrays, and Arrays using 
> the corresponding {{$create()}} functions (or a few shortcut functions). This 
> works well for converting other Arrow or base R types to Arrow objects but 
> doesn’t work well for objects in other packages (e.g., sf). This is related 
> to ARROW-14378 in that it provides a mechanism for other packages to support 
> writing objects to Arrow in a more Arrow-native form instead of serializing 
> attributes that are unlikely to be readable in other packages. Many of these 
> came up when experimenting with {{carrow}} when trying to provide seamless 
> arrow package compatibility for S3 objects that wrap external pointers to C 
> API data structures. S3 is a good way to do this because the other package 
> doesn't have to put arrow in {{Imports}} since it's a heavy dependency.
> For argument’s sake I’ll propose adding the following methods: 
> -   {{as_arrow_array(x, type = NULL)}} -> {{Array}} 
> -   {{as_arrow_chunked_array(x, type = NULL)}} -> {{ChunkedArray}} 
> -   {{as_arrow_record_batch(x, schema = NULL)}} -> {{RecordBatch}} 
> -   {{as_arrow_table(x, schema = NULL)}} -> {{Table}} 
> -   {{as_arrow_data_type(x)}} -> {{DataType}} 
> -   {{as_arrow_record_batch_reader(x, schema = NULL)}} -> 
> {{RecordBatchReader}} 
> I’ll note that we use {{as_adq()}} internally for similar reasons (to convert a 
> few different object types into an arrow dplyr query when that’s the data 
> structure we need). 
> As part of this ticket, if we choose to move forward, we should implement the 
> default methods with some internal consistency (i.e., somebody wanting to 
> provide Arrow support in a package probably only has to implement 
> {{as_arrow_array()}} to get most support).
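
A minimal sketch of what one of the proposed generics could look like. The 
names come from the list above, but the signature and default method are 
assumptions, not the eventual exported API.

{code:r}
# Hypothetical sketch -- not the implemented API.
as_arrow_array <- function(x, ..., type = NULL) {
  UseMethod("as_arrow_array")
}

# Default method falls back to the existing constructor.
as_arrow_array.default <- function(x, ..., type = NULL) {
  Array$create(x, type = type)
}

# A package like sf could then register a method for its classes without
# putting arrow in Imports, e.g.:
# as_arrow_array.sfc <- function(x, ..., type = NULL) { ... }
{code}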



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12212) [R][CI] Test nightly on solaris

2022-03-10 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504655#comment-17504655
 ] 

Jonathan Keane commented on ARROW-12212:


I'm closing this since we no longer need to jump through this particularly 
special hoop.

> [R][CI] Test nightly on solaris
> ---
>
> Key: ARROW-12212
> URL: https://issues.apache.org/jira/browse/ARROW-12212
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Priority: Major
>
> Followup to ARROW-10734. Setting up a solaris vm on github actions may be 
> possible. We can try to setup https://github.com/vmactions/solaris-vm with R 
> from https://files.r-hub.io/opencsw/. A temporary solution could be a nightly 
> r-hub build kicked off by the arrow-r-nightly CI; it would email me with the 
> results. Not ideal but it would at least alert us to issues closer to when 
> they are merged and not just at release time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (ARROW-12212) [R][CI] Test nightly on solaris

2022-03-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12212.
--
Resolution: Won't Fix

> [R][CI] Test nightly on solaris
> ---
>
> Key: ARROW-12212
> URL: https://issues.apache.org/jira/browse/ARROW-12212
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Priority: Major
>
> Followup to ARROW-10734. Setting up a solaris vm on github actions may be 
> possible. We can try to setup https://github.com/vmactions/solaris-vm with R 
> from https://files.r-hub.io/opencsw/. A temporary solution could be a nightly 
> r-hub build kicked off by the arrow-r-nightly CI; it would email me with the 
> results. Not ideal but it would at least alert us to issues closer to when 
> they are merged and not just at release time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15914) [CI] Separate verification from tests in nightly

2022-03-10 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15914:
--

 Summary: [CI] Separate verification from tests in nightly
 Key: ARROW-15914
 URL: https://issues.apache.org/jira/browse/ARROW-15914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Jonathan Keane


Could we split up the nightly report that has the test builds from the report 
that has the verification builds (and maybe include all the packaging into the 
separate verification builds?)

The verification builds tend to take much longer than the test builds, so they 
are frequently still pending even with 6 hours between starting the test builds 
and running the report.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14199) [R] bindings for format where possible

2022-03-11 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14199.

Resolution: Fixed

Issue resolved by pull request 12319
[https://github.com/apache/arrow/pull/12319]

> [R] bindings for format where possible
> --
>
> Key: ARROW-14199
> URL: https://issues.apache.org/jira/browse/ARROW-14199
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Now that we have {{strftime}}, we should also be able to make bindings for 
> {{format()}} as well. This might be complicated / we might need to punt on a 
> bunch of types that {{format()}} can take but arrow doesn't (yet) support 
> formatting; that's ok. 
> Though some of those might be wrappable with a handful of kernels stacked 
> together: {{format(float)}} might be round + cast to character
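
A rough illustration of the "round + cast" idea. This is a sketch only: the 
option list passed to the round kernel and the existence of a {{digits}} 
argument are assumptions, not the actual {{format()}} binding.

{code:r}
# Hypothetical sketch of stacking kernels to approximate format() for floats.
format_float <- function(x, digits = 2) {
  # option construction here is schematic
  rounded <- Expression$create("round", x, options = list(ndigits = digits))
  rounded$cast(string())
}
{code}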



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14797) [R] write_feather R Arrow freezing on windows 11

2022-03-15 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507062#comment-17507062
 ] 

Jonathan Keane commented on ARROW-14797:


Sorry for the long delay here. Have you tried this again since we released 
7.0.0? 

> [R] write_feather R Arrow freezing on windows 11
> 
>
> Key: ARROW-14797
> URL: https://issues.apache.org/jira/browse/ARROW-14797
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 6.0.1
> Environment: windows 11, rstudio/r 4.1.2
>Reporter: Xavier Timbeau
>Priority: Critical
>
> When writing multiple large files using write_feather (possibly parquet also) 
> on windows 11 using the arrow R package, write_feather is freezing at some 
> point (after a few files copied). Changing cpu_count from 16 to 4 seems to 
> solve the issue.
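
The workaround mentioned above, spelled out with the exported 
{{cpu_count()}}/{{set_cpu_count()}} helpers (whether 4 is the right number for 
a given machine is situational):

{code:r}
library(arrow)

cpu_count()       # inspect the current CPU thread pool size (e.g. 16)
set_cpu_count(4)  # reduce it before writing many large files

# then write as usual, e.g.:
# write_feather(df, "out.feather")
{code}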



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15665) [C++] Add error handling option to StrptimeOptions

2022-03-15 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507179#comment-17507179
 ] 

Jonathan Keane commented on ARROW-15665:


(3) sounds wrong, but like I said before: you ([~dragosmg]) should look at what 
happens in python or whether there is some standard where that is indeed the 
right thing.

(1) + (2) both sound like they could be implemented as "if strptime fails to 
parse, (optionally) return null". No reason for us to go too far into why it 
didn't parse.
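
Illustrative only: the behaviour being asked for, written against a 
hypothetical {{error_is_null}} option on {{StrptimeOptions}}. The final option 
name and how it is exposed in R are assumptions.

{code:r}
library(arrow)

# Hypothetical: with a "return null on parse failure" option enabled, a bad
# string should yield null instead of raising. The options list is schematic.
call_function(
  "strptime",
  Array$create(c("2022-03-15", "not a date")),
  options = list(format = "%Y-%m-%d", error_is_null = TRUE)
)
# expected: [2022-03-15 00:00:00, null]
{code}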

> [C++] Add error handling option to StrptimeOptions
> --
>
> Key: ARROW-15665
> URL: https://issues.apache.org/jira/browse/ARROW-15665
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We want to have an option to either raise, ignore or return NA in case of 
> format mismatch.
> See 
> [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html]
>  and lubridate 
> [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html]
>  for examples.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (ARROW-14797) [R] write_feather R Arrow freezing on windows 11

2022-03-15 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-14797.
--
Resolution: Resolved

> [R] write_feather R Arrow freezing on windows 11
> 
>
> Key: ARROW-14797
> URL: https://issues.apache.org/jira/browse/ARROW-14797
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 6.0.1
> Environment: windows 11, rstudio/r 4.1.2
>Reporter: Xavier Timbeau
>Priority: Critical
>
> When writing multiple large files using write_feather (possibly parquet also) 
> on windows 11 using the arrow R package, write_feather is freezing at some 
> point (after a few files copied). Changing cpu_count from 16 to 4 seems to 
> solve the issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15970) [R] [CI] Re-enable DuckDB dev tests

2022-03-18 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15970:
--

 Summary: [R] [CI] Re-enable DuckDB dev tests
 Key: ARROW-15970
 URL: https://issues.apache.org/jira/browse/ARROW-15970
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, R
Reporter: Jonathan Keane
Assignee: Jonathan Keane


When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should 
re-enable the DuckDB dev branch tests that we disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15970) [R] [CI] Re-enable DuckDB dev tests

2022-03-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15970:
---
Description: 
When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should 
re-enable the DuckDB dev branch tests that we disabled.

PR that disabled: https://github.com/apache/arrow/pull/12666

  was:When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should 
re-enable the DuckDB dev branch tests that we disabled.


> [R] [CI] Re-enable DuckDB dev tests
> ---
>
> Key: ARROW-15970
> URL: https://issues.apache.org/jira/browse/ARROW-15970
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>
> When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should 
> re-enable the DuckDB dev branch tests that we disabled.
> PR that disabled: https://github.com/apache/arrow/pull/12666



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15973) [CI] Split nightly reports into three: Tests, Packaging, Release

2022-03-18 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15973:
--

 Summary: [CI] Split nightly reports into three: Tests, Packaging, 
Release
 Key: ARROW-15973
 URL: https://issues.apache.org/jira/browse/ARROW-15973
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Jonathan Keane
Assignee: Jonathan Keane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15929) [R] io_thread_count is actually the CPU thread count

2022-03-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15929.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12633
[https://github.com/apache/arrow/pull/12633]

> [R] io_thread_count is actually the CPU thread count
> 
>
> Key: ARROW-15929
> URL: https://issues.apache.org/jira/browse/ARROW-15929
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/blob/5cb5afc40547b4f75739e31ff8632c71a10d3084/r/src/threadpool.cpp#L51-L57]
> This accidentally references the wrong pool.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15875) [R][C++] Include md5sum in S3 method for GetFileInfo()

2022-03-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15875.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12623
[https://github.com/apache/arrow/pull/12623]

> [R][C++] Include md5sum in S3 method for GetFileInfo()
> --
>
> Key: ARROW-15875
> URL: https://issues.apache.org/jira/browse/ARROW-15875
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GetFileInfo() seems to include mtime, size, path and type.  For an S3 system, 
> it would be nice to be able to reference the md5 sum without transferring the 
> file (which I think the server will have already computed?). This seems 
> like the logical place to include it (though I wouldn't object to a more 
> visible method too).  
>  
>  
> (though type isn't clear to me, since it appears to be an integer)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15627) [R] Support unify_schemas for union datasets

2022-03-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15627.

Resolution: Fixed

Issue resolved by pull request 12629
[https://github.com/apache/arrow/pull/12629]

> [R] Support unify_schemas for union datasets
> 
>
> Key: ARROW-15627
> URL: https://issues.apache.org/jira/browse/ARROW-15627
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Also out of discussion on [https://github.com/apache/arrow/issues/12371]
> You can unify schemas between different parquet files, but it seems like you 
> can't union together two (or more) datasets that have different schemas. This 
> is odd, because we do compute the unified schema on [this 
> line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
>  only to later assert all the schemas are the same.
> {code:R}
> library(arrow)
> library(dplyr)
> df1 <- arrow_table(x = array(c(1, 2, 3)),
>y = array(c("a", "b", "c")))
> df2 <- arrow_table(x = array(c(4, 5)),
>z = array(c("d", "e")))
> df1 %>% write_dataset("example1", format="parquet")
> df2 %>% write_dataset("example2", format="parquet")
> ds1 <- open_dataset("example1", format="parquet")
> ds2 <- open_dataset("example2", format="parquet")
> # These don't work
> ds <- c(ds1, ds2) # c() actually does the same thing
> ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
> ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas 
> = TRUE)
> # This does
> ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), 
> format="parquet", unify_schemas = TRUE)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14679) [R] [C++] Handle suffix argument in joins

2022-03-21 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14679.

Resolution: Fixed

Issue resolved by pull request 12113
[https://github.com/apache/arrow/pull/12113]

> [R] [C++] Handle suffix argument in joins
> -
>
> Key: ARROW-14679
> URL: https://issues.apache.org/jira/browse/ARROW-14679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available, query-engine
> Fix For: 8.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something 
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting 
> errors when trying). I couldn't tell if there were tests of this; I couldn't 
> find any, so I'm not sure if I'm calling this wrong or if it's not working at 
> all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is 
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to 
> provide new names?). In the tests I wrote I've worked around this, but it 
> would be nice to be able to match dplyr / allow things other than a prefix 
> (see the dplyr sketch below).
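
For reference, the dplyr behaviour the binding needs to match (plain dplyr 
below, not arrow code): suffixes rather than prefixes, applied only to names 
that actually collide.

{code:r}
library(dplyr, warn.conflicts = FALSE)

x <- tibble::tibble(id = 1:2, val = c("a", "b"))
y <- tibble::tibble(id = 1:2, val = c("c", "d"), other = 3:4)

# Only the colliding column `val` gets suffixed; `other` keeps its name,
# giving columns id, val.x, val.y, other.
left_join(x, y, by = "id", suffix = c(".x", ".y"))
{code}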



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15802) [R] Implement bindings for lubridate::make_datetime() and lubridate::make_date()

2022-03-21 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15802.

Resolution: Fixed

Issue resolved by pull request 12622
[https://github.com/apache/arrow/pull/12622]

> [R] Implement bindings for lubridate::make_datetime() and 
> lubridate::make_date()
> 
>
> Key: ARROW-15802
> URL: https://issues.apache.org/jira/browse/ARROW-15802
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> plus {{base::ISOdate()}} and {{base::ISOdatetime()}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15489) [R] Expand RecordBatchReader use-ability

2022-03-21 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15489.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12567
[https://github.com/apache/arrow/pull/12567]

> [R] Expand RecordBatchReader use-ability 
> -
>
> Key: ARROW-15489
> URL: https://issues.apache.org/jira/browse/ARROW-15489
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In ARROW-14745 we thought about having {{to_arrow()}} return a 
> RecordBatchReader only. Though this would work, it's not quite as friendly as 
> wrapping the RecordBatchReader, since {{arrow_dplyr_query}}s have a (slightly) 
> nicer print method.
> We should add more methods and a print method that make it clearer what a 
> RecordBatchReader is and what it might be useful for (e.g. continuing a dplyr 
> query).
> Is it possible that we could make up a name/class that encompasses all of the 
> Arrow tabular-like things, so that we could wrap all of these up in it (for UX 
> purposes only, really)? We have ArrowTabular now; maybe we lean into that 
> more (alongside a LazyArrowTabular like dbplyr has?).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-22 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510845#comment-17510845
 ] 

Jonathan Keane commented on ARROW-16007:


Good catch, as far as I see there was not a principled reason to diverge like 
this.

We would of course welcome a PR to add tests for this + update the Arrow 
{{grepl}} behavior to match base R's. Let us know if you would like any pointers.

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{{}grepl{}}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{{}grepl{}}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3        FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3           NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15857) [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)

2022-03-24 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511856#comment-17511856
 ] 

Jonathan Keane commented on ARROW-15857:


{sass} just had an update (purportedly to fix warnings in gcc), but that has 
not resolved these issues. Though, _on cran_ the devel-fedora-clang build is 
just fine: https://cran.r-project.org/web/checks/check_results_sass.html 

The image itself is a bit stale; I've reported that to rhub to ask them to 
reenable their jobs: https://github.com/r-hub/rhub-linux-builders/issues/59 but 
I don't think that should matter.

The symbol it's complaining about _is_ part of the standard library: 
{{_ZTINSt3__113basic_ostreamIcNS_11char_traitsIc}}, which is slightly 
different in this setup to match the special cran setup; see: 
https://github.com/apache/arrow/blob/623a15e7f7a45578733956714c8dddcc9f66f015/ci/scripts/r_docker_configure.sh#L45-L55
 



> [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)
> --
>
> Key: ARROW-15857
> URL: https://issues.apache.org/jira/browse/ARROW-15857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>
> Starting 2022-03-03, we get a failure on the rhub/fedora-clang-devel nightly 
> build. It seems to be a linking error but nothing in the sass package seems 
> to have changed for some time (last update May 2021).
> https://github.com/ursacomputing/crossbow/runs/5444005154?check_suite_focus=true#step:5:3007
> Build log for the sass package:
> {noformat}
> #14 1099.2 make[1]: Entering directory 
> '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src'
> #14 1099.2 /opt/R-devel/lib64/R/share/make/shlib.mk:18: warning: overriding 
> recipe for target 'shlib-clean'
> #14 1099.2 Makevars:12: warning: ignoring old recipe for target 'shlib-clean'
> #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG 
> -I./libsass/include  -I/usr/local/include   -fpic  -g -O2  -c compile.c -o 
> compile.o
> #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG 
> -I./libsass/include  -I/usr/local/include   -fpic  -g -O2  -c init.c -o init.o
> #14 1099.2 MAKEFLAGS= CC="/usr/bin/clang" CFLAGS="-g -O2 " 
> CXX="/usr/bin/clang++ -std=gnu++14 -stdlib=libc++" AR="ar" 
> LDFLAGS="-L/usr/local/lib64" make -C libsass
> #14 1099.2 make[2]: Entering directory 
> '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src/libsass'
> #14 1099.2 /usr/bin/clang -g -O2  -O2 -I ./include  -fPIC -c -o src/cencode.o 
> src/cencode.c
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast.o src/ast.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_values.o src/ast_values.cpp
> #14 1099.2 src/ast_values.cpp:484:23: warning: loop variable 'numerator' 
> creates a copy from type 'const std::__1::basic_string std::__1::char_traits, std::__1::allocator>' 
> [-Wrange-loop-construct]
> #14 1099.2   for (const auto numerator : numerators)
> #14 1099.2   ^
> #14 1099.2 src/ast_values.cpp:484:12: note: use reference type 'const 
> std::__1::basic_string, 
> std::__1::allocator> &' to prevent copying
> #14 1099.2   for (const auto numerator : numerators)
> #14 1099.2^~
> #14 1099.2   &
> #14 1099.2 src/ast_values.cpp:486:23: warning: loop variable 'denominator' 
> creates a copy from type 'const std::__1::basic_string std::__1::char_traits, std::__1::allocator>' 
> [-Wrange-loop-construct]
> #14 1099.2   for (const auto denominator : denominators)
> #14 1099.2   ^
> #14 1099.2 src/ast_values.cpp:486:12: note: use reference type 'const 
> std::__1::basic_string, 
> std::__1::allocator> &' to prevent copying
> #14 1099.2   for (const auto denominator : denominators)
> #14 1099.2^~~~
> #14 1099.2   &
> #14 1099.2 2 warnings generated.
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_supports.o src/ast_supports.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_cmp.o src/ast_sel_cmp.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_unify.o src/ast_sel_unify.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_super.o src/ast_sel_super.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_weave.o src/ast_sel_weave.cpp
> #14

[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-24 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511971#comment-17511971
 ] 

Jonathan Keane commented on ARROW-16007:


You've definitely identified the two paths for this. I agree with your 
hesitance that empty string and nulls shouldn't be conflated. 

The string conversion from R is a bit complicated, but 
https://github.com/apache/arrow/blob/ddb663b1724034f64cc53d62bd2d5a4e8fa42954/r/src/r_to_arrow.cpp#L777-L824
 (and the rest of that file) is a good starting point.


All of that being said, I would probably go the second route you mention (and 
sorry for not responding with this earlier!):

>  but then it occurred to me that if this is just a special case in R, maybe 
> it's better to do it on the R side and just change NA to FALSE in the return 
> value of the binding of grepl?

You could put a call to {{if_else}} + {{is.na}} bindings inside of the 
{{grepl}} binding and get the behavior in R. We do have some support via 
options for different null handling behaviors for other functions, but I 
suspect R is a bit of an outlier here (I tried to construct a reprex in Python 
to see if I could see what it does, but every `re.match()` with anything 
missing-like is a type error!). 
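
A sketch of that second route, assuming the internal 
{{register_binding()}}/{{call_binding()}} helpers and leaving the underlying 
match logic as a placeholder; this is not the code from the eventual PR.

{code:r}
# Hypothetical sketch only. The str_detect call stands in for however the
# binding builds the match_substring / match_substring_regex Expression.
register_binding("grepl", function(pattern, x, ignore.case = FALSE, fixed = FALSE) {
  match_expr <- call_binding("str_detect", x, pattern)
  # base R grepl() returns FALSE for NA input, so coerce null -> FALSE
  call_binding("if_else", call_binding("is.na", match_expr), FALSE, match_expr)
})
{code}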

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{{}grepl{}}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{{}grepl{}}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3        FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3           NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-24 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512025#comment-17512025
 ] 

Jonathan Keane commented on ARROW-16007:


Yeah the (possibly slightly misnamed) {{call_binding}} builds an Expression 
that will later be evaluated at {{collect}} time. A similar setup is used at 
https://github.com/apache/arrow/blob/012ae6e961dbb472c7862f40be5dc972a9bd3e91/r/R/dplyr-funcs-datetime.R#L221-L226
 in {{ISOdatetime}} to turn {{sec=NA}} into {{0}} (another R oddity!)

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{{}grepl{}}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{{}grepl{}}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3        FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3           NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()

2022-03-24 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15098.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12506
[https://github.com/apache/arrow/pull/12506]

> [R] Add binding for lubridate::duration() and/or as.difftime()
> --
>
> Key: ARROW-15098
> URL: https://issues.apache.org/jira/browse/ARROW-15098
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> After ARROW-14941 we have support for the duration type; however, there is no 
> binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr 
> evaluation that could create these objects. I'm actually not sure if we 
> should bind {{lubridate::duration}} since it returns a custom S4 class that's 
> identical in function to base R's difftime.
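
Illustrative usage only, assuming the 8.0.0 binding accepts a numeric column 
plus a {{units}} argument; the exact supported units are not guaranteed by 
this sketch.

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

arrow_table(secs = c(30, 90)) %>%
  mutate(d = as.difftime(secs, units = "secs")) %>%
  collect()
# expected: `d` comes back as a base R difftime column
{code}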



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16030) [R] Add a schema method for arrow_dplyr_query

2022-03-25 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16030:
--

 Summary: [R] Add a schema method for arrow_dplyr_query
 Key: ARROW-16030
 URL: https://issues.apache.org/jira/browse/ARROW-16030
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane


We have {{implicit_schema()}}, which can generate the final schema for a query, 
though it's not exported. Maybe we "just" export that so people are able to 
get what the schema of the resulting query will be.

Alternatively we could make a {{schema}} (S3) method that would return the 
(implicit) schema with {{schema(query_obj)}}. Though this might be overloading 
{{schema}} since that is not how we retrieve schemas elsewhere (e.g. 
{{schema(arrow_table(mtcars))}} does not currently work).

One use case: https://github.com/duckdb/duckdb/pull/3299
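
A minimal sketch of the second option (an S3 {{schema()}} generic). 
{{implicit_schema()}} is an internal helper today and none of this is 
exported; it is only meant to make the shape of the idea concrete.

{code:r}
# Hypothetical sketch -- not the current API. Note that arrow::schema()
# currently *constructs* a Schema from fields, so making it a generic like
# this would need to be reconciled with that existing use.
schema <- function(x, ...) UseMethod("schema")

# For materialized tabular objects, return the stored schema.
schema.ArrowTabular <- function(x, ...) x$schema

# For a query, return the schema the result would have.
schema.arrow_dplyr_query <- function(x, ...) implicit_schema(x)
{code}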



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15656) [C++] [R] Valgrind error with C-data interface

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15656:
--

Assignee: Jonathan Keane

> [C++] [R] Valgrind error with C-data interface
> --
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: R_execClosure (eval.c:1918)
> ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
> ==10301==  Uninitialised value was created by a heap allocation
> ==10301==at 0x483E0F0: memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0x483E212: posix_memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0xF4756DF: arrow::(anonymous 
> namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
> (memory_pool.cc:365)
> ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) 
> (memory_pool.cc:557)
> ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
> ==10301==by 0xF041EC2: arrow::Status 
> GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1} const&) (memorypool.cpp:46)
> ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
> (memorypool.cpp:28)
> ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) 
> (memory_pool.cc:921)
> ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
> (memory_pool.cc:945)
> ==10301==by 0xF478A74: ResizePoolBuffer, 
> std::unique_ptr > (memory_pool.cc:984)
> ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
> (memory_pool.cc:992)
> ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
> (buffer.cc:174)
> ==10301==by 0xF38CC77: arrow::(anonymous 
> namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > 
> const&, arrow::MemoryPool*, std::shared_ptr*) 
> (concatenate.cc:81)
> ==10301== 
>   test-dataset.R:852:3 [success]
> {code}
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> It surfaced with 
> https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0
> Though it could be from: 
> https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0
>  which added some code to make a source node from the C-Data interface.
> However, the first call looks like it might be the line 
> https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15656) [C++] [R] Make valgrind builds slightly quicker

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Summary: [C++] [R] Make valgrind builds slightly quicker  (was: [C++] [R] 
Valgrind error with C-data interface)

> [C++] [R] Make valgrind builds slightly quicker
> ---
>
> Key: ARROW-15656
> URL: https://issues.apache.org/jira/browse/ARROW-15656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This is currently failing on our valgrind nightly:
> {code}
> ==10301==by 0x49A2184: bcEval (eval.c:7107)
> ==10301==by 0x498DBC8: Rf_eval (eval.c:748)
> ==10301==by 0x4990937: R_execClosure (eval.c:1918)
> ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
> ==10301==  Uninitialised value was created by a heap allocation
> ==10301==at 0x483E0F0: memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0x483E212: posix_memalign (in 
> /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==10301==by 0xF4756DF: arrow::(anonymous 
> namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
> (memory_pool.cc:365)
> ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) 
> (memory_pool.cc:557)
> ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
> ==10301==by 0xF041EC2: arrow::Status 
> GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned 
> char**)::{lambda()#1} const&) (memorypool.cpp:46)
> ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
> (memorypool.cpp:28)
> ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) 
> (memory_pool.cc:921)
> ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
> (memory_pool.cc:945)
> ==10301==by 0xF478A74: ResizePoolBuffer, 
> std::unique_ptr > (memory_pool.cc:984)
> ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
> (memory_pool.cc:992)
> ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
> (buffer.cc:174)
> ==10301==by 0xF38CC77: arrow::(anonymous 
> namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > 
> const&, arrow::MemoryPool*, std::shared_ptr*) 
> (concatenate.cc:81)
> ==10301== 
>   test-dataset.R:852:3 [success]
> {code}
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> It surfaced with 
> https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0
> Though it could be from: 
> https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0
>  which added some code to make a source node from the C-Data interface.
> However, the first call looks like it might be the line 
> https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15656) [C++] [R] Make valgrind builds slightly quicker

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15656:
---
Description: 
It looks like these specific errors have been resolved in other tickets. In the 
process of isolating the issue, I found that we actually were building arrow 
twice in the build. So I've repurposed this PR to remove the extraneous build.

=

This is currently failing on our valgrind nightly:

{code}
==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0

Though it could be from: 
https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 
which added some code to make a source node from the C-Data interface.

However, the first call looks like it might be the line 
https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437
 

  was:
This is currently failing on our valgrind nightly:

{code}
==10301==by 0x49A2184: bcEval (eval.c:7107)
==10301==by 0x498DBC8: Rf_eval (eval.c:748)
==10301==by 0x4990937: R_execClosure (eval.c:1918)
==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844)
==10301==  Uninitialised value was created by a heap allocation
==10301==at 0x483E0F0: memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0x483E212: posix_memalign (in 
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==10301==by 0xF4756DF: arrow::(anonymous 
namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) 
(memory_pool.cc:365)
==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) 
(memory_pool.cc:557)
==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1}::operator()() const (memorypool.cpp:28)
==10301==by 0xF041EC2: arrow::Status 
GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned 
char**)::{lambda()#1} const&) (memorypool.cpp:46)
==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) 
(memorypool.cpp:28)
==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921)
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) 
(memory_pool.cc:945)
==10301==by 0xF478A74: ResizePoolBuffer, 
std::unique_ptr > (memory_pool.cc:984)
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) 
(memory_pool.cc:992)
==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) 
(buffer.cc:174)
==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81)
==10301== 
  test-dataset.R:852:3 [success]
{code}

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

It surfaced with 
https://github.com/apache/arrow/commit/

[jira] [Updated] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16007:
---
Component/s: R

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{{}grepl{}}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{{}grepl{}}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3        FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3           NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16034:
--

Assignee: Andy Teucher

> [R] should bindings for grepl etc emit warnings matching those in base R 
> functions
> --
>
> Key: ARROW-16034
> URL: https://issues.apache.org/jira/browse/ARROW-16034
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Assignee: Andy Teucher
>Priority: Minor
>
> {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit 
> a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in 
> the [PR|https://github.com/apache/arrow/pull/12711] for this 
> [issue|https://issues.apache.org/jira/browse/ARROW-16007], wondering if this 
> should be mimicked in the arrow bindings as well? [~jonkeane] requested I 
> open this here for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16007:
--

Assignee: Andy Teucher

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Assignee: Andy Teucher
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{grepl}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3  <NA>      FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3  <NA>         NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> <bool>
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> Array
> #> <bool>
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16007.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12711
[https://github.com/apache/arrow/pull/12711]

> [R] binding for grepl has different behaviour with NA compared to R base grepl
> --
>
> Key: ARROW-16007
> URL: https://issues.apache.org/jira/browse/ARROW-16007
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Assignee: Andy Teucher
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The arrow binding to {{grepl}} behaves slightly differently than the base R 
> {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base 
> {{grepl}} returns {{FALSE}} with NA inputs. arrow's implementation is 
> consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and 
> {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in 
> arrow.
> I don't know if this is something you would want to change so that the 
> {{grepl}} behaviour aligns with base {{grepl}}, or simply document this 
> difference?
> Reprex:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE, quietly = TRUE)
> library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
> library(stringr, quietly = TRUE)
> alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
> alpha_dataset <- InMemoryDataset$create(alpha_df)
> mutate(alpha_df, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a"))
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3  <NA>      FALSE           NA
> mutate(alpha_dataset, 
>        grepl_is_a = grepl("a", alpha), 
>        stringr_is_a = str_detect(alpha, "a")) |> 
>   collect()
> #>   alpha grepl_is_a stringr_is_a
> #> 1 alpha       TRUE         TRUE
> #> 2   bet      FALSE        FALSE
> #> 3  <NA>         NA           NA
> # base R grepl returns FALSE for NA
> grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
> #> [1]  TRUE FALSE FALSE
> grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
> #> [1]  TRUE FALSE FALSE
> # stringr::str_detect returns NA for NA
> str_detect(alpha_df$alpha, "a")
> #> [1]  TRUE FALSE    NA
> alpha_array <- Array$create(alpha_df$alpha)
> # arrow functions return null for null (NA)
> call_function("match_substring_regex", alpha_array, options = list(pattern = 
> "a"))
> #> Array
> #> <bool>
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> call_function("match_substring", alpha_array, options = list(pattern = "a"))
> #> <bool>
> #> 
> #> [
> #>   true,
> #>   false,
> #>   null
> #> ]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions

2022-03-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16034:
--

Assignee: (was: Andy Teucher)

> [R] should bindings for grepl etc emit warnings matching those in base R 
> functions
> --
>
> Key: ARROW-16034
> URL: https://issues.apache.org/jira/browse/ARROW-16034
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>
> {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit 
> a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in 
> the [PR|https://github.com/apache/arrow/pull/12711] for this 
> [issue|https://issues.apache.org/jira/browse/ARROW-16007], wondering if this 
> should be mimicked in the arrow bindings as well? [~jonkeane] requested I 
> open this here for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions

2022-03-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512645#comment-17512645
 ] 

Jonathan Keane commented on ARROW-16034:


This is definitely nice to give folks a heads-up that one of the arguments is 
being ignored. Looking at the implementation for {{grepl}}, it wouldn't take 
much overhead at all to add a warning when those two arguments are given. I 
would be in favor of doing this to be extra clear and helpful. I'm curious if 
[~icook] has any reason this didn't or wouldn't work from the original 
implementation?
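
A minimal sketch of what such a check could look like. Assumptions: the wrapper name {{arrow_grepl}} and the exact warning text are illustrative rather than the actual arrow binding, and it relies on the {{ignore_case}} option of {{match_substring}}/{{match_substring_regex}}.

{code:r}
library(arrow, warn.conflicts = FALSE)

# Illustrative only: mimic base R, which warns when both ignore.case = TRUE
# and fixed = TRUE are supplied, then proceeds as if ignore.case were FALSE.
arrow_grepl <- function(pattern, x, ignore.case = FALSE, fixed = FALSE) {
  if (ignore.case && fixed) {
    warning("argument 'ignore.case = TRUE' will be ignored")
    ignore.case <- FALSE
  }
  fun <- if (fixed) "match_substring" else "match_substring_regex"
  call_function(
    fun,
    Array$create(x),
    options = list(pattern = pattern, ignore_case = ignore.case)
  )
}

arrow_grepl("A", c("alpha", "bet", NA), ignore.case = TRUE, fixed = TRUE)
# warns, then matches "A" literally (case-sensitively) against each element
{code}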

> [R] should bindings for grepl etc emit warnings matching those in base R 
> functions
> --
>
> Key: ARROW-16034
> URL: https://issues.apache.org/jira/browse/ARROW-16034
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andy Teucher
>Priority: Minor
>
> {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit 
> a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in 
> the [PR|https://github.com/apache/arrow/pull/12711] for this 
> [issue|https://issues.apache.org/jira/browse/ARROW-16007], wondering if this 
> should be mimicked in the arrow bindings as well? [~jonkeane] requested I 
> open this here for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16047) [C++] Option for match_substring* to return NULL on NULL input

2022-03-28 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16047:
--

 Summary: [C++] Option for match_substring* to return NULL on NULL 
input
 Key: ARROW-16047
 URL: https://issues.apache.org/jira/browse/ARROW-16047
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jonathan Keane


We implemented this in the R bindings as a wrapper in ARROW-16007 (and there's 
even a start of an implementation at 
https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f 
provided by [~ateucher] )

This should be done for {{match_substring}} and {{match_substring_regex}} 
(any others?)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15857) [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)

2022-03-28 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513469#comment-17513469
 ] 

Jonathan Keane commented on ARROW-15857:


OH OH OH yes, of course it's that! Now I feel silly I didn't catch it: I saw 
in the logs that it's a linking issue, but didn't connect that through to 
{{LDFLAGS}}. Thanks for the catch!

> [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)
> --
>
> Key: ARROW-15857
> URL: https://issues.apache.org/jira/browse/ARROW-15857
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Starting 2022-03-03, we get a failure on the rhub/fedora-clang-devel nightly 
> build. It seems to be a linking error but nothing in the sass package seems 
> to have changed for some time (last update May 2021).
> https://github.com/ursacomputing/crossbow/runs/5444005154?check_suite_focus=true#step:5:3007
> Build log for the sass package:
> {noformat}
> #14 1099.2 make[1]: Entering directory 
> '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src'
> #14 1099.2 /opt/R-devel/lib64/R/share/make/shlib.mk:18: warning: overriding 
> recipe for target 'shlib-clean'
> #14 1099.2 Makevars:12: warning: ignoring old recipe for target 'shlib-clean'
> #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG 
> -I./libsass/include  -I/usr/local/include   -fpic  -g -O2  -c compile.c -o 
> compile.o
> #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG 
> -I./libsass/include  -I/usr/local/include   -fpic  -g -O2  -c init.c -o init.o
> #14 1099.2 MAKEFLAGS= CC="/usr/bin/clang" CFLAGS="-g -O2 " 
> CXX="/usr/bin/clang++ -std=gnu++14 -stdlib=libc++" AR="ar" 
> LDFLAGS="-L/usr/local/lib64" make -C libsass
> #14 1099.2 make[2]: Entering directory 
> '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src/libsass'
> #14 1099.2 /usr/bin/clang -g -O2  -O2 -I ./include  -fPIC -c -o src/cencode.o 
> src/cencode.c
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast.o src/ast.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_values.o src/ast_values.cpp
> #14 1099.2 src/ast_values.cpp:484:23: warning: loop variable 'numerator' 
> creates a copy from type 'const std::__1::basic_string<char, 
> std::__1::char_traits<char>, std::__1::allocator<char> >' 
> [-Wrange-loop-construct]
> #14 1099.2   for (const auto numerator : numerators)
> #14 1099.2   ^
> #14 1099.2 src/ast_values.cpp:484:12: note: use reference type 'const 
> std::__1::basic_string<char, std::__1::char_traits<char>, 
> std::__1::allocator<char> > &' to prevent copying
> #14 1099.2   for (const auto numerator : numerators)
> #14 1099.2^~
> #14 1099.2   &
> #14 1099.2 src/ast_values.cpp:486:23: warning: loop variable 'denominator' 
> creates a copy from type 'const std::__1::basic_string<char, 
> std::__1::char_traits<char>, std::__1::allocator<char> >' 
> [-Wrange-loop-construct]
> #14 1099.2   for (const auto denominator : denominators)
> #14 1099.2   ^
> #14 1099.2 src/ast_values.cpp:486:12: note: use reference type 'const 
> std::__1::basic_string<char, std::__1::char_traits<char>, 
> std::__1::allocator<char> > &' to prevent copying
> #14 1099.2   for (const auto denominator : denominators)
> #14 1099.2^~~~
> #14 1099.2   &
> #14 1099.2 2 warnings generated.
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_supports.o src/ast_supports.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_cmp.o src/ast_sel_cmp.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_unify.o src/ast_sel_unify.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_super.o src/ast_sel_super.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_sel_weave.o src/ast_sel_weave.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/ast_selectors.o src/ast_selectors.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/context.o src/context.cpp
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 
> -I ./include  -fPIC -c -o src/constants.o src/constants.cpp
> #14 1099.2 /usr/bin/clang

[jira] [Resolved] (ARROW-15814) [R] Improve documentation for cast()

2022-03-28 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-15814.

Fix Version/s: 8.0.0
   Resolution: Fixed

Issue resolved by pull request 12546
[https://github.com/apache/arrow/pull/12546]

> [R] Improve documentation for cast()
> 
>
> Key: ARROW-15814
> URL: https://issues.apache.org/jira/browse/ARROW-15814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Originated in the 
> [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465]
>  for ARROW-14820.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15814) [R] Improve documentation for cast()

2022-03-28 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15814:
--

Assignee: Jonathan Keane

> [R] Improve documentation for cast()
> 
>
> Key: ARROW-15814
> URL: https://issues.apache.org/jira/browse/ARROW-15814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Originated in the 
> [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465]
>  for ARROW-14820.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15814) [R] Improve documentation for cast()

2022-03-28 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15814:
--

Assignee: SHIMA Tatsuya  (was: Jonathan Keane)

> [R] Improve documentation for cast()
> 
>
> Key: ARROW-15814
> URL: https://issues.apache.org/jira/browse/ARROW-15814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Originated in the 
> [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465]
>  for ARROW-14820.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16047) [C++] Option for match_substring* to return NULL on NULL input

2022-03-28 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513539#comment-17513539
 ] 

Jonathan Keane commented on ARROW-16047:


I created this ticket as a (possible) follow-on and for discussion about how 
widespread this behavior is + to see if it's feasible to do this in C++.

If you're still interested in digging more into the C++ code here, feel free to 
assign this ticket + send a PR [~ateucher] (but no pressure!).

> [C++] Option for match_substring* to return NULL on NULL input
> --
>
> Key: ARROW-16047
> URL: https://issues.apache.org/jira/browse/ARROW-16047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
>
> We implemented this in the R bindings as a wrapper in ARROW-16007 (and 
> there's even a start of an implementation at 
> https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f
>  provided by [~ateucher] )
> This should be done for {{match_substring}} and {{match_substring_regex}} 
> (any others?)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16047) [C++] Option for match_substring* to return false on NULL input

2022-03-28 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513650#comment-17513650
 ] 

Jonathan Keane commented on ARROW-16047:


Ah yes, of course. I've updated the title there

> [C++] Option for match_substring* to return false on NULL input
> ---
>
> Key: ARROW-16047
> URL: https://issues.apache.org/jira/browse/ARROW-16047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
>
> We implemented this in the R bindings as a wrapper in ARROW-16007 (and 
> there's even a start of an implementation at 
> https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f
>  provided by [~ateucher] )
> This should be done for {{match_substring}} and {{match_substring_regex}} 
> (any others?)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-16047) [C++] Option for match_substring* to return false on NULL input

2022-03-28 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16047:
---
Summary: [C++] Option for match_substring* to return false on NULL input  
(was: [C++] Option for match_substring* to return NULL on NULL input)

> [C++] Option for match_substring* to return false on NULL input
> ---
>
> Key: ARROW-16047
> URL: https://issues.apache.org/jira/browse/ARROW-16047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
>
> We implemented this in the R bindings as a wrapper in ARROW-16007 (and 
> there's even a start of an implementation at 
> https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f
>  provided by [~ateucher] )
> This should be done for {{match_substring}} and {{match_substring_regex}} 
> (any others?)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16052) [R] undefined global function %>%

2022-03-28 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16052:
--

 Summary: [R] undefined global function %>%
 Key: ARROW-16052
 URL: https://issues.apache.org/jira/browse/ARROW-16052
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane
Assignee: Jonathan Keane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

