[jira] [Created] (ARROW-15529) [C++] Add rows scanned to open telemetry / profiling
Jonathan Keane created ARROW-15529: -- Summary: [C++] Add rows scanned to open telemetry / profiling Key: ARROW-15529 URL: https://issues.apache.org/jira/browse/ARROW-15529 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jonathan Keane Although it's not described at https://duckdb.org/dev/profiling, DuckDB's profiling does include rows scanned, and it was helpful in looking at their scans to see the benefits of predicate pushdown -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15539) [Archery] Add ARROW_JEMALLOC to build options
[ https://issues.apache.org/jira/browse/ARROW-15539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15539. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12325 [https://github.com/apache/arrow/pull/12325] > [Archery] Add ARROW_JEMALLOC to build options > - > > Key: ARROW-15539 > URL: https://issues.apache.org/jira/browse/ARROW-15539 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Elena Henderson >Assignee: Elena Henderson >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Steps to reproduce: > > {code:java} > $ export ARROW_JEMALLOC=OFF > $ archery benchmark run --repetitions 1 > -- Building using CMake version: 3.21.3 > -- The C compiler identification is Clang 11.1.0 > -- The CXX compiler identification is Clang 11.1.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: > /Users/voltrondata/miniconda3/envs/arrow-commit/bin/arm64-apple-darwin20.0.0-clang > - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: > /Users/voltrondata/miniconda3/envs/arrow-commit/bin/arm64-apple-darwin20.0.0-clang++ > - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 7.0.0 (full: '7.0.0-SNAPSHOT') > -- Arrow SO version: 700 (full: 700.0.0) > -- clang-tidy 12 not found > -- clang-format 12 not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: > /Users/voltrondata/miniconda3/envs/arrow-commit/bin/python3.8 (found version > "3.8.12") found components: Interpreter > -- Using ccache: /Users/voltrondata/miniconda3/envs/arrow-commit/bin/ccache > -- Found cpplint executable at > 
/opt/homebrew/var/buildkite-agent/builds/test-mac-arm/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/arrow/cpp/build-support/cpplint.py > -- System processor: arm64 > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success > -- Arrow build warning level: PRODUCTION > Configured for RELEASE build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: RELEASE > -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT > -- Performing Test CXX_LINKER_SUPPORTS_VERSION_SCRIPT - Failed > -- Using CONDA approach to find dependencies > -- Using CONDA_PREFIX for ARROW_PACKAGE_PREFIX: > /Users/voltrondata/miniconda3/envs/arrow-commit > -- Setting (unset) dependency *_ROOT variables: > /Users/voltrondata/miniconda3/envs/arrow-commit > -- ARROW_ABSL_BUILD_VERSION: 20210324.2 > -- ARROW_ABSL_BUILD_SHA256_CHECKSUM: > 59b862f50e710277f8ede96f083a5bb8d7c9595376146838b9580be90374ee1f > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.133 > -- ARROW_AWSSDK_BUILD_SHA256_CHECKSUM: > d6c495bc06be5e21dac716571305d77437e7cfd62a2226b8fe48d9ab5785a8d6 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.12 > -- ARROW_AWS_CHECKSUMS_BUILD_SHA256_CHECKSUM: > 394723034b81cc7cd528401775bc7aca2b12c7471c92350c80a0e2fb9d2909fe > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.6.9 > -- ARROW_AWS_C_COMMON_BUILD_SHA256_CHECKSUM: > 928a3e36f24d1ee46f9eec360ec5cebfe8b9b8994fe39d4fa74ff51aebb12717 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_SHA256_CHECKSUM: > f1b423a487b5d6dca118bfc0d0c6cc596dc476b282258a3228e73a8f730422d4 > -- ARROW_BOOST_BUILD_VERSION: 1.75.0 > -- ARROW_BOOST_BUILD_SHA256_CHECKSUM: > 267e04a7c0bfe85daf796dedc789c3a27a76707e1c968f0a2a87bb96331e2b61 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.9 > -- ARROW_BROTLI_BUILD_SHA256_CHECKSUM: > f9e8d81d0405ba66d181529af42a3354f838c939095ff99930da6aa9cdf6fe46 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_BZIP2_BUILD_SHA256_CHECKSUM: > 
ab5a03176ee106d3f0fa90e381da478ddae405918153cca248e682cd0c4a2269 > -- ARROW_CARES_BUILD_VERSION: 1.17.2 > -- ARROW_CARES_BUILD_SHA256_CHECKSUM: > 4803c844ce20ce510ef0eb83f8ea41fa24ecaae9d280c468c582d2bb25b3913d > -- ARROW_CRC32C_BUILD_VERSION: 1.1.2 > -- ARROW_CRC32C_BUILD_SHA256_CHECKSUM: > ac07840513072b7fcebda6e821068aa04889018f24e10e46181068fb214d7e56 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.6.0 > -- ARROW_GBENCHMARK_BUILD_SHA256_CHECKSUM: > 1f71c72ce08d2c1310011ea6436b31e39ccab8c2db94186d26657d41747c85d6 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GFLAGS_BUILD_SHA256_CHECKSUM: > 34af2f15cf7367513b352bdcd2493ab14ce43692d2dcd9dfc499492966c64dcf > -- ARROW_GLOG_BUILD_VERSION: v0.5.0 > -- ARROW_GLOG_BUILD_SHA256_CHECKSUM: > eede71f28371bf39aa69b45de23b329d37214016e2055269b3b5e7cf
[jira] [Resolved] (ARROW-15480) [R] Expand on schema/colnames mismatch error messages
[ https://issues.apache.org/jira/browse/ARROW-15480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15480. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12277 [https://github.com/apache/arrow/pull/12277] > [R] Expand on schema/colnames mismatch error messages > - > > Key: ARROW-15480 > URL: https://issues.apache.org/jira/browse/ARROW-15480 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > In ARROW-14744 extra checks were added for when {{open_dataset()}} is used > and there are conflicts between column names from the schema vs. passed in > explicitly - we should expand on the messaging and tests for the different > possible cases here. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14169) [R] altrep for factors
[ https://issues.apache.org/jira/browse/ARROW-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14169. Resolution: Fixed Issue resolved by pull request 11738 [https://github.com/apache/arrow/pull/11738] > [R] altrep for factors > -- > > Key: ARROW-14169 > URL: https://issues.apache.org/jira/browse/ARROW-14169 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Romain Francois >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
Jonathan Keane created ARROW-15605: -- Summary: [CI] [R] Keep using old macos runners on our autobrew CI job Key: ARROW-15605 URL: https://issues.apache.org/jira/browse/ARROW-15605 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15605: -- Assignee: Jonathan Keane > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14745) [R] Enable true duckdb streaming
[ https://issues.apache.org/jira/browse/ARROW-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14745. Resolution: Fixed Issue resolved by pull request 11730 [https://github.com/apache/arrow/pull/11730] > [R] Enable true duckdb streaming > > > Key: ARROW-14745 > URL: https://issues.apache.org/jira/browse/ARROW-14745 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15570) [CI][Nightly] Drop centos-8 R nightly job
[ https://issues.apache.org/jira/browse/ARROW-15570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15570. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12337 [https://github.com/apache/arrow/pull/12337] > [CI][Nightly] Drop centos-8 R nightly job > - > > Key: ARROW-15570 > URL: https://issues.apache.org/jira/browse/ARROW-15570 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It has started failing since CentOS 8 went EOL. Followup to ARROW-15038, > which removed all of the other CentOS 8 jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15606) [CI] [R] Add brew release build
Jonathan Keane created ARROW-15606: -- Summary: [CI] [R] Add brew release build Key: ARROW-15606 URL: https://issues.apache.org/jira/browse/ARROW-15606 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15606: --- Summary: [CI] [R] Add brew build that exercises the R package (was: [CI] [R] Add brew release build) > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15605. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12363 [https://github.com/apache/arrow/pull/12363] > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15616) [R] [CI] Fail the build if there is a documentation mismatch?
Jonathan Keane created ARROW-15616: -- Summary: [R] [CI] Fail the build if there is a documentation mismatch? Key: ARROW-15616 URL: https://issues.apache.org/jira/browse/ARROW-15616 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane We want to make sure that we are documenting as we go. One possible solution is to add a CI job that fails the build if roxygenize() writes anything new out (h/t Neal) -- This message was sent by Atlassian Jira (v8.20.1#820001)
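A check like the one described could be sketched roughly as below (an illustration only, not from the issue: the step name, the use of GitHub Actions, and the `r/` package path are assumptions):

```yaml
# Hypothetical CI step: regenerate the docs, then fail the build if
# roxygenize() produced anything that is not already checked in.
- name: Check documentation is up to date
  run: |
    Rscript -e 'roxygen2::roxygenize("r")'
    git diff --exit-code -- r/man r/NAMESPACE
```

`git diff --exit-code` returns a nonzero status when there are uncommitted changes, which is what makes the job fail on a documentation mismatch.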
[jira] [Resolved] (ARROW-15020) [R] Add bindings for new dataset writing options
[ https://issues.apache.org/jira/browse/ARROW-15020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15020. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12185 [https://github.com/apache/arrow/pull/12185] > [R] Add bindings for new dataset writing options > > > Key: ARROW-15020 > URL: https://issues.apache.org/jira/browse/ARROW-15020 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Added a child task to do R bindings separately. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15654) TPC-H Data Generator Node
[ https://issues.apache.org/jira/browse/ARROW-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490418#comment-17490418 ] Jonathan Keane commented on ARROW-15654: ARROW-3998 also exists and has some discussion there — any reason to not use that to track this work? > TPC-H Data Generator Node > - > > Key: ARROW-15654 > URL: https://issues.apache.org/jira/browse/ARROW-15654 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Sasha Krassovsky >Assignee: Sasha Krassovsky >Priority: Major > > To aid with benchmarking and profiling the engine on TPC-H, I'd like to build > an arrow-native TPC-H data generator scan node, and then later implement exec > plans for each TPC-H query in a benchmark program. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15654) TPC-H Data Generator Node
[ https://issues.apache.org/jira/browse/ARROW-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490418#comment-17490418 ] Jonathan Keane edited comment on ARROW-15654 at 2/10/22, 6:18 PM: -- ARROW-3998 also exists and has some discussion there — any reason to not use that to track this work? I'm happy to do the jira work to make that happen if we want to track it there, of course! was (Author: jonkeane): ARROW-3998 also exists and has some discussion there — any reason to not use that to track this work? > TPC-H Data Generator Node > - > > Key: ARROW-15654 > URL: https://issues.apache.org/jira/browse/ARROW-15654 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Sasha Krassovsky >Assignee: Sasha Krassovsky >Priority: Major > > To aid with benchmarking and profiling the engine on TPC-H, I'd like to build > an arrow-native TPC-H data generator scan node, and then later implement exec > plans for each TPC-H query in a benchmark program. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15655) [R] Find a way to make the size of our vignettes smaller
Jonathan Keane created ARROW-15655: -- Summary: [R] Find a way to make the size of our vignettes smaller Key: ARROW-15655 URL: https://issues.apache.org/jira/browse/ARROW-15655 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Jonathan Keane Or move them such that they are on our website, but not shipped with the source package. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15656) [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Description: This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} 
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 It surfaced with https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 though likely was from https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 was: This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > 
const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} It surfaced with https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 though likely was from https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > [R] Valgrind error with C-data interface > > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jonathan Keane >Priority: Major > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937:
[jira] [Created] (ARROW-15656) [R] Valgrind error with C-data interface
Jonathan Keane created ARROW-15656: -- Summary: [R] Valgrind error with C-data interface Key: ARROW-15656 URL: https://issues.apache.org/jira/browse/ARROW-15656 Project: Apache Arrow Issue Type: Improvement Reporter: Jonathan Keane This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} It surfaced with 
https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 though likely was from https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15656) [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Description: This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} 
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 It surfaced with https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 Though it could be from: https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 which added some code to make a source node from the C-Data interface. However, the first call looks like it might be the line https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 was: This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) 
==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 It surfaced with https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 though likely was from https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > [R] Valgrind error with C-data interface > > >
[jira] [Updated] (ARROW-15656) [C++] [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Summary: [C++] [R] Valgrind error with C-data interface (was: [R] Valgrind error with C-data interface) > [C++] [R] Valgrind error with C-data interface > -- > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jonathan Keane >Priority: Major > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937: R_execClosure (eval.c:1918) > ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) > ==10301== Uninitialised value was created by a heap allocation > ==10301==at 0x483E0F0: memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0x483E212: posix_memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0xF4756DF: arrow::(anonymous > namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) > (memory_pool.cc:365) > ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) > (memory_pool.cc:557) > ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) > ==10301==by 0xF041EC2: arrow::Status > GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1} const&) (memorypool.cpp:46) > ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) > (memorypool.cpp:28) > ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) > (memory_pool.cc:921) > ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) > (memory_pool.cc:945) > ==10301==by 0xF478A74: ResizePoolBuffer, > std::unique_ptr > (memory_pool.cc:984) > ==10301==by 0xF478A74: 
arrow::AllocateBuffer(long, arrow::MemoryPool*) > (memory_pool.cc:992) > ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) > (buffer.cc:174) > ==10301==by 0xF38CC77: arrow::(anonymous > namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > > const&, arrow::MemoryPool*, std::shared_ptr*) > (concatenate.cc:81) > ==10301== > test-dataset.R:852:3 [success] > {code} > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 > It surfaced with > https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 > Though it could be from: > https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > which added some code to make a source node from the C-Data interface. > However, the first call looks like it might be the line > https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15656) [C++] [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Component/s: C++ R > [C++] [R] Valgrind error with C-data interface > -- > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Jonathan Keane >Priority: Major > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937: R_execClosure (eval.c:1918) > ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) > ==10301== Uninitialised value was created by a heap allocation > ==10301==at 0x483E0F0: memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0x483E212: posix_memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0xF4756DF: arrow::(anonymous > namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) > (memory_pool.cc:365) > ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) > (memory_pool.cc:557) > ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) > ==10301==by 0xF041EC2: arrow::Status > GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1} const&) (memorypool.cpp:46) > ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) > (memorypool.cpp:28) > ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) > (memory_pool.cc:921) > ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) > (memory_pool.cc:945) > ==10301==by 0xF478A74: ResizePoolBuffer, > std::unique_ptr > (memory_pool.cc:984) > ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) > 
(memory_pool.cc:992) > ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) > (buffer.cc:174) > ==10301==by 0xF38CC77: arrow::(anonymous > namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > > const&, arrow::MemoryPool*, std::shared_ptr*) > (concatenate.cc:81) > ==10301== > test-dataset.R:852:3 [success] > {code} > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 > It surfaced with > https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 > Though it could be from: > https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > which added some code to make a source node from the C-Data interface. > However, the first call looks like it might be the line > https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15299) [R] investigate {remotes} dependencies "soft" vs TRUE
[ https://issues.apache.org/jira/browse/ARROW-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490581#comment-17490581 ] Jonathan Keane commented on ARROW-15299: Have we upstreamed these conclusions to the issue on {remotes}? > [R] investigate {remotes} dependencies "soft" vs TRUE > -- > > Key: ARROW-15299 > URL: https://issues.apache.org/jira/browse/ARROW-15299 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Although the {{remotes::install_deps()}} docs state {{dependencies == TRUE}} > is equivalent to {{{}dependencies == "soft"{}}}, I suspect {{"soft"}} is a > bit more recursive than {{{}TRUE{}}}, resulting in the installation of many > more packages. > {quote}TRUE is shorthand for "Depends", "Imports", "LinkingTo" and > "Suggests". NA is shorthand for "Depends", "Imports" and "LinkingTo" and is > the default. FALSE is shorthand for no dependencies (i.e. just check this > package, not its dependencies). > The value "soft" means the same as TRUE, "hard" means the same as NA. > {quote} > I noticed, when using {{dependencies = "soft"}}, that my session was being > clogged up by package installations lasting well over 40 minutes. > It would be good to time box this to a couple of hours. > The direction in which I would go would be to understand if there is any > difference in the size of the dependency trees + come up with a minimal > reproducible example. > Edit (12 January, 2021): the output could be a table comparing > {code:r} > remotes::install_deps(dependencies = TRUE) > remotes::install_deps(dependencies = "hard") > remotes::install_deps(dependencies = c("hard", "Config...")) > remotes::install_deps(dependencies = "soft") > remotes::install_deps(dependencies = c("soft", "Config...")) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
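Per the {{remotes}} docs quoted above, the shorthands expand to fixed sets of DESCRIPTION fields. A tiny helper makes the claimed TRUE/"soft" equivalence concrete (the helper name is ours, and it models only the field expansion, not the recursion behavior that the issue is actually about):

```r
# Expand remotes' dependency shorthands into the DESCRIPTION fields they
# name, following the documentation quoted in the issue above.
expand_deps <- function(x) {
  base <- c("Depends", "Imports", "LinkingTo")
  if (isTRUE(x) || identical(x, "soft")) return(c(base, "Suggests"))
  if (identical(x, NA) || identical(x, "hard")) return(base)
  if (isFALSE(x)) return(character())
  x  # already an explicit vector of field names
}

# TRUE and "soft" name the same fields, so any difference in what gets
# installed must come from recursion over Suggests, not the field sets:
identical(expand_deps(TRUE), expand_deps("soft"))  # TRUE
```

If the field sets are identical, the 40-minute installs under "soft" point at recursion (whether Suggests of dependencies are followed), which is exactly the comparison table the issue proposes.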
[jira] [Created] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction
Jonathan Keane created ARROW-15664: -- Summary: [C++] parquet reader Segfaults with illegal SIMD instruction Key: ARROW-15664 URL: https://issues.apache.org/jira/browse/ARROW-15664 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Jonathan Keane Fix For: 7.0.1, 8.0.0 When compiling with {{-Os}} (or with release type {{MinSizeRel}}) and running parquet tests (in R at least, though I imagine the pyarrow and C++ tests will have the same issue!), we get a segfault with an illegal opcode on systems that don't have BMI2 available when trying to read parquet files. (It turns out the github runners for macos don't have BMI2, so this is easily testable there!) Somehow the optimization, combined with the way our runtime detection code works, makes the runtime detection we normally use for this fail (though it works just fine with {{-O2}}, {{-O3}}, etc.). When diagnosing this, I created a branch + PR that runs our R tests after installing from brew, which can reliably cause this to happen: https://github.com/apache/arrow/pull/12364 Other test suites that exercise parquet reading would probably have the same issue (or even C++ tests built with {{-Os}}).
Here's a coredump: {code} 2491 Thread_829819 + 2491 thread_start (in libsystem_pthread.dylib) + 15 [0x7ff801c3e00f] + 2491 _pthread_start (in libsystem_pthread.dylib) + 125 [0x7ff801c424f4] + 2491 void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) (in arrow.so) + 380 [0x109203749] + 2491 arrow::internal::FnOnce::operator()() && (in arrow.so) + 26 [0x109201f30] + 2491 arrow::internal::FnOnce::FnImpl >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4&, unsigned long&, std::__1::shared_ptr > >::invoke() (in arrow.so) + 98 [0x108f125c2] + 2491 parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4::operator()(unsigned long, std::__1::shared_ptr) const (in arrow.so) + 47 [0x108f11ed5] + 2491 parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector > const&, parquet::arrow::ColumnReader*, std::__1::shared_ptr*) (in arrow.so) + 273 [0x108f0c037] + 2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, std::__1::shared_ptr*) (in arrow.so) + 39 [0x108f0733b] + 2491 parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long long) (in arrow.so) + 137 [0x108f0794b] + 2491 parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long long) (in arrow.so) + 442 [0x108f4f53e] + 2491 parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long long) (in arrow.so) + 471 [0x108f50503] + 2491 void parquet::internal::standard::DefLevelsToBitmapSimd(short const*, long long, parquet::internal::LevelInfo, parquet::internal::ValidityBitmapInputOutput*) (in arrow.so) + 250 [0x108fc2a5a] + 2491 long long parquet::internal::standard::DefLevelsBatchToBitmap(short const*, long long, long long, 
parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*) (in arrow.so) + 63 [0x108fc34da] + 2491 ??? (in ) [0x61354518] + 2491 _sigtramp (in libsystem_platform.dylib) + 29 [0x7ff801c57e2d] + 2491 sigactionSegv (in libR.dylib) + 649 [0x1042598c9] main.c:625 + 2491 Rstd_ReadConsole (in libR.dylib) + 2042 [0x10435160a] sys-std.c:1044 + 2491 R_SelectEx (in libR.dylib) + 308 [0x104350854] sys-std.c:178 + 2491 __select (in libsystem_kernel.dylib) + 10 [0x7ff801c0de4a] {code} And then a disassembly (where you can see a SHLX that shouldn't be there): {code}
Dump of assembler code from 0x13ac6db00 to 0x13ac6db99ff:
...
   0x00013ac6db82: mov    $0x8,%ecx
   0x00013ac6db87: sub    %rax,%rcx
   0x00013ac6db8a: lea    0xf1520b(%rip),%rdi   # 0x13bb82d9c
   0x00013ac6db91: movzbl (%rcx,%rdi,1),%edi
   0x00013ac6db95: mov    %esi,%ebx
   0x00013ac6db97: and    %edi,%ebx
=> 0x00013ac6db99: shlx   %rax,%rbx,%rax
   0x00013ac6db9e: or     0x18(%r15),%al
   0x00013ac6dba2: mov    %al,0x18(%r15)
   0x00013ac6dba6: cmp    %rdx,%rcx
   0x00013ac6dba9: jg     0x13ac6dbf5
   0x00013ac6dbab: mov    %al,(%
[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction
[ https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490940#comment-17490940 ] Jonathan Keane commented on ARROW-15664: If you want to build with {{MinSizeRel}}, you'll need to add {code}
elseif("${CMAKE_BUILD_TYPE}" STREQUAL "MINSIZEREL")
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${C_FLAGS_MINSIZEREL}")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_MINSIZEREL}")
{code} to https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/cmake_modules/SetupCxxFlags.cmake#L633-L647 for that to work > [C++] parquet reader Segfaults with illegal SIMD instruction > - > > Key: ARROW-15664 > URL: https://issues.apache.org/jira/browse/ARROW-15664 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Jonathan Keane >Priority: Critical > Fix For: 7.0.1, 8.0.0 > >
[jira] [Commented] (ARROW-15299) [R] investigate {remotes} dependencies "soft" vs TRUE
[ https://issues.apache.org/jira/browse/ARROW-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490985#comment-17490985 ] Jonathan Keane commented on ARROW-15299: What you've got here is great. It's a clear explanation of how {{c('soft', 'Config/Needs/website')}} doesn't do what we need while {{c('hard', 'Config/Needs/website')}} does, and of how {{dependencies = TRUE}} is not the same as {{dependencies = "hard"}} (which could be called out more clearly in the docs). > [R] investigate {remotes} dependencies "soft" vs TRUE > -- > > Key: ARROW-15299 > URL: https://issues.apache.org/jira/browse/ARROW-15299 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction
[ https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491139#comment-17491139 ] Jonathan Keane commented on ARROW-15664: > Can it be reproduced without Homebrew? Yes, if you build arrow with `CMAKE_BUILD_TYPE=MinSizeRel` (and possibly even with just `-Os`) you can reproduce the segfault. > [C++] parquet reader Segfaults with illegal SIMD instruction > - > > Key: ARROW-15664 > URL: https://issues.apache.org/jira/browse/ARROW-15664 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Jonathan Keane >Priority: Critical > Fix For: 7.0.1, 8.0.0 > >
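Reproducing outside of Homebrew then amounts to configuring the C++ build with the {{MinSizeRel}} build type. A sketch of the configure step (paths assumed; note {{MinSizeRel}} also needs the {{SetupCxxFlags.cmake}} addition shown earlier in this thread):

```shell
# From the root of an arrow checkout (directory layout assumed):
cmake -S cpp -B build -DCMAKE_BUILD_TYPE=MinSizeRel
cmake --build build
```

Running the parquet-reading tests against that build on a machine without BMI2 should then hit the illegal instruction.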
[jira] [Created] (ARROW-15673) [R] Error gracefully if DuckDB isn't installed
Jonathan Keane created ARROW-15673: -- Summary: [R] Error gracefully if DuckDB isn't installed Key: ARROW-15673 URL: https://issues.apache.org/jira/browse/ARROW-15673 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Jonathan Keane Assignee: Dragoș Moldovan-Grünfeld Right now, the function {{to_duckdb()}} doesn't check to confirm that DuckDB is available. The error message isn't the worst (it'll mention {{duckdb::...}} not being found), but it would be nicer to specifically tell folks they need to install the duckdb package. -- This message was sent by Atlassian Jira (v8.20.1#820001)
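A graceful check along the lines the issue asks for might look like this in R (function name and message wording are assumptions, not the actual arrow implementation):

```r
# Sketch of a friendly dependency guard: stop with an actionable message
# instead of letting a `duckdb::...` lookup fail later.
abort_if_not_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf(
        "to_duckdb() requires the %s package. Install it with: install.packages(\"%s\")",
        pkg, pkg
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A clearly-missing package now produces the friendly message instead of an
# opaque namespace-lookup error:
try(abort_if_not_installed("definitely.not.installed"))
```

Calling this at the top of {{to_duckdb()}} (before any {{duckdb::}} call) is the usual pattern for optional Suggests dependencies.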
[jira] [Resolved] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15606. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12364 [https://github.com/apache/arrow/pull/12364] > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15678) [C++][CI] a crossbow job with MinSizeRel enabled
Jonathan Keane created ARROW-15678: -- Summary: [C++][CI] a crossbow job with MinSizeRel enabled Key: ARROW-15678 URL: https://issues.apache.org/jira/browse/ARROW-15678 Project: Apache Arrow Issue Type: Improvement Components: C++, Continuous Integration Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15664) [C++] parquet reader Segfaults with illegal SIMD instruction
[ https://issues.apache.org/jira/browse/ARROW-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492197#comment-17492197 ] Jonathan Keane commented on ARROW-15664: https://github.com/apache/arrow/pull/12422 has a crossbow build that exercises {{-DCMAKE_BUILD_TYPE=MinSizeRel}} outside of brew > [C++] parquet reader Segfaults with illegal SIMD instruction > - > > Key: ARROW-15664 > URL: https://issues.apache.org/jira/browse/ARROW-15664 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Jonathan Keane >Priority: Critical > Fix For: 7.0.1, 8.0.0 > >
[jira] [Resolved] (ARROW-15013) [R] Expose concatenate at the R level
[ https://issues.apache.org/jira/browse/ARROW-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15013. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12324 [https://github.com/apache/arrow/pull/12324] > [R] Expose concatenate at the R level > - > > Key: ARROW-15013 > URL: https://issues.apache.org/jira/browse/ARROW-15013 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > Currently {{arrow::Concatenate()}} is not exposed at the R level. I imagine > the preferred way to deal with this from a user perspective is to use > {{ChunkedArray$create()}} for this; however, from a developer perspective > it's very difficult to replicate this functionality. For example, another > package using the Arrow C API might want to use the arrow R package to > concatenate record batches with complex nested types instead of implementing it > themselves. > Example usage in C++: > https://github.com/apache/arrow/blob/9cf4275a19c994879172e5d3b03ade9a96a10721/r/src/r_to_arrow.cpp#L1215 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15701) [C++][R] Should month allow integer inputs?
[ https://issues.apache.org/jira/browse/ARROW-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493283#comment-17493283 ] Jonathan Keane commented on ARROW-15701: This also should be accomplishable in the R bindings if we want to do it there quickly to unblock things before we decide/wait for the C++ implementation. Basically, somewhere in https://github.com/apache/arrow/blob/1b9e76c6b07d557249a949c7c98d00997513d5cc/r/R/dplyr-funcs-datetime.R#L108-L119 check the input type and do casting or follow a slightly different path and return an expression that creates the thing we need. > [C++][R] Should month allow integer inputs? > --- > > Key: ARROW-15701 > URL: https://issues.apache.org/jira/browse/ARROW-15701 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > In R, more specifically in {{{}lubridate{}}}, {{month()}} can be used both to > get and set the corresponding component of a date. This means {{month()}} > accepts integer inputs. > {code:r} > suppressPackageStartupMessages(library(lubridate)) > month(1:12) > #> [1] 1 2 3 4 5 6 7 8 9 10 11 12 > month(1:13) > #> Error in month.numeric(1:13): Values are not in 1:12 > {code} > Solving this would allow us to implement bindings such as `semester()` in a > manner closer to {{{}lubridate{}}}. 
> {code:r}
> suppressPackageStartupMessages(library(dplyr))
> suppressPackageStartupMessages(library(lubridate))
> test_df <- tibble(
>   month_as_int = c(1:12, NA),
>   month_as_char_pad = ifelse(month_as_int < 10, paste0("0", month_as_int), month_as_int),
>   dates = as.Date(paste0("2021-", month_as_char_pad, "-15"))
> )
> test_df %>%
>   mutate(
>     sem_date = semester(dates),
>     sem_month_as_int = semester(month_as_int))
> #> # A tibble: 13 × 5
> #>    month_as_int month_as_char_pad dates      sem_date sem_month_as_int
> #>  1            1 01                2021-01-15        1                1
> #>  2            2 02                2021-02-15        1                1
> #>  3            3 03                2021-03-15        1                1
> #>  4            4 04                2021-04-15        1                1
> #>  5            5 05                2021-05-15        1                1
> #>  6            6 06                2021-06-15        1                1
> #>  7            7 07                2021-07-15        2                2
> #>  8            8 08                2021-08-15        2                2
> #>  9            9 09                2021-09-15        2                2
> #> 10           10 10                2021-10-15        2                2
> #> 11           11 11                2021-11-15        2                2
> #> 12           12 12                2021-12-15        2                2
> #> 13           NA NA                NA               NA               NA
> {code}
> Currently, attempts to use {{month()}} with integer inputs error with:
> {code:r}
> Function 'month' has no kernel matching input types (array[int32])
> {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
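As the earlier comment suggests, this could be unblocked at the binding level with a lubridate-style range check before the date-extraction path. A plain-R sketch of that shape (names and structure are illustrative only, not the actual arrow binding):

```r
# Integer-aware month(): pass month numbers through after a range check in
# the style of lubridate's month.numeric(); otherwise extract the month
# component from a date.
month_value <- function(x) {
  if (is.numeric(x)) {
    bad <- !is.na(x) & !(x %in% 1:12)
    if (any(bad)) stop("Values are not in 1:12")
    return(as.integer(x))
  }
  as.integer(format(x, "%m"))
}

month_value(1:12)                   # passes through as 1..12
month_value(as.Date("2021-02-15"))  # extracts 2
```

In the arrow binding, the numeric branch would return (a cast of) the input expression rather than a computed vector, sidestepping the missing int32 kernel.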
[jira] [Resolved] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch
[ https://issues.apache.org/jira/browse/ARROW-15468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15468. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12366 [https://github.com/apache/arrow/pull/12366] > [R] [CI] A crossbow job that tests against DuckDB's dev branch > -- > > Key: ARROW-15468 > URL: https://issues.apache.org/jira/browse/ARROW-15468 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > It would be good to test against DuckDB's dev branch to warn us if there are > impending changes that break something. > Currently some of our jobs already do this: > https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51 > > We should clean this up so that _generally_ builds use the released DuckDB, > but we can toggle dev DuckDB (and run a separate build that optionally uses the dev > DuckDB) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer
Jonathan Keane created ARROW-15708: -- Summary: [R] [CI] skip snappy encoded parquets on clang sanitizer Key: ARROW-15708 URL: https://issues.apache.org/jira/browse/ARROW-15708 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer
[ https://issues.apache.org/jira/browse/ARROW-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15708: --- Description: Showing up in our nightlies with: {code} #17 0x7f5004603315 in arrow::Future, std::__1::allocator > > > arrow::internal::OptionalParallelForAsync, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4&, std::__1::shared_ptr, std::__1::shared_ptr >(bool, std::__1::vector, std::__1::allocator > >, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) /arrow/cpp/src/arrow/util/parallel.h:95:7 #18 0x7f50046026d9 in parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr, std::__1::vector > const&, std::__1::vector > const&, arrow::internal::Executor*) /arrow/cpp/src/parquet/arrow/reader.cc:1198:10 #19 0x7f5004601994 in parquet::arrow::RowGroupGenerator::ReadOneRowGroup(arrow::internal::Executor*, std::__1::shared_ptr, int, std::__1::vector > const&) /arrow/cpp/src/parquet/arrow/reader.cc:1090:18 #20 0x7f5004658330 in parquet::arrow::RowGroupGenerator::operator()()::'lambda'()::operator()() const /arrow/cpp/src/parquet/arrow/reader.cc:1064:14 #21 0x7f500465806f in std::__1::enable_if > ()> > >::value, void>::type arrow::detail::ContinueFuture::operator() > ()> >, arrow::Future > ()> > >(arrow::Future > ()> >, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&) const /arrow/cpp/src/arrow/util/future.h:177:9 #22 0x7f5004657c03 in void arrow::detail::ContinueFuture::IgnoringArgsIf > ()> >, arrow::internal::Empty const&>(std::__1::integral_constant, arrow::Future > ()> >&&, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&, arrow::internal::Empty const&) const /arrow/cpp/src/arrow/util/future.h:186:5 #23 0x7f500465797a in 
arrow::Future::ThenOnComplete::PassthruOnFailure >::operator()(arrow::Result const&) && /arrow/cpp/src/arrow/util/future.h:599:25 #24 0x7f5005c95bfe in arrow::internal::FnOnce::operator()(arrow::FutureImpl const&) && /arrow/cpp/src/arrow/util/functional.h:140:17 #25 0x7f5005c948f5 in arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr const&, arrow::FutureImpl::CallbackRecord&&, bool) /arrow/cpp/src/arrow/util/future.cc:298:7 #26 0x7f5005c94017 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) /arrow/cpp/src/arrow/util/future.cc:327:7 #27 0x7f50042731fe in arrow::Future::DoMarkFinished(arrow::Result) /arrow/cpp/src/arrow/util/future.h:712:14 #28 0x7f5004272df8 in void arrow::Future::MarkFinished(arrow::Status) /arrow/cpp/src/arrow/util/future.h:463:12 #29 0x7f500465244b in arrow::Future arrow::internal::Executor::DoTransfer, arrow::Status>(arrow::Future, bool)::'lambda'(arrow::Status const&)::operator()(arrow::Status const&) /arrow/cpp/src/arrow/util/thread_pool.h:217:21 #30 0x7f500465244b in arrow::Future::WrapStatusyOnComplete::Callback arrow::internal::Executor::DoTransfer, arrow::Status>(arrow::Future, bool)::'lambda'(arrow::Status const&)>::operator()(arrow::FutureImpl const&) && /arrow/cpp/src/arrow/util/future.h:509:9 #31 0x7f5005c95bfe in arrow::internal::FnOnce::operator()(arrow::FutureImpl const&) && /arrow/cpp/src/arrow/util/functional.h:140:17 #32 0x7f5005d6f147 in arrow::internal::FnOnce::operator()() && /arrow/cpp/src/arrow/util/functional.h:140:17 #33 0x7f5005d6dc65 in arrow::internal::WorkerLoop(std::__1::shared_ptr, std::__1::__list_iterator) /arrow/cpp/src/arrow/util/thread_pool.cc:178:11 #34 0x7f5005d6d67b in arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3::operator()() const /arrow/cpp/src/arrow/util/thread_pool.cc:349:7 #35 0x7f5005d6d67b in decltype(std::__1::forward(fp)()) std::__1::__invoke(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&) 
/usr/bin/../include/c++/v1/type_traits:3899:1 #36 0x7f5005d6cf9c in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/bin/../include/c++/v1/thread:291:5 #37 0x7f50182343f8 in start_thread (/lib64/libpthread.so.0+0x93f8) #38 0x7f50180774c2 in clone (/lib64/libc.so.6+0x1014c2) {code} https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19810&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=4096 > [R] [CI] skip snappy encoded parquets on clang sanitizer > > > Key: ARROW-15708 > URL: https://issues.apache.org/jira/browse/ARROW-15708 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >
[jira] [Commented] (ARROW-15726) [R] Support push-down projection/filtering in datasets / dplyr
[ https://issues.apache.org/jira/browse/ARROW-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494767#comment-17494767 ] Jonathan Keane commented on ARROW-15726: This might be coincidence (though I suspect not...). Our dataset benchmarks are suddenly failing (with at least some of the failures being caused by attempting to read data that shouldn't be read [1]). Elena has helped narrow down the range of possible commits where this could have happened: the first commit where it might have happened is https://github.com/apache/arrow/commit/a935c81b595d24179e115d64cda944efa93aa0e0, and it definitely happens in https://github.com/apache/arrow/commit/afaa92e7e4289d6e4f302cc91810368794e8092b, so the change was introduced in that commit or an earlier one. Here's an example buildkite log: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/145#ebd7ea7a-3fad-4865-9e73-49200d89ddd6/6-3230 [1] this is a bit in the weeds, so bear with me: the dataset we use for these benchmarks includes data in an early year that causes a schema failure {{Unsupported cast from string to null using function cast_null}}. The benchmarks that we wrote cleverly avoid selecting anything from this first year (so if filter pushdown is working, we don't get the error). It _has_ been working (for a while now! even with exec nodes), but about three days ago that stopped working and the benchmarks started failing. > [R] Support push-down projection/filtering in datasets / dplyr > -- > > Key: ARROW-15726 > URL: https://issues.apache.org/jira/browse/ARROW-15726 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Weston Pace >Priority: Major > > The following query should read a single column from the target parquet file. 
> {noformat} > open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) > %>% collect() > {noformat} > Furthermore, it should apply a pushdown filter to the source node allowing > parquet row groups to potentially filter out target data. > At the moment it creates the following exec plan: > {noformat} > 3:SinkNode{} > 2:ProjectNode{projection=[l_tax]} > 1:FilterNode{filter=(l_tax < 0.01)} > 0:SourceNode{} > {noformat} > There is no projection or filter in the source node. As a result we end up > reading much more data from disk (the entire file) than we need to (at most a > single column). > This _could_ be fixed via heuristics in the dplyr code. However, it may > quickly get complex (for example, the project comes after the filter, so you > need to make sure you push down a projection that includes both the columns > accessed by the filter and the columns accessed by the projection OR can you > push down the projection through a join [yes you can], how do you know which > columns to apply to which source node). > A more complete fix would be to call into some kind of 3rd party optimizer > (e.g. calcite) after the plan has been created by dplyr but before it is > passed to the execution engine. -- This message was sent by Atlassian Jira (v8.20.1#820001)
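The benefit of pushing the filter into the source node can be sketched in miniature: a Parquet-style reader keeps min/max statistics per row group and can skip any group whose statistics rule out the predicate, so it never decodes data that cannot match. The sketch below is illustrative Python with made-up names, not Arrow's scanner:

```python
# Hypothetical sketch of predicate pushdown via row-group statistics.
# A group whose minimum l_tax is already >= the bound can be skipped entirely.
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_l_tax: float
    max_l_tax: float
    rows: list  # the l_tax values stored in this group

def scan_with_pushdown(row_groups, upper_bound):
    """Return l_tax values < upper_bound, skipping unmatchable row groups."""
    out = []
    scanned = 0
    for rg in row_groups:
        # Statistics check: if every value is >= the bound, never decode it.
        if rg.min_l_tax >= upper_bound:
            continue
        scanned += 1
        out.extend(v for v in rg.rows if v < upper_bound)
    return out, scanned

groups = [
    RowGroup(0.00, 0.005, [0.0, 0.004]),
    RowGroup(0.02, 0.08, [0.02, 0.05]),    # skipped: min >= 0.01
    RowGroup(0.005, 0.03, [0.006, 0.02]),
]
values, scanned = scan_with_pushdown(groups, upper_bound=0.01)
# values == [0.0, 0.004, 0.006]; only 2 of the 3 groups were decoded
```

Without the pushdown, all three groups (the whole file) would be decoded before the FilterNode ever runs, which is exactly the extra I/O the exec plan above incurs.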
[jira] [Updated] (ARROW-15731) [C++] Enable joins when data contains a list column
[ https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15731: --- Summary: [C++] Enable joins when data contains a list column (was: Enable joins when data contains a list column) > [C++] Enable joins when data contains a list column > --- > > Key: ARROW-15731 > URL: https://issues.apache.org/jira/browse/ARROW-15731 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stephanie Hazlitt >Priority: Major > > Currently Arrow joins with data that contain a list column errors, even when > the list column is not a join key: > ``` r > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"), > jedi = c(FALSE, TRUE)) > arrow_table(starwars) %>% > left_join(jedi) %>% > collect() > #> Error in `handle_csv_read_error()`: > #> ! Invalid: Data type list is not supported in join non-key > field > ``` > The ability to join would be a useful enhancement for workflows with tabular > data where list columns can be common, and for geospatial workflows where > geometry columns are stored as `list` or `fixed_size_list` (thanks > [~paleolimbot] for mentioning that use case). > Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15731) [C++] Enable joins when data contains a list column
[ https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15731: --- Description: Currently Arrow joins with data that contain a list column errors, even when the list column is not a join key: {code} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp library(dplyr) #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"), jedi = c(FALSE, TRUE)) arrow_table(starwars) %>% left_join(jedi) %>% collect() #> Error in `handle_csv_read_error()`: #> ! Invalid: Data type list is not supported in join non-key field {code} The ability to join would be a useful enhancement for workflows with tabular data where list columns can be common, and for geospatial workflows where geometry columns are stored as `list` or `fixed_size_list` (thanks [~paleolimbot] for mentioning that use case). Related discussion here: ARROW-14519 was: Currently Arrow joins with data that contain a list column errors, even when the list column is not a join key: ``` r library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp library(dplyr) #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"), jedi = c(FALSE, TRUE)) arrow_table(starwars) %>% left_join(jedi) %>% collect() #> Error in `handle_csv_read_error()`: #> ! 
Invalid: Data type list is not supported in join non-key field ``` The ability to join would be a useful enhancement for workflows with tabular data where list columns can be common, and for geospatial workflows where geometry columns are stored as `list` or `fixed_size_list` (thanks [~paleolimbot] for mentioning that use case). Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519 > [C++] Enable joins when data contains a list column > --- > > Key: ARROW-15731 > URL: https://issues.apache.org/jira/browse/ARROW-15731 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stephanie Hazlitt >Priority: Major > > Currently Arrow joins with data that contain a list column errors, even when > the list column is not a join key: > {code} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > library(dplyr) > #> > #> Attaching package: 'dplyr' > #> The following objects are masked from 'package:stats': > #> > #> filter, lag > #> The following objects are masked from 'package:base': > #> > #> intersect, setdiff, setequal, union > jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"), > jedi = c(FALSE, TRUE)) > arrow_table(starwars) %>% > left_join(jedi) %>% > collect() > #> Error in `handle_csv_read_error()`: > #> ! Invalid: Data type list is not supported in join non-key > field > {code} > The ability to join would be a useful enhancement for workflows with tabular > data where list columns can be common, and for geospatial workflows where > geometry columns are stored as `list` or `fixed_size_list` (thanks > [~paleolimbot] for mentioning that use case). > Related discussion here: ARROW-14519 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15708) [R] [CI] skip snappy encoded parquets on clang sanitizer
[ https://issues.apache.org/jira/browse/ARROW-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15708. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12443 [https://github.com/apache/arrow/pull/12443] > [R] [CI] skip snappy encoded parquets on clang sanitizer > > > Key: ARROW-15708 > URL: https://issues.apache.org/jira/browse/ARROW-15708 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Showing up in our nightlies with: > {code} > #17 0x7f5004603315 in > arrow::Future, > std::__1::allocator > > > > arrow::internal::OptionalParallelForAsync namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr namespace)::FileReaderImpl>, std::__1::vector > > const&, std::__1::vector > const&, > arrow::internal::Executor*)::$_4&, > std::__1::shared_ptr, > std::__1::shared_ptr >(bool, > std::__1::vector, > std::__1::allocator > > >, parquet::arrow::(anonymous > namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr namespace)::FileReaderImpl>, std::__1::vector > > const&, std::__1::vector > const&, > arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) > /arrow/cpp/src/arrow/util/parallel.h:95:7 > #18 0x7f50046026d9 in parquet::arrow::(anonymous > namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr namespace)::FileReaderImpl>, std::__1::vector > > const&, std::__1::vector > const&, > arrow::internal::Executor*) /arrow/cpp/src/parquet/arrow/reader.cc:1198:10 > #19 0x7f5004601994 in > parquet::arrow::RowGroupGenerator::ReadOneRowGroup(arrow::internal::Executor*, > std::__1::shared_ptr, > int, std::__1::vector > const&) > /arrow/cpp/src/parquet/arrow/reader.cc:1090:18 > #20 0x7f5004658330 in > 
parquet::arrow::RowGroupGenerator::operator()()::'lambda'()::operator()() > const /arrow/cpp/src/parquet/arrow/reader.cc:1064:14 > #21 0x7f500465806f in > std::__1::enable_if > > ()> > >::value, void>::type > arrow::detail::ContinueFuture::operator() > arrow::Future > > ()> >, > arrow::Future > > ()> > > >(arrow::Future > > ()> >, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&) > const /arrow/cpp/src/arrow/util/future.h:177:9 > #22 0x7f5004657c03 in void > arrow::detail::ContinueFuture::IgnoringArgsIf > arrow::Future > > ()> >, arrow::internal::Empty const&>(std::__1::integral_constant true>, > arrow::Future > > ()> >&&, parquet::arrow::RowGroupGenerator::operator()()::'lambda'()&&, > arrow::internal::Empty const&) const /arrow/cpp/src/arrow/util/future.h:186:5 > #23 0x7f500465797a in > arrow::Future::ThenOnComplete > arrow::Future::PassthruOnFailure > >::operator()(arrow::Result const&) && > /arrow/cpp/src/arrow/util/future.h:599:25 > #24 0x7f5005c95bfe in arrow::internal::FnOnce const&)>::operator()(arrow::FutureImpl const&) && > /arrow/cpp/src/arrow/util/functional.h:140:17 > #25 0x7f5005c948f5 in > arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr > const&, arrow::FutureImpl::CallbackRecord&&, bool) > /arrow/cpp/src/arrow/util/future.cc:298:7 > #26 0x7f5005c94017 in > arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) > /arrow/cpp/src/arrow/util/future.cc:327:7 > #27 0x7f50042731fe in > arrow::Future::DoMarkFinished(arrow::Result) > /arrow/cpp/src/arrow/util/future.h:712:14 > #28 0x7f5004272df8 in void > arrow::Future::MarkFinished void>(arrow::Status) /arrow/cpp/src/arrow/util/future.h:463:12 > #29 0x7f500465244b in arrow::Future > arrow::internal::Executor::DoTransfer arrow::Future, > arrow::Status>(arrow::Future, > bool)::'lambda'(arrow::Status const&)::operator()(arrow::Status const&) > /arrow/cpp/src/arrow/util/thread_pool.h:217:21 > #30 0x7f500465244b in > arrow::Future::WrapStatusyOnComplete::Callback 
> arrow::internal::Executor::DoTransfer arrow::Future, > arrow::Status>(arrow::Future, > bool)::'lambda'(arrow::Status const&)>::operator()(arrow::FutureImpl const&) > && /arrow/cpp/src/arrow/util/future.h:509:9 > #31 0x7f5005c95bfe in arrow::internal::FnOnce const&)>::operator()(arrow::FutureImpl const&) && > /arrow/cpp/src/arrow/util/functional.h:140:17 > #32 0x7f5005d6f147 in arrow::internal::FnOnce::operator()() && > /arrow/cpp/src/arrow/util/functional.h:140:17 > #33 0x7f5005d6dc65 in > arrow::internal::WorkerLoop(std::__1::shared_ptr, > std::__1::__list_iterator) > /arrow/cpp/src/arrow/util/thread_pool.cc:178:11 > #34 0x7f5005d6d67b in > arrow::internal::Thread
[jira] [Resolved] (ARROW-14817) [R] Implement bindings for lubridate::tz
[ https://issues.apache.org/jira/browse/ARROW-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14817. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12357 [https://github.com/apache/arrow/pull/12357] > [R] Implement bindings for lubridate::tz > > > Key: ARROW-14817 > URL: https://issues.apache.org/jira/browse/ARROW-14817 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 7h 40m > Remaining Estimate: 0h > > This can be achieved via strftime -- This message was sent by Atlassian Jira (v8.20.1#820001)
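The strftime route mentioned in the ticket can be illustrated outside of Arrow: formatting a timezone-aware timestamp with `%Z` yields its timezone label, which is essentially what a `lubridate::tz()` binding needs to surface. This is a Python stdlib sketch, not the arrow R package's implementation:

```python
# Recovering a timezone label via strftime's %Z directive (the mechanism the
# ticket suggests). Illustrative only; not Arrow's strftime kernel.
from datetime import datetime, timezone

dt = datetime(2022, 1, 1, tzinfo=timezone.utc)
tz_label = dt.strftime("%Z")
# tz_label == "UTC"
```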
[jira] [Resolved] (ARROW-14815) [R] Implement bindings for lubridate::semester
[ https://issues.apache.org/jira/browse/ARROW-14815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14815. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12429 [https://github.com/apache/arrow/pull/12429] > [R] Implement bindings for lubridate::semester > -- > > Key: ARROW-14815 > URL: https://issues.apache.org/jira/browse/ARROW-14815 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14826) [R] Implement bindings for lubridate::dst()
[ https://issues.apache.org/jira/browse/ARROW-14826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14826. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12431 [https://github.com/apache/arrow/pull/12431] > [R] Implement bindings for lubridate::dst() > --- > > Key: ARROW-14826 > URL: https://issues.apache.org/jira/browse/ARROW-14826 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15673) [R] Error gracefully if DuckDB isn't installed
[ https://issues.apache.org/jira/browse/ARROW-15673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15673. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12486 [https://github.com/apache/arrow/pull/12486] > [R] Error gracefully if DuckDB isn't installed > -- > > Key: ARROW-15673 > URL: https://issues.apache.org/jira/browse/ARROW-15673 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 6h > Remaining Estimate: 0h > > Right now, the function {{to_duckdb()}} doesn't check to confirm that DuckDB > is available. The error message isn't the worst (it'll mention > {{duckdb::...}} not being found) it would be nicer to specifically tell folks > they need to install the duckdb package. -- This message was sent by Atlassian Jira (v8.20.1#820001)
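The pattern the ticket asks for — check for the optional dependency up front and raise a message naming the package to install — is generic. A Python analogue of R's `requireNamespace()` check (hypothetical helper names, not the arrow R package's code):

```python
# Sketch of failing gracefully when an optional dependency is missing:
# detect absence before use and raise an actionable error message.
import importlib.util

def require_package(name, hint):
    # find_spec returns None for a top-level module that isn't installed.
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"The '{name}' package is required. {hint}")

try:
    require_package("definitely_not_installed_pkg", "Install it first.")
    missing_raised = False
except ImportError:
    missing_raised = True
# missing_raised is True: the user gets a clear install hint instead of a
# confusing "object not found" error at call time.
```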
[jira] [Resolved] (ARROW-15697) [R] Add logo and meta tags to pkgdown site
[ https://issues.apache.org/jira/browse/ARROW-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15697. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12439 [https://github.com/apache/arrow/pull/12439] > [R] Add logo and meta tags to pkgdown site > -- > > Key: ARROW-15697 > URL: https://issues.apache.org/jira/browse/ARROW-15697 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Danielle Navarro >Assignee: Danielle Navarro >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The pkgdown site currently doesn't use the Arrow logo and doesn't have nice > social media preview images -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498144#comment-17498144 ] Jonathan Keane commented on ARROW-15785: Do the Python [1] and R [2] benchmarks for single file reads do this? Oddly(?) The python benchmarks do show a jump around January: https://conbench.ursa.dev/benchmarks/8c5cc1a939d8485eb6c42af83f82c8c0/ https://conbench.ursa.dev/benchmarks/1b8d2dae6f664fd19579071a7cf7766b/ But the corresponding R ones do not: https://conbench.ursa.dev/benchmarks/ca493bf17af84ae5babd97f385b69afc/ [1] https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/file_benchmark.py [2] https://github.com/ursacomputing/arrowbench/blob/main/R/bm-read-file.R > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146 ] Jonathan Keane commented on ARROW-15785: I think this is the PR that introduced the regression (though I might be totally off or it's a different one...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue, and/or we should alert louder when there are this many high-change benchmarks (e.g. the file-read benchmark z-scores range from -76 to -759, and we alert at -5). > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
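The z-score check being discussed works roughly like this: compare the new benchmark result against the distribution of recent results and flag it when it falls more than the threshold number of standard deviations below the mean. This is a hypothetical sketch of that comparison, not conbench's actual implementation:

```python
# Rough sketch of a z-score regression alert. For a "higher is better"
# metric, a slowdown produces a negative z-score; alert when it crosses
# the threshold (-5 in the conbench setup discussed above).
from statistics import mean, stdev

def z_score(history, new_result):
    return (new_result - mean(history)) / stdev(history)

def classify(history, new_result, threshold=-5.0):
    z = z_score(history, new_result)
    return ("regression" if z <= threshold else "ok"), z

# Stable history around 100 items/sec, then a large drop:
history = [100.0, 101.0, 99.0, 100.5, 99.5]
status, z = classify(history, 60.0)
# status == "regression"; z is far below the -5 threshold
```

With very stable history the denominator is tiny, which is how z-scores like -759 arise; a severity tier above the plain threshold (as the comment suggests) would distinguish those from borderline -5 cases.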
[jira] [Comment Edited] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498146#comment-17498146 ] Jonathan Keane edited comment on ARROW-15785 at 2/25/22, 2:16 PM: -- I think this is the PR that introduced the regression (though I might be totally off or it's a different regression...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue or|and we should alert louder when there are this many high-change benchmarks (e.g. the file-read benchmark z-scores range from a -76 to -759, and we alert at -5) was (Author: jonkeane): I think this is the PR that introduced the regression (though I might be totally off or it's a different one...) https://github.com/apache/arrow/pull/11991#issuecomment-1009216946 And the conbench run: https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/ We should probably have the conbench bot alert more loudly that there are regressions of this magnitude. That 5% there is supposed to indicate that there's an issue, but we might have that set too low such that there's alarm fatigue or|and we should alert louder when there are this many high-change benchmarks (e.g. 
the file-read benchmark z-scores range from a -76 to -759, and we alert at -5) > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
[ https://issues.apache.org/jira/browse/ARROW-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498464#comment-17498464 ] Jonathan Keane commented on ARROW-15785: I've also raised https://github.com/conbench/conbench/issues/307 since this should have been alerted a bit more loudly IMO > [Benchmarks] Add conbench benchmark for single-file parquet reads > - > > Key: ARROW-15785 > URL: https://issues.apache.org/jira/browse/ARROW-15785 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Release 7.0.0 introduced a regression in parquet single file reads. We > should add a macro-level benchmark that does single-file reads to help us > detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-13616) [R] Cheat Sheet Structure
[ https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-13616. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12445 [https://github.com/apache/arrow/pull/12445] > [R] Cheat Sheet Structure > - > > Key: ARROW-13616 > URL: https://issues.apache.org/jira/browse/ARROW-13616 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 5.0.0 >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Assignee: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Hi > I've created a folder on Google Drive that contains: > * SVG (Inkscape) drafts for the cheat sheet > * Arrow hex icon (SVG) > * *A document with the proposed text, please feel free to comment here* > Link: > [https://drive.google.com/drive/folders/1YEJdPuhLCwkl8r3hSBxbnP1hYq04fW13?usp=sharing] > Please open it and I'll give access to anyone who wants. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15599) [R] Convert a column as a sub-second timestamp from CSV file with the `T` col type option
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15599. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12474 [https://github.com/apache/arrow/pull/12474] > [R] Convert a column as a sub-second timestamp from CSV file with the `T` col > type option > - > > Key: ARROW-15599 > URL: https://issues.apache.org/jira/browse/ARROW-15599 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 6.0.1 > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I tried to read the csv column type as timestamp, but I could only get it to > work well when `col_types` was not specified. > I'm sorry if I missed something and this is the expected behavior. (It would > be great if you could add an example with `col_types` in the documentation.) 
> {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > t_string <- tibble::tibble( > x = "2018-10-07 19:04:05.005" > ) > write_csv_arrow(t_string, "tmp.csv") > read_csv_arrow( > "tmp.csv", > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "?", > skip = 1, > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > skip = 1, > as_data_frame = FALSE > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE, > skip = 1, > timestamp_parsers = "%Y-%m-%d %H:%M:%S" > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
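The failure mode in the reprex is a general parsing rule: a whole-second timestamp format has no slot for the fractional part, so `'2018-10-07 19:04:05.005'` cannot convert to `timestamp[s]`. Python's `strptime` behaves the same way, which makes a convenient illustration (this is the stdlib parser, not Arrow's CSV converter):

```python
# A whole-second format rejects fractional seconds; a format that includes
# them (%f) parses the same string cleanly.
from datetime import datetime

ts = "2018-10-07 19:04:05.005"

# Whole-second format: the trailing ".005" is unconverted input -> ValueError.
try:
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    parsed_without_fraction = True
except ValueError:
    parsed_without_fraction = False

# Sub-second-aware format parses, keeping the fraction as microseconds.
parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
# parsed.microsecond == 5000  (0.005 s)
```

This mirrors why the fix for the ticket needs the `T` column type to map to a sub-second timestamp unit rather than `timestamp[s]`.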
[jira] [Assigned] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15678: -- Assignee: (was: Jonathan Keane) > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 8h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500305#comment-17500305 ] Jonathan Keane commented on ARROW-15678: The linked pull request has the start of this, but there's still an unidentified segfault in one of the tests. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 8h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)

[jira] [Commented] (ARROW-15798) [R][C++] Discussion: Plans for date casting from int to support an origin option?
[ https://issues.apache.org/jira/browse/ARROW-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500810#comment-17500810 ] Jonathan Keane commented on ARROW-15798: In response to what Dragoș posted there, I do wonder about date64 being turned into POSIXct when it comes back to R; that seems a bit off. There would be some precision lost going from integer milliseconds to a float with fractional days (due to float imprecision) if we did date64 -> a date backed by a float, but at least that has logical type consistency. Thoughts? > [R][C++] Discussion: Plans for date casting from int to support an origin > option? > - > > Key: ARROW-15798 > URL: https://issues.apache.org/jira/browse/ARROW-15798 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > 2 questions: > * plans to support an origin option for int -> date32 casting? > * plans to support double -> date32 casting? > === > Currently the casting from integer to date works, but assumes epoch > (1970-01-01) as the origin. > {code:r} > > a <- Array$create(32L) > > a$cast(date32()) > Array > > [ > 1970-02-02 > ] > {code} > Would it make sense to have an {{origin}} option that would allow the user to > fine-tune the casting? For example, in R the {{base::as.Date()}} function has > such an argument > {code:r} > > as.Date(32, origin = "1970-01-02") > [1] "1970-02-03" > {code} > We have a potential workaround in R (once we support date & duration > arithmetic), but I was wondering if there might be more general interest in > this. > A secondary aspect (as my R example shows): R supports casting to date not only > from integers, but also from doubles. Would there be interest in that? If need be > I can split this into several tickets. > Are there any plans in either of these 2 directions? -- This message was sent by Atlassian Jira (v8.20.1#820001)
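The workaround alluded to in the ticket reduces to arithmetic: casting an integer day count with a non-epoch origin is the same as adding the offset between that origin and 1970-01-01, then doing the ordinary epoch-based cast. A stdlib Python sketch (illustrative, not an Arrow cast kernel):

```python
# Origin-shifted integer-to-date conversion: normalize to days-since-epoch,
# then convert as usual.
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def int_to_date(days, origin=EPOCH):
    shift = (origin - EPOCH).days  # 0 for the default epoch origin
    return EPOCH + timedelta(days=days + shift)

# Matches the Jira examples:
# int_to_date(32)                           -> date(1970, 2, 2)
# int_to_date(32, origin=date(1970, 1, 2))  -> date(1970, 2, 3),
# like base::as.Date(32, origin = "1970-01-02")
```

The same shift is what a date32-plus-duration expression would compute once date/duration arithmetic is supported, which is why it works as an R-side workaround.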
[jira] [Resolved] (ARROW-14808) [R] Implement bindings for lubridate::date
[ https://issues.apache.org/jira/browse/ARROW-14808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14808. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12433 [https://github.com/apache/arrow/pull/12433] > [R] Implement bindings for lubridate::date > -- > > Key: ARROW-14808 > URL: https://issues.apache.org/jira/browse/ARROW-14808 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 13h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15838) [C++] Key column behavior in joins
Jonathan Keane created ARROW-15838: -- Summary: [C++] Key column behavior in joins Key: ARROW-15838 URL: https://issues.apache.org/jira/browse/ARROW-15838 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jonathan Keane By default, dplyr (and possibly pandas too?) coalesces the key column for full joins so that it contains the (non-null) values from both key columns: {code} > left <- tibble::tibble( key = c(1, 2), A = c(0, 1), ) left_tab <- Table$create(left) > right <- tibble::tibble( key = c(2, 3), B = c(0, 1), ) right_tab <- Table$create(right) > left %>% full_join(right) Joining, by = "key" # A tibble: 3 × 3 key A B 1 1 0 NA 2 2 1 0 3 3 NA 1 > left_tab %>% full_join(right_tab) %>% collect() # A tibble: 3 × 3 key A B 1 2 1 0 2 1 0 NA 3 NA NA 1 {code} And for a right join, we would expect the key from the right table to be in the result, but we get the key from the left instead: {code} > left <- tibble::tibble( key = c(1, 2), A = c(0, 1), ) left_tab <- Table$create(left) > right <- tibble::tibble( key = c(2, 3), B = c(0, 1), ) right_tab <- Table$create(right) > left %>% right_join(right) Joining, by = "key" # A tibble: 2 × 3 key A B 1 2 1 0 2 3 NA 1 > left_tab %>% right_join(right_tab) %>% collect() # A tibble: 2 × 3 key A B 1 2 1 0 2 NA NA 1 {code} Additionally, we should be able to keep both key columns with an option (cf https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32) -- This message was sent by Atlassian Jira (v8.20.1#820001)
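The expected full-join key behavior described above can be reproduced with base R's merge(), which, like dplyr::full_join(), keeps a single key column carrying the non-null key values from both inputs:

```r
# Minimal base-R sketch of the expected behavior: one key column holding
# the union of (non-null) key values from both sides of the full join.
left  <- data.frame(key = c(1, 2), A = c(0, 1))
right <- data.frame(key = c(2, 3), B = c(0, 1))

full <- merge(left, right, by = "key", all = TRUE)
full
#   key  A  B
# 1   1  0 NA
# 2   2  1  0
# 3   3 NA  1
```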
[jira] [Resolved] (ARROW-15743) [R] `skip` not connected up to `skip_rows` on open_dataset despite error messages indicating otherwise
[ https://issues.apache.org/jira/browse/ARROW-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15743. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12523 [https://github.com/apache/arrow/pull/12523] > [R] `skip` not connected up to `skip_rows` on open_dataset despite error > messages indicating otherwise > -- > > Key: ARROW-15743 > URL: https://issues.apache.org/jira/browse/ARROW-15743 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > If I open a dataset of CSVs with a schema, the error message tells me to > supply {{`skip = 1`}} if my data contains a header row (to prevent it being > read in as data), but only {{skip_rows = 1}} actually works. > {code:r} > library(arrow) > library(dplyr) > td <- tempfile() > dir.create(td) > write_dataset(mtcars, td, format = "csv") > schema <- schema(mpg = float64(), cyl = float64(), disp = float64(), hp = > float64(), > drat = float64(), wt = float64(), qsec = float64(), vs = float64(), > am = float64(), gear = float64(), carb = float64()) > open_dataset(td, format = "csv", schema = schema) %>% > collect() > #> Error in `handle_csv_read_error()`: > #> ! 
Invalid: Could not open CSV input source > '/tmp/RtmppZbpeF/file6cec135ed29c/part-0.csv': Invalid: In CSV column #0: Row > #1: CSV conversion error to double: invalid value 'mpg' > #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550 decoder_.Decode(data, > size, quoted, &value) > #> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123 status > #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554 > parser.VisitColumn(col_index, visit) > #> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:463 > arrow::internal::UnwrapOrRaise(maybe_decoded_arrays) > #> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:445 > iterator_.Next() > #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:336 ReadNext(&batch) > #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:347 ReadAll(&batches) > #> ℹ If you have supplied a schema and your data contains a header row, you > should supply the argument `skip = 1` to prevent the header being read in as > data. > open_dataset(td, format = "csv", schema = schema, skip = 1) %>% > collect() > #> Error: The following option is supported in "read_delim_arrow" functions > but not yet supported here: "skip" > open_dataset(td, format = "csv", schema = schema, skip_rows = 1) %>% > collect() > #> # A tibble: 32 × 11 > #> mpg cyl disp hp drat wt qsec vs am gear carb > #> > #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 > #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 > #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 > #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 > #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 > #> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 > #> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 > #> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 > #> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 > #> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 > #> # … with 22 more rows > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15656) [C++] [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502323#comment-17502323 ] Jonathan Keane commented on ARROW-15656: cc [~apitrou] > [C++] [R] Valgrind error with C-data interface > -- > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Jonathan Keane >Priority: Major > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937: R_execClosure (eval.c:1918) > ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) > ==10301== Uninitialised value was created by a heap allocation > ==10301==at 0x483E0F0: memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0x483E212: posix_memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0xF4756DF: arrow::(anonymous > namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) > (memory_pool.cc:365) > ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) > (memory_pool.cc:557) > ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) > ==10301==by 0xF041EC2: arrow::Status > GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1} const&) (memorypool.cpp:46) > ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) > (memorypool.cpp:28) > ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) > (memory_pool.cc:921) > ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) > (memory_pool.cc:945) > ==10301==by 0xF478A74: ResizePoolBuffer, > std::unique_ptr > (memory_pool.cc:984) > ==10301==by 0xF478A74: 
arrow::AllocateBuffer(long, arrow::MemoryPool*) > (memory_pool.cc:992) > ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) > (buffer.cc:174) > ==10301==by 0xF38CC77: arrow::(anonymous > namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > > const&, arrow::MemoryPool*, std::shared_ptr*) > (concatenate.cc:81) > ==10301== > test-dataset.R:852:3 [success] > {code} > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 > It surfaced with > https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 > Though it could be from: > https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > which added some code to make a source node from the C-Data interface. > However, the first call looks like it might be the line > https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15775) [R] Clean up as.* methods to use build_expr()
[ https://issues.apache.org/jira/browse/ARROW-15775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15775. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12563 [https://github.com/apache/arrow/pull/12563] > [R] Clean up as.* methods to use build_expr() > - > > Key: ARROW-15775 > URL: https://issues.apache.org/jira/browse/ARROW-15775 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Originally raised as part of [PR > #12433|https://github.com/apache/arrow/pull/12433]. > {quote} > This implementation made me think of the various as.* methods we have defined > [1] (since this is similar to as.Date()). Which all use a simpler setup to > create a cast operation. However, I noticed that for those, they are using > Expression$create(...) rather than the build_expr(...) helper [2]. That > build_expr(...) here should handle the wrapping of R objects into Scalars > (...) > (...)We should also open a jira if one doesn't exist to clean up those as.* > methods to use build_expr() > {quote} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15701) [R] month() should allow integer inputs
[ https://issues.apache.org/jira/browse/ARROW-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15701. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12482 [https://github.com/apache/arrow/pull/12482] > [R] month() should allow integer inputs > --- > > Key: ARROW-15701 > URL: https://issues.apache.org/jira/browse/ARROW-15701 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > *Conclusion*: we will implement this in the R bindings - month will allow > integer input: {{month(int)}} will return {{int}} as long as {{int}} is > between 1 and 12. > == > In R, more specifically in {{{}lubridate{}}}, {{month()}} can be used both to > get and set the corresponding component of a date. This means {{month()}} > accepts integer inputs. > {code:r} > suppressPackageStartupMessages(library(lubridate)) > month(1:12) > #> [1] 1 2 3 4 5 6 7 8 9 10 11 12 > month(1:13) > #> Error in month.numeric(1:13): Values are not in 1:12 > {code} > Solving this would allow us to implement bindings such as `semester()` in a > manner closer to {{{}lubridate{}}}. 
> {code:r} > suppressPackageStartupMessages(library(dplyr)) > suppressPackageStartupMessages(library(lubridate)) > test_df <- tibble( > month_as_int = c(1:12, NA), > month_as_char_pad = ifelse(month_as_int < 10, paste0("0", month_as_int), > month_as_int), > dates = as.Date(paste0("2021-", month_as_char_pad, "-15")) > ) > test_df %>% > mutate( > sem_date = semester(dates), > sem_month_as_int = semester(month_as_int)) > #> # A tibble: 13 × 5 > #> month_as_int month_as_char_pad dates sem_date sem_month_as_int > #> > #> 1 1 01 2021-01-15 1 1 > #> 2 2 02 2021-02-15 1 1 > #> 3 3 03 2021-03-15 1 1 > #> 4 4 04 2021-04-15 1 1 > #> 5 5 05 2021-05-15 1 1 > #> 6 6 06 2021-06-15 1 1 > #> 7 7 07 2021-07-15 2 2 > #> 8 8 08 2021-08-15 2 2 > #> 9 9 09 2021-09-15 2 2 > #> 10 10 10 2021-10-15 2 2 > #> 11 11 11 2021-11-15 2 2 > #> 12 12 12 2021-12-15 2 2 > #> 13 NA NA NA NA NA > {code} > Currently, attempts to use {{month()}} with integer inputs error with: > {code:r} > Function 'month' has no kernel matching input types (array[int32]) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
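The agreed behavior above ({{month(int)}} passes integers through after a 1:12 range check) can be sketched in plain R, mirroring lubridate's numeric method; the function name here is hypothetical, not the actual binding:

```r
# Hypothetical sketch of the binding's semantics: integers in 1:12 pass
# through unchanged (NA is preserved); anything else errors, as lubridate
# does for month.numeric().
month_integerish <- function(x) {
  ok <- is.na(x) | (x >= 1 & x <= 12)
  if (!all(ok)) stop("Values are not in 1:12")
  as.integer(x)
}

month_integerish(1:12)
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
```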
[jira] [Commented] (ARROW-15168) [R] Add S3 generics to create main Arrow objects
[ https://issues.apache.org/jira/browse/ARROW-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504651#comment-17504651 ] Jonathan Keane commented on ARROW-15168: This sounds good to me. We do have a few of these helpers (though they aren't generics...) like {{arrow_table}}. I'm fine with transitioning all of those to {{as_...}} versions of themselves, or we could drop the {{as_}} and repurpose them (AFAIK {{arrow_table}} is literally an alias for {{Table$create}} right now.) > [R] Add S3 generics to create main Arrow objects > > > Key: ARROW-15168 > URL: https://issues.apache.org/jira/browse/ARROW-15168 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Priority: Major > > Right now we create Tables, RecordBatches, ChunkedArrays, and Arrays using > the corresponding {{$create()}} functions (or a few shortcut functions). This > works well for converting other Arrow or base R types to Arrow objects but > doesn’t work well for objects in other packages (e.g., sf). This is related > to ARROW-14378 in that it provides a mechanism for other packages to support > writing objects to Arrow in a more Arrow-native form instead of serializing > attributes that are unlikely to be readable in other packages. Many of these > came up when experimenting with {{carrow}} when trying to provide seamless > arrow package compatibility for S3 objects that wrap external pointers to C > API data structures. S3 is a good way to do this because the other package > doesn't have to put arrow in {{Imports}} since it's a heavy dependency. 
> For argument’s sake I’ll propose adding the following methods: > - {{as_arrow_array(x, type = NULL)}} -> {{Array}} > - {{as_arrow_chunked_array(x, type = NULL)}} -> {{ChunkedArray}} > - {{as_arrow_record_batch(x, schema = NULL)}} -> {{RecordBatch}} > - {{as_arrow_table(x, schema = NULL)}} -> {{Table}} > - {{as_arrow_data_type(x)}} -> {{DataType}} > - {{as_arrow_record_batch_reader(x, schema = NULL)}} -> > {{RecordBatchReader}} > I’ll note that we use {{as_adq()}} internally for similar reasons (to convert a > few different object types into an arrow dplyr query when that’s the data > structure we need). > As part of this ticket, if we choose to move forward, we should implement the > default methods with some internal consistency (i.e., somebody wanting to > provide Arrow support in a package probably only has to implement > {{as_arrow_array()}} to get most support). -- This message was sent by Atlassian Jira (v8.20.1#820001)
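A minimal sketch of what one of the proposed generics could look like; the default-method fallback to {{Array$create()}} is an assumption based on the discussion above, not the eventual implementation:

```r
# Proposed generic (name taken from the ticket); S3 dispatch lets another
# package register a method without putting arrow in its Imports.
as_arrow_array <- function(x, ..., type = NULL) {
  UseMethod("as_arrow_array")
}

# Assumed fallback: delegate to the existing constructor path.
as_arrow_array.default <- function(x, ..., type = NULL) {
  arrow::Array$create(x, type = type)
}

# Another package (e.g. sf) would then only need to provide:
# as_arrow_array.sfc <- function(x, ..., type = NULL) { ...convert geometry... }
```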
[jira] [Commented] (ARROW-12212) [R][CI] Test nightly on solaris
[ https://issues.apache.org/jira/browse/ARROW-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504655#comment-17504655 ] Jonathan Keane commented on ARROW-12212: I'm closing this since we no longer need to jump through this particularly special hoop. > [R][CI] Test nightly on solaris > --- > > Key: ARROW-12212 > URL: https://issues.apache.org/jira/browse/ARROW-12212 > Project: Apache Arrow > Issue Type: New Feature > Components: Continuous Integration, R >Reporter: Neal Richardson >Priority: Major > > Followup to ARROW-10734. Setting up a solaris vm on github actions may be > possible. We can try to setup https://github.com/vmactions/solaris-vm with R > from https://files.r-hub.io/opencsw/. A temporary solution could be a nightly > r-hub build kicked off by the arrow-r-nightly CI; it would email me with the > results. Not ideal but it would at least alert us to issues closer to when > they are merged and not just at release time. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-12212) [R][CI] Test nightly on solaris
[ https://issues.apache.org/jira/browse/ARROW-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12212. -- Resolution: Won't Fix > [R][CI] Test nightly on solaris > --- > > Key: ARROW-12212 > URL: https://issues.apache.org/jira/browse/ARROW-12212 > Project: Apache Arrow > Issue Type: New Feature > Components: Continuous Integration, R >Reporter: Neal Richardson >Priority: Major > > Followup to ARROW-10734. Setting up a solaris vm on github actions may be > possible. We can try to setup https://github.com/vmactions/solaris-vm with R > from https://files.r-hub.io/opencsw/. A temporary solution could be a nightly > r-hub build kicked off by the arrow-r-nightly CI; it would email me with the > results. Not ideal but it would at least alert us to issues closer to when > they are merged and not just at release time. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15914) [CI] Separate verification from tests in nightly
Jonathan Keane created ARROW-15914: -- Summary: [CI] Separate verification from tests in nightly Key: ARROW-15914 URL: https://issues.apache.org/jira/browse/ARROW-15914 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Jonathan Keane Could we split up the nightly report that has the test builds from the report that has the verification builds (and maybe include all the packaging into the separate verification builds?) The verification builds tend to take much longer than the test builds, so are frequently still pending even with 6 hours between starting the test builds and running the report. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14199) [R] bindings for format where possible
[ https://issues.apache.org/jira/browse/ARROW-14199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14199. Resolution: Fixed Issue resolved by pull request 12319 [https://github.com/apache/arrow/pull/12319] > [R] bindings for format where possible > -- > > Key: ARROW-14199 > URL: https://issues.apache.org/jira/browse/ARROW-14199 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 7h 40m > Remaining Estimate: 0h > > Now that we have {{strftime}}, we should also be able to make bindings for > {{format()}} as well. This might be complicated / we might need to punt on a > bunch of types that {{format()}} can take but arrow doesn't (yet) support > formatting of them, that's ok. > Though some of those might be wrappable with a handful of kernels stacked > together: {{format(float)}} might be round + cast to character -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14797) [R] write_feather R Arrow freezing on windows 11
[ https://issues.apache.org/jira/browse/ARROW-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507062#comment-17507062 ] Jonathan Keane commented on ARROW-14797: Sorry for the long delay here. Have you tried this again since we released 7.0.0? > [R] write_feather R Arrow freezing on windows 11 > > > Key: ARROW-14797 > URL: https://issues.apache.org/jira/browse/ARROW-14797 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 6.0.1 > Environment: windows 11, rstudio/r 4.1.2 >Reporter: Xavier Timbeau >Priority: Critical > > When writing multiple large files using write_feather (possibly parquet also) > on windows 11 using the arrow R package, write_feather is freezing at some > point (after a few files copied). Changing cpu_count from 16 to 4 seems to > solve the issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15665) [C++] Add error handling option to StrptimeOptions
[ https://issues.apache.org/jira/browse/ARROW-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507179#comment-17507179 ] Jonathan Keane commented on ARROW-15665: (3) sounds wrong, but like I said before: you ([~dragosmg]) should look at what happens in python or whether there is some standard where that is indeed the right thing. (1) + (2) both sound like they could be implemented as "if strptime fails to parse, (optionally) return null". No reason for us to go too far into why it didn't parse. > [C++] Add error handling option to StrptimeOptions > -- > > Key: ARROW-15665 > URL: https://issues.apache.org/jira/browse/ARROW-15665 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > We want to have an option to either raise, ignore or return NA in case of > format mismatch. > See > [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] > and lubridate > [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] > for examples. -- This message was sent by Atlassian Jira (v8.20.1#820001)
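For reference, base R's strptime() already behaves like the proposed "return null on failure" option: a format mismatch yields NA rather than an error, which is the semantics options (1)/(2) above would expose through StrptimeOptions:

```r
# Parse failures come back as NA instead of raising an error.
strptime("not-a-date", format = "%Y-%m-%d")  # NA
strptime("2022-03-15", format = "%Y-%m-%d")  # parses successfully
```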
[jira] [Closed] (ARROW-14797) [R] write_feather R Arrow freezing on windows 11
[ https://issues.apache.org/jira/browse/ARROW-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-14797. -- Resolution: Resolved > [R] write_feather R Arrow freezing on windows 11 > > > Key: ARROW-14797 > URL: https://issues.apache.org/jira/browse/ARROW-14797 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 6.0.1 > Environment: windows 11, rstudio/r 4.1.2 >Reporter: Xavier Timbeau >Priority: Critical > > When writing multiple large files using write_feather (possibly parquet also) > on windows 11 using the arrow R package, write_feather is freezing at some > point (after a few files copied). Changing cpu_count from 16 to 4 seems to > solve the issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15970) [R] [CI] Re-enable DuckDB dev tests
Jonathan Keane created ARROW-15970: -- Summary: [R] [CI] Re-enable DuckDB dev tests Key: ARROW-15970 URL: https://issues.apache.org/jira/browse/ARROW-15970 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jonathan Keane When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should re-enable the DuckDB dev branch tests that we disabled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15970) [R] [CI] Re-enable DuckDB dev tests
[ https://issues.apache.org/jira/browse/ARROW-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15970: --- Description: When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should re-enable the DuckDB dev branch tests that we disabled. PR that disabled: https://github.com/apache/arrow/pull/12666 was:When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should re-enable the DuckDB dev branch tests that we disabled. > [R] [CI] Re-enable DuckDB dev tests > --- > > Key: ARROW-15970 > URL: https://issues.apache.org/jira/browse/ARROW-15970 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > > When https://github.com/duckdb/duckdb/issues/3258 is resolved, we should > re-enable the DuckDB dev branch tests that we disabled. > PR that disabled: https://github.com/apache/arrow/pull/12666 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15973) [CI] Split nightly reports into three: Tests, Packaging, Release
Jonathan Keane created ARROW-15973: -- Summary: [CI] Split nightly reports into three: Tests, Packaging, Release Key: ARROW-15973 URL: https://issues.apache.org/jira/browse/ARROW-15973 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15929) [R] io_thread_count is actually the CPU thread count
[ https://issues.apache.org/jira/browse/ARROW-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15929. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12633 [https://github.com/apache/arrow/pull/12633] > [R] io_thread_count is actually the CPU thread count > > > Key: ARROW-15929 > URL: https://issues.apache.org/jira/browse/ARROW-15929 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: David Li >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > [https://github.com/apache/arrow/blob/5cb5afc40547b4f75739e31ff8632c71a10d3084/r/src/threadpool.cpp#L51-L57] > This accidentally references the wrong pool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15875) [R][C++] Include md5sum in S3 method for GetFileInfo()
[ https://issues.apache.org/jira/browse/ARROW-15875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15875. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12623 [https://github.com/apache/arrow/pull/12623] > [R][C++] Include md5sum in S3 method for GetFileInfo() > -- > > Key: ARROW-15875 > URL: https://issues.apache.org/jira/browse/ARROW-15875 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 7.0.0 >Reporter: Carl Boettiger >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > GetFileInfo() seems to include mtime, size, path and type. For an S3 system, > it would be nice to be able to reference the md5 sum without transferring the > file, (which I think the server will have already computed?). This seems > like the logical place to include it (though I wouldn't object to a more > visible method too). > > > (though type isn't clear to me, since it appears to be an integer) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15627) [R] Support unify_schemas for union datasets
[ https://issues.apache.org/jira/browse/ARROW-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15627. Resolution: Fixed Issue resolved by pull request 12629 [https://github.com/apache/arrow/pull/12629] > [R] Support unify_schemas for union datasets > > > Key: ARROW-15627 > URL: https://issues.apache.org/jira/browse/ARROW-15627 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: dataset, pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Also out of discussion on [https://github.com/apache/arrow/issues/12371] > You can unify schemas between different parquet files, but it seems like you > can't union together two (or more) datasets that have different schemas. This > is odd, because we do compute the unified schema on [this > line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189], > only to later assert all the schemas are the same. > {code:R} > library(arrow) > library(dplyr) > df1 <- arrow_table(x = array(c(1, 2, 3)), >y = array(c("a", "b", "c"))) > df2 <- arrow_table(x = array(c(4, 5)), >z = array(c("d", "e"))) > df1 %>% write_dataset("example1", format="parquet") > df2 %>% write_dataset("example2", format="parquet") > ds1 <- open_dataset("example1", format="parquet") > ds2 <- open_dataset("example2", format="parquet") > # These don't work > ds <- c(ds1, ds2) # c() actually does the same thing > ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema > ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas > = TRUE) > # This does > ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), > format="parquet", unify_schemas = TRUE) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14679) [R] [C++] Handle suffix argument in joins
[ https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14679. Resolution: Fixed Issue resolved by pull request 12113 [https://github.com/apache/arrow/pull/12113] > [R] [C++] Handle suffix argument in joins > - > > Key: ARROW-14679 > URL: https://issues.apache.org/jira/browse/ARROW-14679 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Jonathan Keane >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available, query-engine > Fix For: 8.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > If there is a name collision, we need to do something > https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31 > A few notes: > * arrow doesn't seem to actually be able to apply the prefixes (I'm getting > errors when trying); I couldn't tell if there were tests of this (I couldn't > find any), so I'm not sure if I'm calling this wrong or if it's not working at > all. > * arrow always appends the affixes (whereas dplyr only adds them if there is > a name collision) > * arrow only supports prefixes (can we configure this, or ask the clients to > provide new names?) in the tests I wrote I've worked around this, but it > would be nice to be able to match dplyr/allow things other than prefix -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15802) [R] Implement bindings for lubridate::make_datetime() and lubridate::make_date()
[ https://issues.apache.org/jira/browse/ARROW-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15802. Resolution: Fixed Issue resolved by pull request 12622 [https://github.com/apache/arrow/pull/12622] > [R] Implement bindings for lubridate::make_datetime() and > lubridate::make_date() > > > Key: ARROW-15802 > URL: https://issues.apache.org/jira/browse/ARROW-15802 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > plus {{base::ISOdate()}} and {{base::ISOdatetime()}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15489) [R] Expand RecordBatchReader use-ability
[ https://issues.apache.org/jira/browse/ARROW-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15489. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12567 [https://github.com/apache/arrow/pull/12567] > [R] Expand RecordBatchReader use-ability > - > > Key: ARROW-15489 > URL: https://issues.apache.org/jira/browse/ARROW-15489 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > In ARROW-14745 we thought about having {{to_arrow()}} return a > RecordBatchReader only. Though this would work, it's not quite as friendly as > wrapping the RecordBatchReader, since {{arrow_dplyr_query}}s have a (slightly) > nicer print method. > We should add more methods and a print method that makes it clearer what a > RecordBatchReader is and what it might be useful for (e.g. continuing a dplyr > query). > Is it possible that we could make up a name/class that encompasses all of the > Arrow tabular-like things that we could wrap all of these up in (for UX > purposes only, really)? We have ArrowTabular now; maybe we lean into that > more (alongside a LazyArrowTabular like dbplyr has?). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510845#comment-17510845 ] Jonathan Keane commented on ARROW-16007: Good catch, as far as I can see there was not a principled reason to diverge like this. We would of course welcome a PR to add tests for this + update the Arrow {{grepl}} behavior to match base R's. Let us know if you would like any pointers > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. > I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? 
> Reprex: > > {code:r} > library(arrow, warn.conflicts = FALSE, quietly = TRUE) > library(dplyr, warn.conflicts = FALSE, quietly = TRUE) > library(stringr, quietly = TRUE) > alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_)) > alpha_dataset <- InMemoryDataset$create(alpha_df) > mutate(alpha_df, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> FALSE NA > mutate(alpha_dataset, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) |> > collect() > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> NA NA > # base R grepl returns FALSE for NA > grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex > #> [1] TRUE FALSE FALSE > grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring > #> [1] TRUE FALSE FALSE > # stringr::str_detect returns NA for NA > str_detect(alpha_df$alpha, "a") > #> [1] TRUE FALSE NA > alpha_array <- Array$create(alpha_df$alpha) > # arrow functions return null for null (NA) > call_function("match_substring_regex", alpha_array, options = list(pattern = > "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > call_function("match_substring", alpha_array, options = list(pattern = "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15857) [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)
[ https://issues.apache.org/jira/browse/ARROW-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511856#comment-17511856 ] Jonathan Keane commented on ARROW-15857: {sass} just had an update (purportedly to fix warnings in gcc), but that has not resolved these issues. Though, _on cran_ the devel-fedora-clang build is just fine: https://cran.r-project.org/web/checks/check_results_sass.html The image itself is a bit stale; I've reported that to rhub to ask them to re-enable their jobs: https://github.com/r-hub/rhub-linux-builders/issues/59 but I don't think that should matter. The symbol it's complaining about _is_ part of the stdlib: {{_ZTINSt3__113basic_ostreamIcNS_11char_traitsIc}} The stdlib is configured slightly differently in this setup to match the special CRAN setup; see: https://github.com/apache/arrow/blob/623a15e7f7a45578733956714c8dddcc9f66f015/ci/scripts/r_docker_configure.sh#L45-L55 > [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency) > -- > > Key: ARROW-15857 > URL: https://issues.apache.org/jira/browse/ARROW-15857 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dewey Dunnington >Priority: Major > > Starting 2022-03-03, we get a failure on the rhub/fedora-clang-devel nightly > build. It seems to be a linking error, but nothing in the sass package seems > to have changed for some time (last update May 2021). 
> https://github.com/ursacomputing/crossbow/runs/5444005154?check_suite_focus=true#step:5:3007 > Build log for the sass package: > {noformat} > #14 1099.2 make[1]: Entering directory > '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src' > #14 1099.2 /opt/R-devel/lib64/R/share/make/shlib.mk:18: warning: overriding > recipe for target 'shlib-clean' > #14 1099.2 Makevars:12: warning: ignoring old recipe for target 'shlib-clean' > #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG > -I./libsass/include -I/usr/local/include -fpic -g -O2 -c compile.c -o > compile.o > #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG > -I./libsass/include -I/usr/local/include -fpic -g -O2 -c init.c -o init.o > #14 1099.2 MAKEFLAGS= CC="/usr/bin/clang" CFLAGS="-g -O2 " > CXX="/usr/bin/clang++ -std=gnu++14 -stdlib=libc++" AR="ar" > LDFLAGS="-L/usr/local/lib64" make -C libsass > #14 1099.2 make[2]: Entering directory > '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src/libsass' > #14 1099.2 /usr/bin/clang -g -O2 -O2 -I ./include -fPIC -c -o src/cencode.o > src/cencode.c > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast.o src/ast.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_values.o src/ast_values.cpp > #14 1099.2 src/ast_values.cpp:484:23: warning: loop variable 'numerator' > creates a copy from type 'const std::__1::basic_string std::__1::char_traits, std::__1::allocator>' > [-Wrange-loop-construct] > #14 1099.2 for (const auto numerator : numerators) > #14 1099.2 ^ > #14 1099.2 src/ast_values.cpp:484:12: note: use reference type 'const > std::__1::basic_string, > std::__1::allocator> &' to prevent copying > #14 1099.2 for (const auto numerator : numerators) > #14 1099.2^~ > #14 1099.2 & > #14 1099.2 src/ast_values.cpp:486:23: warning: loop variable 'denominator' > creates a copy from type 'const std::__1::basic_string 
std::__1::char_traits, std::__1::allocator>' > [-Wrange-loop-construct] > #14 1099.2 for (const auto denominator : denominators) > #14 1099.2 ^ > #14 1099.2 src/ast_values.cpp:486:12: note: use reference type 'const > std::__1::basic_string, > std::__1::allocator> &' to prevent copying > #14 1099.2 for (const auto denominator : denominators) > #14 1099.2^~~~ > #14 1099.2 & > #14 1099.2 2 warnings generated. > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_supports.o src/ast_supports.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_cmp.o src/ast_sel_cmp.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_unify.o src/ast_sel_unify.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_super.o src/ast_sel_super.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_weave.o src/ast_sel_weave.cpp > #14
[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511971#comment-17511971 ] Jonathan Keane commented on ARROW-16007: You've definitely identified the two paths for this. I agree with your hesitance that empty string and nulls shouldn't be conflated. The string conversion from R is a bit complicated, but https://github.com/apache/arrow/blob/ddb663b1724034f64cc53d62bd2d5a4e8fa42954/r/src/r_to_arrow.cpp#L777-L824 (and the rest of that file) is a good starting point. All of that being said, I would probably go the second route you mention (and sorry for not responding with this earlier!): > but then it occurred to me that if this is just a special case in R, maybe > it's better to do it on the R side and just change NA to FALSE in the return > value of the binding of grepl? You could put a call to {{if_else}} + {{is.na}} bindings inside the {{grepl}} binding and get the behavior in R. We do have some support via options for different null-handling behaviors for other functions, but I suspect R is a bit of an outlier here (I tried to construct a reprex in Python to see what it does, but every {{re.match()}} with anything missing-like is a type error!). > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. 
> I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? -- This message was sent by Atlassian Jira (v8.20.1#820001)
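The R-side route suggested in the comment above (mapping {{NA}} results back to {{FALSE}} inside the binding) could look roughly like this sketch; the {{register_binding()}}/{{call_binding()}} wiring and the {{arrow_grepl_kernel}} helper are assumptions for illustration, not the merged code:

{code:r}
# Hypothetical sketch: wrap the grepl binding so null results become FALSE,
# matching base::grepl(), which returns FALSE for NA inputs.
register_binding("grepl", function(pattern, x, ignore.case = FALSE, fixed = FALSE) {
  out <- arrow_grepl_kernel(pattern, x, ignore.case, fixed)  # assumed existing kernel dispatch
  # Builds if_else(is.na(out), FALSE, out) as Expressions, evaluated at collect() time
  call_binding("if_else", call_binding("is.na", out), FALSE, out)
})
{code}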
[jira] [Commented] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512025#comment-17512025 ] Jonathan Keane commented on ARROW-16007: Yeah, the (possibly slightly misnamed) {{call_binding}} builds an Expression that will later be evaluated at {{collect}} time. A similar setup is used at https://github.com/apache/arrow/blob/012ae6e961dbb472c7862f40be5dc972a9bd3e91/r/R/dplyr-funcs-datetime.R#L221-L226 in {{ISOdatetime}} to turn {{sec = NA}} into {{0}} (another R oddity!) > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. > I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()
[ https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15098. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12506 [https://github.com/apache/arrow/pull/12506] > [R] Add binding for lubridate::duration() and/or as.difftime() > -- > > Key: ARROW-15098 > URL: https://issues.apache.org/jira/browse/ARROW-15098 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dewey Dunnington >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 14.5h > Remaining Estimate: 0h > > After ARROW-14941 we have support for the duration type; however, there is no > binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr > evaluation that could create these objects. I'm actually not sure if we > should bind {{lubridate::duration}} since it returns a custom S4 class that's > identical in function to base R's difftime. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16030) [R] Add a schema method for arrow_dplyr_query
Jonathan Keane created ARROW-16030: -- Summary: [R] Add a schema method for arrow_dplyr_query Key: ARROW-16030 URL: https://issues.apache.org/jira/browse/ARROW-16030 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Jonathan Keane We have {{implicit_schema()}}, which can generate the final schema for a query, though it's not exported. Maybe we "just" export that so people can get the schema that the resulting query will have. Alternatively, we could add a {{schema}} (S3) method that returns the (implicit) schema via {{schema(query_obj)}}. This might be overloading {{schema}}, though, since that is not how we retrieve schemas elsewhere (e.g. {{schema(arrow_table(mtcars))}} does not currently work). One use case: https://github.com/duckdb/duckdb/pull/3299 -- This message was sent by Atlassian Jira (v8.20.1#820001)
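A minimal version of the S3 route mentioned in the ticket might look like the following; it leans on the unexported {{implicit_schema()}} named above, and the assumption that {{schema()}} could be made a generic dispatching on class:

{code:r}
# Hypothetical sketch: an S3 method so schema(query_obj) works on queries
schema.arrow_dplyr_query <- function(x, ...) {
  implicit_schema(x)  # returns the final schema the query would produce
}
{code}

Usage would then be e.g. {{schema(dplyr::filter(ds, cyl > 4))}}, letting callers inspect the result schema without collecting.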
[jira] [Assigned] (ARROW-15656) [C++] [R] Valgrind error with C-data interface
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15656: -- Assignee: Jonathan Keane > [C++] [R] Valgrind error with C-data interface > -- > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937: R_execClosure (eval.c:1918) > ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) > ==10301== Uninitialised value was created by a heap allocation > ==10301==at 0x483E0F0: memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0x483E212: posix_memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0xF4756DF: arrow::(anonymous > namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) > (memory_pool.cc:365) > ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) > (memory_pool.cc:557) > ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) > ==10301==by 0xF041EC2: arrow::Status > GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1} const&) (memorypool.cpp:46) > ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) > (memorypool.cpp:28) > ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) > (memory_pool.cc:921) > ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) > (memory_pool.cc:945) > ==10301==by 0xF478A74: ResizePoolBuffer, > std::unique_ptr > 
(memory_pool.cc:984) > ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) > (memory_pool.cc:992) > ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) > (buffer.cc:174) > ==10301==by 0xF38CC77: arrow::(anonymous > namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > > const&, arrow::MemoryPool*, std::shared_ptr*) > (concatenate.cc:81) > ==10301== > test-dataset.R:852:3 [success] > {code} > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 > It surfaced with > https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 > Though it could be from: > https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > which added some code to make a source node from the C-Data interface. > However, the first call looks like it might be the line > https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15656) [C++] [R] Make valgrind builds slightly quicker
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Summary: [C++] [R] Make valgrind builds slightly quicker (was: [C++] [R] Valgrind error with C-data interface) > [C++] [R] Make valgrind builds slightly quicker > --- > > Key: ARROW-15656 > URL: https://issues.apache.org/jira/browse/ARROW-15656 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > This is currently failing on our valgrind nightly: > {code} > ==10301==by 0x49A2184: bcEval (eval.c:7107) > ==10301==by 0x498DBC8: Rf_eval (eval.c:748) > ==10301==by 0x4990937: R_execClosure (eval.c:1918) > ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) > ==10301== Uninitialised value was created by a heap allocation > ==10301==at 0x483E0F0: memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0x483E212: posix_memalign (in > /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10301==by 0xF4756DF: arrow::(anonymous > namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) > (memory_pool.cc:365) > ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl namespace)::SystemAllocator>::Allocate(long, unsigned char**) > (memory_pool.cc:557) > ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) > ==10301==by 0xF041EC2: arrow::Status > GcMemoryPool::GcAndTryAgain char**)::{lambda()#1}>(GcMemoryPool::Allocate(long, unsigned > char**)::{lambda()#1} const&) (memorypool.cpp:46) > ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) > (memorypool.cpp:28) > ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) > (memory_pool.cc:921) > ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) > 
(memory_pool.cc:945) > ==10301==by 0xF478A74: ResizePoolBuffer, > std::unique_ptr > (memory_pool.cc:984) > ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) > (memory_pool.cc:992) > ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) > (buffer.cc:174) > ==10301==by 0xF38CC77: arrow::(anonymous > namespace)::ConcatenateBitmaps(std::vector namespace)::Bitmap, std::allocator > > const&, arrow::MemoryPool*, std::shared_ptr*) > (concatenate.cc:81) > ==10301== > test-dataset.R:852:3 [success] > {code} > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 > It surfaced with > https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 > Though it could be from: > https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 > which added some code to make a source node from the C-Data interface. > However, the first call looks like it might be the line > https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15656) [C++] [R] Make valgrind builds slightly quicker
[ https://issues.apache.org/jira/browse/ARROW-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15656: --- Description: It looks like these specific errors have been resolved in other tickets. In the process of isolating the issue, I found that we actually were building arrow twice in the build. So I've repurposed this PR to remove the extraneous build. = This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) ==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous 
namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 It surfaced with https://github.com/apache/arrow/commit/858470d928e9ce5098da7ebb1926bb3c74dadff0 Though it could be from: https://github.com/apache/arrow/commit/b868090f0f65a2a66bb9c3d7c0f68c5af1a4dff0 which added some code to make a source node from the C-Data interface. However, the first call looks like it might be the line https://github.com/apache/arrow/blob/fa699117091917f0992225aff4e8d4c08910162a/cpp/src/arrow/compute/kernels/vector_selection.cc#L437 was: This is currently failing on our valgrind nightly: {code} ==10301==by 0x49A2184: bcEval (eval.c:7107) ==10301==by 0x498DBC8: Rf_eval (eval.c:748) ==10301==by 0x4990937: R_execClosure (eval.c:1918) ==10301==by 0x49905EA: Rf_applyClosure (eval.c:1844) ==10301== Uninitialised value was created by a heap allocation ==10301==at 0x483E0F0: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0x483E212: posix_memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==10301==by 0xF4756DF: arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) (memory_pool.cc:365) ==10301==by 0xF475859: arrow::BaseMemoryPoolImpl::Allocate(long, unsigned char**) (memory_pool.cc:557) ==10301==by 0xF04192E: GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1}::operator()() const (memorypool.cpp:28) ==10301==by 0xF041EC2: arrow::Status GcMemoryPool::GcAndTryAgain(GcMemoryPool::Allocate(long, unsigned char**)::{lambda()#1} const&) (memorypool.cpp:46) ==10301==by 0xF0419A3: GcMemoryPool::Allocate(long, unsigned char**) (memorypool.cpp:28) ==10301==by 0xF479EF7: arrow::PoolBuffer::Reserve(long) (memory_pool.cc:921) 
==10301==by 0xF479FCD: arrow::PoolBuffer::Resize(long, bool) (memory_pool.cc:945) ==10301==by 0xF478A74: ResizePoolBuffer, std::unique_ptr > (memory_pool.cc:984) ==10301==by 0xF478A74: arrow::AllocateBuffer(long, arrow::MemoryPool*) (memory_pool.cc:992) ==10301==by 0xF458BAD: arrow::AllocateBitmap(long, arrow::MemoryPool*) (buffer.cc:174) ==10301==by 0xF38CC77: arrow::(anonymous namespace)::ConcatenateBitmaps(std::vector > const&, arrow::MemoryPool*, std::shared_ptr*) (concatenate.cc:81) ==10301== test-dataset.R:852:3 [success] {code} https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=19519&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 It surfaced with https://github.com/apache/arrow/commit/
[jira] [Updated] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16007: --- Component/s: R > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. > I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions
[ https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16034: -- Assignee: Andy Teucher > [R] should bindings for grepl etc emit warnings matching those in base R > functions > -- > > Key: ARROW-16034 > URL: https://issues.apache.org/jira/browse/ARROW-16034 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Assignee: Andy Teucher >Priority: Minor > > {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit > a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in > the [PR|https://github.com/apache/arrow/pull/12711] for this > [issue|https://issues.apache.org/jira/browse/ARROW-16007], I wonder whether this > should be mimicked in the arrow bindings as well. [~jonkeane] requested I > open this here for discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001)
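For reference, the base R behaviour the ticket proposes to mirror: {{grepl()}} warns and then ignores {{ignore.case}} when {{fixed = TRUE}} is also set. A binding could mimic that with a plain {{warning()}} guard; this is a hedged sketch (the {{arrow_grepl}} name and the delegation to {{grepl}} stand in for the real binding), not arrow's implementation:

{code:r}
# Hypothetical sketch: reproduce base R's warning for conflicting arguments
arrow_grepl <- function(pattern, x, ignore.case = FALSE, fixed = FALSE) {
  if (ignore.case && fixed) {
    warning("argument 'ignore.case = TRUE' will be ignored")  # mirrors base R
    ignore.case <- FALSE
  }
  grepl(pattern, x, ignore.case = ignore.case, fixed = fixed)  # stand-in for the arrow kernel
}
{code}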
[jira] [Assigned] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16007: -- Assignee: Andy Teucher > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Assignee: Andy Teucher >Priority: Minor > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. > I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? 
> Reprex: > > {code:r} > library(arrow, warn.conflicts = FALSE, quietly = TRUE) > library(dplyr, warn.conflicts = FALSE, quietly = TRUE) > library(stringr, quietly = TRUE) > alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_)) > alpha_dataset <- InMemoryDataset$create(alpha_df) > mutate(alpha_df, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> FALSE NA > mutate(alpha_dataset, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) |> > collect() > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> NA NA > # base R grepl returns FALSE for NA > grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex > #> [1] TRUE FALSE FALSE > grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring > #> [1] TRUE FALSE FALSE > # stringr::str_detect returns NA for NA > str_detect(alpha_df$alpha, "a") > #> [1] TRUE FALSE NA > alpha_array <- Array$create(alpha_df$alpha) > # arrow functions return null for null (NA) > call_function("match_substring_regex", alpha_array, options = list(pattern = > "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > call_function("match_substring", alpha_array, options = list(pattern = "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-16007) [R] binding for grepl has different behaviour with NA compared to R base grepl
[ https://issues.apache.org/jira/browse/ARROW-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-16007. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12711 [https://github.com/apache/arrow/pull/12711] > [R] binding for grepl has different behaviour with NA compared to R base grepl > -- > > Key: ARROW-16007 > URL: https://issues.apache.org/jira/browse/ARROW-16007 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Assignee: Andy Teucher >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The arrow binding to {{grepl}} behaves slightly differently than the base R > {{grepl}}, in that it returns {{NA}} for {{NA}} inputs, whereas base > {{grepl}} returns {{FALSE}} with {{NA}} inputs. arrow's implementation is > consistent with {{stringr::str_detect()}}, and both {{str_detect()}} and > {{grepl()}} are bound to {{match_substring_regex}} and {{match_substring}} in > arrow. > I don't know if this is something you would want to change so that the > {{grepl}} behaviour aligns with base {{grepl}}, or simply document this > difference? 
> Reprex: > > {code:r} > library(arrow, warn.conflicts = FALSE, quietly = TRUE) > library(dplyr, warn.conflicts = FALSE, quietly = TRUE) > library(stringr, quietly = TRUE) > alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_)) > alpha_dataset <- InMemoryDataset$create(alpha_df) > mutate(alpha_df, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> FALSE NA > mutate(alpha_dataset, > grepl_is_a = grepl("a", alpha), > stringr_is_a = str_detect(alpha, "a")) |> > collect() > #> alpha grepl_is_a stringr_is_a > #> 1 alpha TRUE TRUE > #> 2 bet FALSE FALSE > #> 3 <NA> NA NA > # base R grepl returns FALSE for NA > grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex > #> [1] TRUE FALSE FALSE > grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring > #> [1] TRUE FALSE FALSE > # stringr::str_detect returns NA for NA > str_detect(alpha_df$alpha, "a") > #> [1] TRUE FALSE NA > alpha_array <- Array$create(alpha_df$alpha) > # arrow functions return null for null (NA) > call_function("match_substring_regex", alpha_array, options = list(pattern = > "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > call_function("match_substring", alpha_array, options = list(pattern = "a")) > #> Array > #> <bool> > #> [ > #> true, > #> false, > #> null > #> ] > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions
[ https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16034: -- Assignee: (was: Andy Teucher) > [R] should bindings for grepl etc emit warnings matching those in base R > functions > -- > > Key: ARROW-16034 > URL: https://issues.apache.org/jira/browse/ARROW-16034 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > > {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit > a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in > the [PR]([https://github.com/apache/arrow/pull/12711]) for this > [issue](https://issues.apache.org/jira/browse/ARROW-16007), should this > be mimicked in the arrow bindings as well? [~jonkeane] requested I > open this here for discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-16034) [R] should bindings for grepl etc emit warnings matching those in base R functions
[ https://issues.apache.org/jira/browse/ARROW-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512645#comment-17512645 ] Jonathan Keane commented on ARROW-16034: This is definitely nice to give folks a heads up that one of the arguments is being ignored. Looking at the implementation for {{grepl}}, it wouldn't take much overhead at all to add a warning when those two arguments are given. I would be in favor of doing this to be extra clear and helpful. I'm curious if [~icook] has any reason this didn't or wouldn't work from the original implementation? > [R] should bindings for grepl etc emit warnings matching those in base R > functions > -- > > Key: ARROW-16034 > URL: https://issues.apache.org/jira/browse/ARROW-16034 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Andy Teucher >Priority: Minor > > {{grepl}}, {{sub}}, and {{gsub}} (and perhaps others) in base R emit > a warning when {{ignore.case = TRUE}} and {{fixed = TRUE}}. As raised in > the [PR]([https://github.com/apache/arrow/pull/12711]) for this > [issue](https://issues.apache.org/jira/browse/ARROW-16007), should this > be mimicked in the arrow bindings as well? [~jonkeane] requested I > open this here for discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16047) [C++] Option for match_substring* to return NULL on NULL input
Jonathan Keane created ARROW-16047: -- Summary: [C++] Option for match_substring* to return NULL on NULL input Key: ARROW-16047 URL: https://issues.apache.org/jira/browse/ARROW-16047 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Jonathan Keane We implemented this in the R bindings as a wrapper in ARROW-16007 (and there's even a start of an implementation at https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f provided by [~ateucher]). This should be done for {{match_substring}} and {{match_substring_regex}} (any others?) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15857) [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency)
[ https://issues.apache.org/jira/browse/ARROW-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513469#comment-17513469 ] Jonathan Keane commented on ARROW-15857: OH OH OH yes, of course it's that! Now I feel silly I didn't catch that: I saw in the logs that it's a linking issue, but didn't connect that through to {{LDFLAGS}}. Thanks for the catch! > [R] rhub/fedora-clang-devel fails to install 'sass' (rmarkdown dependency) > -- > > Key: ARROW-15857 > URL: https://issues.apache.org/jira/browse/ARROW-15857 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Starting 2022-03-03, we get a failure on the rhub/fedora-clang-devel nightly > build. It seems to be a linking error but nothing in the sass package seems > to have changed for some time (last update May 2021). > https://github.com/ursacomputing/crossbow/runs/5444005154?check_suite_focus=true#step:5:3007 > Build log for the sass package: > {noformat} > #14 1099.2 make[1]: Entering directory > '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src' > #14 1099.2 /opt/R-devel/lib64/R/share/make/shlib.mk:18: warning: overriding > recipe for target 'shlib-clean' > #14 1099.2 Makevars:12: warning: ignoring old recipe for target 'shlib-clean' > #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG > -I./libsass/include -I/usr/local/include -fpic -g -O2 -c compile.c -o > compile.o > #14 1099.2 /usr/bin/clang -I"/opt/R-devel/lib64/R/include" -DNDEBUG > -I./libsass/include -I/usr/local/include -fpic -g -O2 -c init.c -o init.o > #14 1099.2 MAKEFLAGS= CC="/usr/bin/clang" CFLAGS="-g -O2 " > CXX="/usr/bin/clang++ -std=gnu++14 -stdlib=libc++" AR="ar" > LDFLAGS="-L/usr/local/lib64" make -C libsass > #14 1099.2 make[2]: Entering directory > '/tmp/RtmpvEMraB/R.INSTALL555d42b8f18e/sass/src/libsass' > #14 1099.2 /usr/bin/clang -g -O2 -O2 -I ./include 
-fPIC -c -o src/cencode.o > src/cencode.c > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast.o src/ast.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_values.o src/ast_values.cpp > #14 1099.2 src/ast_values.cpp:484:23: warning: loop variable 'numerator' > creates a copy from type 'const std::__1::basic_string<char, > std::__1::char_traits<char>, std::__1::allocator<char>>' > [-Wrange-loop-construct] > #14 1099.2 for (const auto numerator : numerators) > #14 1099.2 ^ > #14 1099.2 src/ast_values.cpp:484:12: note: use reference type 'const > std::__1::basic_string<char, std::__1::char_traits<char>, > std::__1::allocator<char>> &' to prevent copying > #14 1099.2 for (const auto numerator : numerators) > #14 1099.2 ^~ > #14 1099.2 & > #14 1099.2 src/ast_values.cpp:486:23: warning: loop variable 'denominator' > creates a copy from type 'const std::__1::basic_string<char, > std::__1::char_traits<char>, std::__1::allocator<char>>' > [-Wrange-loop-construct] > #14 1099.2 for (const auto denominator : denominators) > #14 1099.2 ^ > #14 1099.2 src/ast_values.cpp:486:12: note: use reference type 'const > std::__1::basic_string<char, std::__1::char_traits<char>, > std::__1::allocator<char>> &' to prevent copying > #14 1099.2 for (const auto denominator : denominators) > #14 1099.2 ^~~~ > #14 1099.2 & > #14 1099.2 2 warnings generated. 
> #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_supports.o src/ast_supports.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_cmp.o src/ast_sel_cmp.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_unify.o src/ast_sel_unify.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_super.o src/ast_sel_super.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_sel_weave.o src/ast_sel_weave.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/ast_selectors.o src/ast_selectors.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/context.o src/context.cpp > #14 1099.2 /usr/bin/clang++ -std=gnu++14 -stdlib=libc++ -Wall -O2 -std=c++11 > -I ./include -fPIC -c -o src/constants.o src/constants.cpp > #14 1099.2 /usr/bin/clang
[jira] [Resolved] (ARROW-15814) [R] Improve documentation for cast()
[ https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15814. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12546 [https://github.com/apache/arrow/pull/12546] > [R] Improve documentation for cast() > > > Key: ARROW-15814 > URL: https://issues.apache.org/jira/browse/ARROW-15814 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Originated in the > [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465] > for ARROW-14820. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15814) [R] Improve documentation for cast()
[ https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15814: -- Assignee: Jonathan Keane > [R] Improve documentation for cast() > > > Key: ARROW-15814 > URL: https://issues.apache.org/jira/browse/ARROW-15814 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Jonathan Keane >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Originated in the > [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465] > for ARROW-14820. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15814) [R] Improve documentation for cast()
[ https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15814: -- Assignee: SHIMA Tatsuya (was: Jonathan Keane) > [R] Improve documentation for cast() > > > Key: ARROW-15814 > URL: https://issues.apache.org/jira/browse/ARROW-15814 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: SHIMA Tatsuya >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Originated in the > [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465] > for ARROW-14820. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-16047) [C++] Option for match_substring* to return NULL on NULL input
[ https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513539#comment-17513539 ] Jonathan Keane commented on ARROW-16047: I created this ticket as a (possible) follow-on and for discussion about how widespread this behavior is, and to see if it's feasible to do this in C++. If you're still interested in digging more into the C++ code here, feel free to assign this ticket + send a PR [~ateucher] (but no pressure!). > [C++] Option for match_substring* to return NULL on NULL input > -- > > Key: ARROW-16047 > URL: https://issues.apache.org/jira/browse/ARROW-16047 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Priority: Major > > We implemented this in the R bindings as a wrapper in ARROW-16007 (and > there's even a start of an implementation at > https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f > provided by [~ateucher]) > This should be done for {{match_substring}} and {{match_substring_regex}} > (any others?) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-16047) [C++] Option for match_substring* to return false on NULL input
[ https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513650#comment-17513650 ] Jonathan Keane commented on ARROW-16047: Ah yes, of course. I've updated the title there. > [C++] Option for match_substring* to return false on NULL input > --- > > Key: ARROW-16047 > URL: https://issues.apache.org/jira/browse/ARROW-16047 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Priority: Major > > We implemented this in the R bindings as a wrapper in ARROW-16007 (and > there's even a start of an implementation at > https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f > provided by [~ateucher]) > This should be done for {{match_substring}} and {{match_substring_regex}} > (any others?) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-16047) [C++] Option for match_substring* to return false on NULL input
[ https://issues.apache.org/jira/browse/ARROW-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16047: --- Summary: [C++] Option for match_substring* to return false on NULL input (was: [C++] Option for match_substring* to return NULL on NULL input) > [C++] Option for match_substring* to return false on NULL input > --- > > Key: ARROW-16047 > URL: https://issues.apache.org/jira/browse/ARROW-16047 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Jonathan Keane >Priority: Major > > We implemented this in the R bindings as a wrapper in ARROW-16007 (and > there's even a start of an implementation at > https://github.com/apache/arrow/commit/f445450aeeaeee96b9eeba5549c36c5229f18a8f > provided by [~ateucher]) > This should be done for {{match_substring}} and {{match_substring_regex}} > (any others?) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16052) [R] undefined global function %>%
Jonathan Keane created ARROW-16052: -- Summary: [R] undefined global function %>% Key: ARROW-16052 URL: https://issues.apache.org/jira/browse/ARROW-16052 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)