[jira] [Created] (ARROW-17886) [R] Convert schema to the corresponding ptype (zero-row data frame)?
Kirill Müller created ARROW-17886:

Summary: [R] Convert schema to the corresponding ptype (zero-row data frame)?
Key: ARROW-17886
URL: https://issues.apache.org/jira/browse/ARROW-17886
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Kirill Müller

When fetching data e.g. from a RecordBatchReader, I would like to know, ahead of time, what the data will look like after it's converted to a data frame. I have found a way using utils::head(0), but I'm not sure if it's efficient in all scenarios.

My use case is the Arrow extension to DBI, in particular the default implementation for drivers that don't speak Arrow yet. I'd like to know which types the columns should have on the database. I can already infer this from the corresponding R types, but those existing drivers don't know about Arrow types.

Should we support as.data.frame() for schema objects? The semantics would be to return a zero-row data frame with correct column names and types.

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

tibble::as_tibble(head(rbr, 0))
#> # A tibble: 0 × 4
#> # … with 4 variables: a , b , c , d
rbr$read_table()
#> Table
#> 3 rows x 4 columns
#> $a
#> $b
#> $c
#> $d <>
#>
#> See $metadata for additional Schema metadata

--
This message was sent by Atlassian Jira (v8.20.10#820010)
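The "zero-row data frame with correct column names and types" semantics requested in ARROW-17886 can be illustrated with base R alone. The following is only a sketch of the desired result shape, not of any arrow API: it does not touch a schema or a RecordBatchReader, and the variable name `ptype` is an illustrative assumption.

```r
# Illustrative only: base R, no arrow involved.
data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)

# Dropping all rows keeps the column names and the column types.
# `ptype` is a hypothetical name for the resulting prototype.
ptype <- data[0, , drop = FALSE]

nrow(ptype)                          # number of rows in the prototype
names(ptype)                         # column names are preserved
vapply(ptype, class, character(1))   # column classes are preserved
```

A schema-based equivalent would presumably return the same kind of object without reading any rows, which is what the issue asks about.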
[jira] [Created] (ARROW-17885) Return BLOB data as list of raw instead of a list of integers
Kirill Müller created ARROW-17885:

Summary: Return BLOB data as list of raw instead of a list of integers
Key: ARROW-17885
URL: https://issues.apache.org/jira/browse/ARROW-17885
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 10.0.0, 9.0.1
Environment: macOS, R 4.1.3
Reporter: Kirill Müller

BLOBs should be mapped to lists of raw in R, not lists of integer. Tested with ec714db3995549309b987fc8112db98bb93102d0.

``` r
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

data <- data.frame(
  a = 1:3,
  b = 2.5,
  c = "three",
  stringsAsFactors = FALSE
)
data$d <- blob::blob(as.raw(1:10))

tbl <- arrow::as_arrow_table(data)
rbr <- arrow::as_record_batch_reader(tbl)

waldo::compare(as.data.frame(rbr$read_next_batch()), data)
#> `old$d[[1]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[1]]` is a raw vector (01, 02, 03, 04, 05, ...)
#>
#> `old$d[[2]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[2]]` is a raw vector (01, 02, 03, 04, 05, ...)
#>
#> `old$d[[3]]` is an integer vector (1, 2, 3, 4, 5, ...)
#> `new$d[[3]]` is a raw vector (01, 02, 03, 04, 05, ...)
```

Created on 2022-09-29 with [reprex v2.0.2](https://reprex.tidyverse.org)
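Until the conversion reported in ARROW-17885 is changed, a caller could convert the integer vectors back to raw on the R side. The snippet below is a hedged sketch of such a workaround, not a fix in arrow itself: `ints` is a hypothetical stand-in for the `d` column as it is currently returned, and the conversion assumes every value lies in the byte range 0 to 255.

```r
# Hypothetical stand-in for the list-of-integer column from the report.
ints <- list(1:10, 1:10, 1:10)

# Convert each element back to a raw vector.
# Assumption: all values are valid bytes (0..255); as.raw() turns
# out-of-range values into 00 and emits a warning.
raws <- lapply(ints, as.raw)

# Compare against the raw vector used to build the original blob column.
identical(raws[[1]], as.raw(1:10))
```

This only restores the element type; wrapping the result in a blob column again is left out of the sketch.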
[jira] [Created] (ARROW-17884) Add Intel®-IAA/QPL-based Parquet RLE Decode
zhaoyaqi created ARROW-17884:

Summary: Add Intel®-IAA/QPL-based Parquet RLE Decode
Key: ARROW-17884
URL: https://issues.apache.org/jira/browse/ARROW-17884
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: zhaoyaqi

Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator available in the upcoming generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). Its goal is to speed up common operations in analytics like data (de)compression and filtering. It supports decoding of the Parquet RLE format.

We add a new codec that utilizes the Intel® IAA offloading technology to provide a high-performance RLE decode implementation. The codec uses the [Intel® Query Processing Library (QPL)|https://github.com/intel/qpl], which abstracts access to the hardware accelerator. The new solution generally provides higher performance than the current solution and also consumes less CPU.
[jira] [Created] (ARROW-17883) [Java] Implement an immutable table object
Larry White created ARROW-17883:

Summary: [Java] Implement an immutable table object
Key: ARROW-17883
URL: https://issues.apache.org/jira/browse/ARROW-17883
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Affects Versions: 10.0.0
Reporter: Larry White

Implement an immutable Table object without the batch semantics provided by VectorSchemaRoot.

See original design document/discussion here: https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing

Note that this ticket covers only the immutable Table implementation.
[jira] [Created] (ARROW-17882) [Java][Doc] Document build & use of new artifact on Windows environment
David Dali Susanibar Arce created ARROW-17882:

Summary: [Java][Doc] Document build & use of new artifact on Windows environment
Key: ARROW-17882
URL: https://issues.apache.org/jira/browse/ARROW-17882
Project: Apache Arrow
Issue Type: Sub-task
Components: Documentation, Java
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce

* Update build documentation with new Windows JNI DLL support
* Update use documentation with new Windows JNI DLL support
[jira] [Created] (ARROW-17881) [C++] Not able to build the project with the latest commit of the master branch
Anirudh Acharya created ARROW-17881:

Summary: [C++] Not able to build the project with the latest commit of the master branch
Key: ARROW-17881
URL: https://issues.apache.org/jira/browse/ARROW-17881
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Anirudh Acharya

I am trying to build the Arrow C++ project with the latest commit (9af43f11b) from the master branch using this guide: [https://arrow.apache.org/docs/developers/cpp/building.html]

But the build fails with the following error:

{code:java}
[ 58%] Linking CXX executable ../../debug/arrow-array-test
Undefined symbols for architecture x86_64:
  "testing::Matcher > const&>::Matcher(char const*)", referenced from:
      testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_test.cc.o
      testing::Matcher > const&> testing::internal::MatcherCastImpl > const&, char const*>::CastImpl(char const* const&, std::__1::integral_constant, std::__1::integral_constant) in array_binary_test.cc.o
ld: symbol(s) not found for architecture x86_64
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:207: debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1653: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs
[ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_test.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/types.cc.o
[ 58%] Building CXX object src/arrow/CMakeFiles/arrow-table-test.dir/table_builder_test.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_comparison_avx2.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/level_conversion_bmi2.cc.o
[ 58%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/encryption_internal.cc.o
[ 59%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/crypto_factory.cc.o
[ 60%] Linking CXX executable ../../debug/arrow-table-test
[ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_unwrapper.cc.o
[ 60%] Built target arrow-table-test
[ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/file_key_wrapper.cc.o
[ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/kms_client.cc.o
[ 60%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_material.cc.o
[ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_metadata.cc.o
[ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit.cc.o
[ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/key_toolkit_internal.cc.o
[ 61%] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encryption/local_wrap_kms_client.cc.o
[ 61%] Built target parquet_objlib
make: *** [Makefile:146: all] Error 2
{code}

I am compiling this on macOS Monterey Version 12.0.1, and the versions of GCC, Python, and clang are as follows:

{code:java}
$ clang --version
clang version 14.0.4
Target: x86_64-apple-darwin21.1.0
Thread model: posix
InstalledDir: /Users/anirudhacharya/miniconda3/envs/pyarrow-dev/bin

$ python --version
Python 3.9.13

$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: x86_64-apple-darwin21.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
{code}

I see that nightly job failures for macOS were reported on the mailing list: [https://lists.apache.org/thread/rrdwxw1st4vdcf3nh5nqfo16n3ymj90x]

I am not sure whether they are related to the issue I am reporting.
[jira] [Created] (ARROW-17880) Add support for Decimal types in go/arrow/csv
Mitchell Devenport created ARROW-17880:

Summary: Add support for Decimal types in go/arrow/csv
Key: ARROW-17880
URL: https://issues.apache.org/jira/browse/ARROW-17880
Project: Apache Arrow
Issue Type: Improvement
Components: Go
Reporter: Mitchell Devenport

The Go CSV library lacks support for Decimal types, which are supported by the C++ CSV library:

[arrow/writer.cc at master · apache/arrow (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L378]

[arrow/type_traits.h at master · apache/arrow (github.com)|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type_traits.h#L642]
[jira] [Created] (ARROW-17879) [R] Intermittent memory leaks in the valgrind nightly test
Dewey Dunnington created ARROW-17879:

Summary: [R] Intermittent memory leaks in the valgrind nightly test
Key: ARROW-17879
URL: https://issues.apache.org/jira/browse/ARROW-17879
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Dewey Dunnington
Fix For: 10.0.0

The memory leaks that were fixed by a workaround before the last release (ARROW-17252) are present again. I had hoped that the improvements to the captured R thread infrastructure in ARROW-11841 and ARROW-17178 would fix this; however, they don't (and it's not even clear that the failures are related to that, since as part of diagnosing those failures the last time I disabled the safe call infrastructure completely and was still able to observe failures).

These failures need to be debugged before the release!
[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
Andy Grove created ARROW-17878:

Summary: [Website] Exclude Ballista docs from being deleted
Key: ARROW-17878
URL: https://issues.apache.org/jira/browse/ARROW-17878
Project: Apache Arrow
Issue Type: Improvement
Components: Website
Reporter: Andy Grove

Exclude Ballista docs from being deleted
[jira] [Created] (ARROW-17877) [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h
Raúl Cumplido created ARROW-17877:

Summary: [CI][Python] verify-rc python nightly builds fail due to missing arrow/csv/api.h
Key: ARROW-17877
URL: https://issues.apache.org/jira/browse/ARROW-17877
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration, Python
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido

Some of our nightly builds are failing with:

{code:java}
[ 35%] Building CXX object CMakeFiles/_dataset.dir/_dataset.cpp.o
/arrow/python/build/temp.linux-x86_64-cpython-38/_dataset.cpp:833:10: fatal error: arrow/csv/api.h: No such file or directory
 #include "arrow/csv/api.h"
          ^
compilation terminated.{code}

I suspect that the changes here, which made the flags include CSV=ON when building with PYTHON=ON, might be related:

[https://github.com/apache/arrow/commit/53ac2a00aa9ff199773513f6f996f73a07b37989]

Example of nightly failures:

https://github.com/ursacomputing/crossbow/actions/runs/3135833175/jobs/5091988801
[jira] [Created] (ARROW-17876) [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries
Jacob Wujciak-Jens created ARROW-17876:

Summary: [R][CI] Remove ubuntu-18.04 from nixlibs & prebuilt binaries
Key: ARROW-17876
URL: https://issues.apache.org/jira/browse/ARROW-17876
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, R
Reporter: Jacob Wujciak-Jens
Fix For: 10.0.0

The new dts compiled centos-7 binaries ([ARROW-17594]) should be able to replace the ubuntu-18.04 binaries.
[jira] [Created] (ARROW-17875) [C++] Remove assorted pre-C++17 compatibility measures
Antoine Pitrou created ARROW-17875:

Summary: [C++] Remove assorted pre-C++17 compatibility measures
Key: ARROW-17875
URL: https://issues.apache.org/jira/browse/ARROW-17875
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Some assorted pre-C++17 compatibility measures remain in the code base.
[jira] [Created] (ARROW-17874) [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1
Alenka Frim created ARROW-17874:

Summary: [Archery] C++ linting with --clang-format or archery lint --clang-tidy fails on M1
Key: ARROW-17874
URL: https://issues.apache.org/jira/browse/ARROW-17874
Project: Apache Arrow
Issue Type: Bug
Components: Archery
Reporter: Alenka Frim

It seems there is some cmake target issue for {{clang-format}} and {{clang-tidy}} options when running {{archery lint}} on M1:

{code:java}
...
-- Build files have been written to: /private/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1mgn/T/arrow-lint-g7drna9_/cpp-build
ninja: error: unknown target 'check-format'
{code}

[https://gist.github.com/AlenkaF/f60e24549529cd096bc9c975bcb71179]
[jira] [Created] (ARROW-17873) Writing Arrow Files using C#.
N Gautam Animesh created ARROW-17873:

Summary: Writing Arrow Files using C#.
Key: ARROW-17873
URL: https://issues.apache.org/jira/browse/ARROW-17873
Project: Apache Arrow
Issue Type: Improvement
Reporter: N Gautam Animesh

I was working with Arrow in C# and wanted to know whether there is a way to write to an Arrow file using C#. Please let me know if there is anything regarding this; I was not able to find anything on the internet.
[jira] [Created] (ARROW-17872) [CI] Cache dependencies on macOS builds
Antoine Pitrou created ARROW-17872:

Summary: [CI] Cache dependencies on macOS builds
Key: ARROW-17872
URL: https://issues.apache.org/jira/browse/ARROW-17872
Project: Apache Arrow
Issue Type: Wish
Components: C++, Continuous Integration, GLib, Python
Reporter: Antoine Pitrou

Our macOS CI builds on GitHub Actions usually spend at least 10 minutes installing dependencies from Homebrew (because of compiling from source?). It would be nice to cache those, especially as they probably don't change often.