[jira] [Created] (ARROW-16297) [R] Improve detection of ARROW_*_URL variables for offline build
Karl Dunkle Werner created ARROW-16297: -- Summary: [R] Improve detection of ARROW_*_URL variables for offline build Key: ARROW-16297 URL: https://issues.apache.org/jira/browse/ARROW-16297 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Environment: *nix offline builds Reporter: Karl Dunkle Werner Assignee: Karl Dunkle Werner As [~npr] mentioned in [https://github.com/apache/arrow/pull/12849#issuecomment-1101489333], the current code in {{nixlibs.R}} doesn't handle components that have multiple words (because of the way it parses variable names from filenames). Until now, we've had a special case for the AWS variables, but {{ARROW_GOOGLE_CLOUD_CPP_URL}} and {{ARROW_NLOHMANN_JSON_URL}} also need handling. Instead of adding special cases, we can provide the correct {{ARROW_*_URL}} values with the new bash script added as part of ARROW-15092. I'll add a PR. -- This message was sent by Atlassian Jira (v8.20.7#820007)
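The multi-word parsing problem described in ARROW-16297 can be illustrated with a minimal Python sketch. This is not the logic in {{nixlibs.R}}; both function names are hypothetical, the filename (including its version number) is made up for illustration, and the sketch assumes filenames of the form component-version.extension.

```python
import re


def naive_url_var(filename: str) -> str:
    """Hypothetical naive parser: keep only the text before the first hyphen."""
    component = filename.split("-", 1)[0]
    return f"ARROW_{component.upper()}_URL"


def multiword_url_var(filename: str) -> str:
    """Hypothetical parser that keeps every word before the version.

    Assumption: the version starts at the first hyphen that is followed by
    an optional 'v' and a digit.
    """
    match = re.match(r"^(.*?)-v?\d", filename)
    component = match.group(1) if match else filename
    # Replace any run of non-alphanumeric characters with one underscore.
    return "ARROW_" + re.sub(r"[^A-Za-z0-9]+", "_", component).upper() + "_URL"


# Made-up filename; the version number is only for illustration.
example = "google-cloud-cpp-v1.0.0.tar.gz"
naive = naive_url_var(example)
fixed = multiword_url_var(example)
print(naive)
print(fixed)
```

The naive version drops every word after the first, which is why multi-word components would need special cases under that approach.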
[jira] [Created] (ARROW-14210) [C++] CMAKE_AR is not passed to bzip2 thirdparty dependency
Karl Dunkle Werner created ARROW-14210: -- Summary: [C++] CMAKE_AR is not passed to bzip2 thirdparty dependency Key: ARROW-14210 URL: https://issues.apache.org/jira/browse/ARROW-14210 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 5.0.0 Reporter: Karl Dunkle Werner It seems like the {{AR}} or {{CMAKE_AR}} variables aren't getting passed for the bzip2 build, which causes it to fail if we're doing a {{BUNDLED}} build and {{ar}} isn't available in the {{$PATH}} (e.g. in a conda environment). To replicate: 1. Download Arrow and start an interactive shell in a container (docker should be fine if you prefer it to podman) {code:sh} git clone --depth 1 g...@github.com:apache/arrow.git podman run -it --rm -v ./arrow:/arrow:Z docker://ursalab/amd64-ubuntu-18.04-conda-python-3.6:worker bash {code} 2. Build Arrow by running this in the container: {code:sh} export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX export ARROW_HOME=$CONDA_PREFIX export PARQUET_HOME=$CONDA_PREFIX cd /arrow mkdir -p cpp/build pushd cpp/build cmake \ -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ -DCMAKE_AR=${AR} \ -DCMAKE_RANLIB=${RANLIB} \ -DARROW_WITH_BZ2=ON \ -DARROW_VERBOSE_THIRDPARTY_BUILD=ON \ -DARROW_JEMALLOC=OFF \ -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE \ -DARROW_DEPENDENCY_SOURCE=BUNDLED \ .. make # make[3]: ar: No such file or directory # make[3]: *** [Makefile:48: libbz2.a] Error 127 # make[2]: *** [CMakeFiles/bzip2_ep.dir/build.make:135: bzip2_ep-prefix/src/bzip2_ep-stamp/bzip2_ep-build] Error 2 # make[1]: *** [CMakeFiles/Makefile2:726: CMakeFiles/bzip2_ep.dir/all] Error 2 {code} In the cmake call above, {{ARROW_JEMALLOC}} and the SIMD flags are just to skip compiling irrelevant things. I think this line in {{ThirdpartyToolchain.cmake}} needs to be changed to pass {{CMAKE_AR}}. 
[https://github.com/apache/arrow/blob/bad8824d5cda0fd8337c7167729c49af868f93a5/cpp/cmake_modules/ThirdpartyToolchain.cmake#L2211] Other related issues have also needed to pass {{CMAKE_RANLIB}}, in addition to {{CMAKE_AR}}. I'm not sure if that applies here. Related: ARROW-4471, ARROW-4831 -- This message was sent by Atlassian Jira (v8.3.4#803005)
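As a quick diagnostic for the situation in ARROW-14210, the following Python sketch reports whether {{ar}} and {{ranlib}} can be found through the {{AR}}/{{RANLIB}} environment variables or only through {{$PATH}}. The helper name {{find_tool}} is hypothetical and is not part of Arrow's build system. It only shows which tools a sub-build that calls plain {{ar}} would be able to see.

```python
import os
import shutil


def find_tool(name, env_var):
    """Return (source, path) for a tool.

    Hypothetical helper: prefers the environment variable (e.g. AR), then
    falls back to searching PATH. Returns (None, None) if nothing is found.
    """
    explicit = os.environ.get(env_var)
    if explicit:
        return ("env", explicit)
    on_path = shutil.which(name)
    if on_path:
        return ("path", on_path)
    return (None, None)


for tool, var in (("ar", "AR"), ("ranlib", "RANLIB")):
    source, location = find_tool(tool, var)
    print(f"{tool}: source={source} location={location}")
```

If a tool is reported with source "env" but would not be found on the path, a bundled dependency that ignores {{CMAKE_AR}} and calls {{ar}} directly can be expected to fail as shown in the log above.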
[jira] [Created] (ARROW-13787) Verify third-party downloads
Karl Dunkle Werner created ARROW-13787: -- Summary: Verify third-party downloads Key: ARROW-13787 URL: https://issues.apache.org/jira/browse/ARROW-13787 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging Affects Versions: 5.0.0 Reporter: Karl Dunkle Werner Assignee: Karl Dunkle Werner I think it might be helpful to have cmake use an SHA256 hash to verify the third-party files it downloads. I can submit a PR for this. Upsides: - Downloads are further verified for integrity (in addition to the verification from https) - cmake stops complaining about missing verification (when {{ARROW_VERBOSE_THIRDPARTY_BUILD=ON}}) Downside: - Slightly more work in the future to add or update a third-party dependency. The [cmake docs|https://cmake.org/cmake/help/latest/module/ExternalProject.html] note: {quote}Specifying [URL_HASH] is strongly recommended for URL downloads, as it ensures the integrity of the downloaded content. It is also used as a check for a previously downloaded file, allowing connection to the remote location to be avoided altogether if the local directory already has a file from an earlier download that matches the specified hash. {quote} SHA256 was introduced in [cmake 2.8.7|https://blog.kitware.com/cmake-2-8-7-now-available/], released in late 2011. -- This message was sent by Atlassian Jira (v8.3.4#803005)
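The kind of check proposed in ARROW-13787 can be sketched in Python with the standard {{hashlib}} module. This is only an illustration of SHA256 verification, not how cmake implements {{URL_HASH}}; the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file at `path` matches the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would compare each downloaded archive against a digest recorded alongside its version, and refuse to build if {{verify_download}} returns False.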
[jira] [Created] (ARROW-13776) [C++] Offline thirdparty versions.txt is missing extensions for some files
Karl Dunkle Werner created ARROW-13776: -- Summary: [C++] Offline thirdparty versions.txt is missing extensions for some files Key: ARROW-13776 URL: https://issues.apache.org/jira/browse/ARROW-13776 Project: Apache Arrow Issue Type: Bug Components: C++, Packaging Affects Versions: 5.0.0 Reporter: Karl Dunkle Werner Assignee: Karl Dunkle Werner The file {{cpp/thirdparty/versions.txt}} lists third-party dependencies and the filenames that {{download_dependencies.sh}} should use to save them. A couple of those filenames, for {{aws-checksums}} and {{aws-c-event-stream}}, are missing extensions. When I try to use those files, e.g. with {{$ARROW_AWS_CHECKSUMS_URL}}, cmake fails with an error: {noformat} CMake Error at /usr/share/cmake/Modules/ExternalProject.cmake:1561 (message): error: do not know how to extract '/tmp/RtmpuzmuVM/R.INSTALL3f194a9055a6/arrow/arrow-thirdparty/aws-c-event-stream-v0.1.5' -- known types are .7z, .tar, .tar.bz2, .tar.gz, .tar.xz, .tbz2, .tgz, .txz and .zip {noformat} This error is fixed if I manually add {{.tar.gz}} to {{aws-checksums-v0.1.10}} and {{aws-c-event-stream-v0.1.5}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
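A check for the problem in ARROW-13776 could look like the following Python sketch. The suffix list is copied from the cmake error message above; the function names are hypothetical and the sketch does not parse {{versions.txt}} itself.

```python
# Archive types listed in the cmake error message quoted in the issue.
KNOWN_ARCHIVE_SUFFIXES = (
    ".7z", ".tar", ".tar.bz2", ".tar.gz", ".tar.xz",
    ".tbz2", ".tgz", ".txz", ".zip",
)


def has_known_archive_suffix(filename):
    """Return True if the filename ends with one of the known archive types."""
    return filename.lower().endswith(KNOWN_ARCHIVE_SUFFIXES)


def missing_extensions(filenames):
    """Return the filenames that cmake would not know how to extract."""
    return [name for name in filenames if not has_known_archive_suffix(name)]


names = [
    "aws-checksums-v0.1.10",
    "aws-c-event-stream-v0.1.5",
    "aws-c-event-stream-v0.1.5.tar.gz",
]
print(missing_extensions(names))
```

Run over the filenames from the issue, the first two are flagged and the one with {{.tar.gz}} appended is accepted.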
[jira] [Created] (ARROW-13768) [R] Allow JSON to be an optional component
Karl Dunkle Werner created ARROW-13768: -- Summary: [R] Allow JSON to be an optional component Key: ARROW-13768 URL: https://issues.apache.org/jira/browse/ARROW-13768 Project: Apache Arrow Issue Type: Task Components: R Affects Versions: 5.0.0 Reporter: Karl Dunkle Werner JSON support requires RapidJSON, a third-party dependency that might not always be available. Particularly for offline static builds (ARROW-12981), it would be nice to allow {{ARROW_JSON=OFF}}. Here's the [relevant section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L290-L292] of {{ThirdpartyToolchain.cmake}}: {code:none} if(ARROW_JSON) set(ARROW_WITH_RAPIDJSON ON) endif() {code} And the [relevant section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/r/inst/build_arrow_static.sh#L62] of the {{build_arrow_static.sh}} script. As Neal [mentioned|https://github.com/apache/arrow/pull/11001#discussion_r696723923], there's more to do than just replacing {{-DARROW_JSON=ON}} with {{-DARROW_JSON=$\{ARROW_JSON:-ON}}}. "We'll have to conditionally build some of the bindings like we do with dataset and parquet, and we'll have to conditionally skip tests." -- This message was sent by Atlassian Jira (v8.3.4#803005)
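For readers less familiar with the shell expansion mentioned in ARROW-13768, here is a small Python sketch of what the default-value form (variable name, then ":-", then the default) does: it uses the default when the variable is unset or empty. The helper name {{env_or_default}} is hypothetical; this only mirrors the expansion and does not cover the conditional bindings and tests that Neal describes.

```python
import os


def env_or_default(name, default, environ=None):
    """Mimic the shell's ':-' default expansion.

    Falls back to `default` when the variable is unset OR set to an empty
    string. `environ` defaults to the real process environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    return value if value else default


# Build the cmake flag the way build_arrow_static.sh would after the change.
cmake_flag = "-DARROW_JSON=" + env_or_default("ARROW_JSON", "ON")
print(cmake_flag)
```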
[jira] [Created] (ARROW-12981) Wish: Install source package from CRAN alone
Karl Dunkle Werner created ARROW-12981: -- Summary: Wish: Install source package from CRAN alone Key: ARROW-12981 URL: https://issues.apache.org/jira/browse/ARROW-12981 Project: Apache Arrow Issue Type: Wish Components: Packaging, R Affects Versions: 4.0.1 Environment: Linux Reporter: Karl Dunkle Werner Hello, I would like to install {{Arrow}} on Linux using only CRAN, without downloading additional files from Github, Apache, or Ursa Labs. I understand this is a big ask, and might not be a priority for you all. Feel free to close if you feel that this is out of scope. Why is a CRAN-only installation useful? # It's common for organizations to set up firewalls that prevent arbitrary downloads, but allow access to their own internal CRAN mirror. ** Sometimes these firewalls also allow requests to Github, but often not. # On a broader level, my favorite thing about R is CRAN, the CRAN maintainers, and their [policy|https://cran.r-project.org/web/packages/policies.html#Source-packages] that "Source packages may not contain any form of binary executable code." By distributing most of the Arrow code separately (either as source C++ or a compiled library), automated code archives and other source-based tools become much less useful. Of course, {{arrow}} isn't the only R package to depend on external libraries or distribute code separately. If a CRAN-only approach isn't viable, it would still be useful to have an all-offline method. I'm also having trouble getting an offline install to work, even with a local copy of the Arrow repo. (See the bottom of the script below.) What does installing offline look like now? Here's a bash script that approximates installing behind a firewall. 
{code:sh} git clone --depth 1 g...@github.com:apache/arrow.git test_arrow cd test_arrow wget 'https://cran.r-project.org/src/contrib/arrow_4.0.1.tar.gz' # Set up a temporary R library (optional) mkdir test_r_lib export R_LIBS_USER=test_r_lib export ARROW_R_DEV=true export LIBARROW_MINIMAL=false export LIBARROW_DOWNLOAD=false export LIBARROW_BINARY=false export LIBARROW_BUILD=true # These are all of the direct dependencies, including Suggests # This isn't required if the packages are already installed Rscript -e "install.packages(c('assertthat', 'bit64', 'purrr', 'R6', 'rlang', 'tidyselect', 'vctrs', 'cpp11', 'decor', 'distro', 'dplyr', 'hms', 'knitr', 'lubridate', 'pkgload', 'reticulate', 'rmarkdown', 'stringr', 'testthat', 'tibble', 'withr'))" # Disable your internet connection here. # Now try to install the R package we downloaded with wget. # This is an approximation of being behind a firewall. Rscript -e 'install.packages("arrow_4.0.1.tar.gz", repos=NULL)' # It successfully installs the R component, but not the C++ library, # even with LIBARROW_BUILD=true Rscript -e "arrow::arrow_available()" # [1] FALSE # As mentioned in the installation vignette, # we can R CMD INSTALL in the git repo. R CMD INSTALL r # This will try to build the C++ library, but fails when mimalloc and # jemalloc can't be downloaded from Github. # (Seems not to be affected by LIBARROW_DOWNLOAD=false). # When C++ compilation fails, the R component still installs. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12853) [R] Install fails on R 4.1.0
Karl Dunkle Werner created ARROW-12853: -- Summary: [R] Install fails on R 4.1.0 Key: ARROW-12853 URL: https://issues.apache.org/jira/browse/ARROW-12853 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 4.0.0 Environment: R 4.1.0 Reporter: Karl Dunkle Werner Hi, I noticed my installation failed after updating R to the newly released 4.1.0. Arrow compiles without error, but installation fails with a segfault when R tests whether the package can be loaded. Here are the relevant environment variables I've set: LIBARROW_MINIMAL: {{false}} LIBARROW_BINARY: {{ubuntu-20.04}} (also fails with {{false}}) ARROW_R_DEV: {{true}} Let me know if there are other configurations you'd like me to test. Related: ARROW-12824 The install log: {noformat} install.packages("arrow") Installing package into ‘/home/karl/.R/4.1’ (as ‘lib’ is unspecified) trying URL 'https://cloud.r-project.org/src/contrib/arrow_4.0.0.1.tar.gz' Content type 'application/x-gzip' length 426715 bytes (416 KB) == downloaded 416 KB * installing *source* package ‘arrow’ ... 
** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Using ubuntu-18.04 binary for ubuntu-20.04 trying URL 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-18.04/arrow-4.0.0.1.zip' Content type 'binary/octet-stream' length 16007220 bytes (15.3 MB) == downloaded 15.3 MB *** Successfully retrieved C++ binaries for ubuntu-18.04 Binary package requires libcurl and openssl If installation fails, retry after installing those system requirements PKG_CFLAGS=-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 PKG_LIBS=-L/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/lib -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto -lcurl ** libs g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array.cpp -o array.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. 
-flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array_to_vector.cpp -o array_to_vector.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c arraydata.cpp -o arraydata.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c arrowExports.cpp -o arrowExports.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c buffer.cpp -o buffer.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic -g -O2 -ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto -ffat-lto-objects
[jira] [Created] (ARROW-10511) [Python] Timezone error in Table.to_pandas()
Karl Dunkle Werner created ARROW-10511: -- Summary: [Python] Timezone error in Table.to_pandas() Key: ARROW-10511 URL: https://issues.apache.org/jira/browse/ARROW-10511 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Environment: Ubuntu 20.04, Python 3.8.6, Pandas 1.1.4 Reporter: Karl Dunkle Werner We're having an issue with timezones in the Table {{to_pandas}} methods. See example below. {code:python} import pyarrow as pa import pandas as pd print(pa.__version__) # 2.0.0 df = pd.DataFrame({"time": pd.to_datetime([0, 0])}) time_field = pa.field("time",type=pa.timestamp("ms", tz="utc"), nullable=False) schema = pa.schema([time_field]) tab = pa.Table.from_pandas(df, schema) tab.to_pandas() # File ".../pandas_compat.py", line 777, in table_to_blockmanager # table = _add_any_metadata(table, pandas_metadata) # File ".../pandas_compat.py", line 1184, in _add_any_metadata # tz = col_meta['metadata']['timezone'] # TypeError: 'NoneType' object is not subscriptable {code} Related issues: https://issues.apache.org/jira/browse/ARROW-9223 https://issues.apache.org/jira/browse/ARROW-9528 https://github.com/catalyst-cooperative/pudl/issues/705 -- This message was sent by Atlassian Jira (v8.3.4#803005)
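The traceback in ARROW-10511 shows a subscript on a {{None}} value. The following Python sketch reproduces that failure mode with a plain dictionary and shows a guarded lookup. It is not pyarrow's code or its eventual fix: the dictionary layout is an assumption inferred from the traceback, and the function name is hypothetical.

```python
def timezone_from_column_metadata(col_meta):
    """Return the timezone stored in column metadata, or None if absent.

    Assumption: `col_meta` is a dict whose 'metadata' entry is either None
    or a dict that may contain a 'timezone' key.
    """
    metadata = col_meta.get("metadata")
    if metadata is None:
        return None
    return metadata.get("timezone")


# Metadata as it might look for a column with no timezone recorded.
col_meta = {"name": "time", "metadata": None}

try:
    col_meta["metadata"]["timezone"]  # the unguarded access from the traceback
    unguarded_error = None
except TypeError as exc:
    unguarded_error = type(exc).__name__

print(unguarded_error)
print(timezone_from_column_metadata(col_meta))
```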
[jira] [Created] (ARROW-9946) ParquetFileWriter segfaults when `sink` is a string
Karl Dunkle Werner created ARROW-9946: - Summary: ParquetFileWriter segfaults when `sink` is a string Key: ARROW-9946 URL: https://issues.apache.org/jira/browse/ARROW-9946 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 1.0.1 Environment: Ubuntu 20.04 Reporter: Karl Dunkle Werner Hello again! I have another minor R arrow issue. The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a "string which is interpreted as a file path". However, when I try to use a string, I get a segfault because the memory isn't mapped. Maybe this is a separate request, but it would also be helpful to have documentation for the methods of the writer created by {{ParquetFileWriter$create()}}. Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html] {code:r} library(arrow) sch = schema(a = float32()) writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet") #> *** caught segfault *** #> address 0x1417d, cause 'memory not mapped' #> #> Traceback: #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, arrow_properties) #> 2: shared_ptr_is_null(xp) #> 3: shared_ptr(ParquetFileWriter, parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, arrow_properties)) #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet") # This works as expected: sink = FileOutputStream$create("test.parquet") writer = ParquetFileWriter$create(schema = sch, sink = sink) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9557) [R] Iterating over parquet columns is slow in R
Karl Dunkle Werner created ARROW-9557: - Summary: [R] Iterating over parquet columns is slow in R Key: ARROW-9557 URL: https://issues.apache.org/jira/browse/ARROW-9557 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 1.0.0 Reporter: Karl Dunkle Werner I've found that reading in a parquet file one column at a time is slow in R – much slower than reading the whole file at once in R, or reading one column at a time in Python. An example is below, though it's certainly possible I've done my benchmarking incorrectly. Python setup and benchmarking: {code:python} import numpy as np import pyarrow import pyarrow.parquet as pq from numpy.random import default_rng from time import time # Create a large, random array to save. ~1.5 GB. rng = default_rng(seed = 1) n_col = 4000 n_row = 5 mat = rng.standard_normal((n_col, n_row)) col_names = [str(nm) for nm in range(n_col)] tab = pyarrow.Table.from_arrays(mat, names=col_names) pq.write_table(tab, "test_tab.parquet", use_dictionary=False) # How long does it take to read the whole thing in python? time_start = time() _ = pq.read_table("test_tab.parquet") elapsed = time() - time_start print(elapsed) # under 1 second on my computer # How long does it take to read one column at a time? time_start = time() f = pq.ParquetFile("test_tab.parquet") for one_col in col_names: _ = f.read(columns=[one_col]).column(0) elapsed = time() - time_start print(elapsed) # about 2 seconds {code} R benchmarking, using the same {{test_tab.parquet}} file {code:r} library(arrow) read_by_column <- function(f) { table = ParquetFileReader$create(f) cols <- as.character(0:3999) purrr::walk(cols, ~table$ReadTable(.)$column(0)) } bench::mark( read_parquet("test_tab.parquet", as_data_frame=FALSE), # 0.6 s read_parquet("test_tab.parquet", as_data_frame=TRUE), # 1 s read_by_column("test_tab.parquet"),# 100 s check=FALSE ) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098535#comment-17098535 ] Karl Dunkle Werner commented on ARROW-8556: --- Update: {{LIBARROW_BINARY=ubuntu-18.04}} seems to work with Ubuntu 20.04 too. (It compiles; I haven't run the tests.) > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094007#comment-17094007 ] Karl Dunkle Werner commented on ARROW-8556: --- Update: I remembered dev packages. I had libzstd-dev 1.4.3 installed as a dependency of libgdal-dev. After uninstalling it, I was able to install arrow. Logs are below. {noformat} * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Generating code with data-raw/codegen.R Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory trying URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip' Error in download.file(from_url, to_file, quiet = quietly) : cannot open URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip' trying URL 'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-0.17.0/apache-arrow-0.17.0.tar.gz' Content type 'application/x-gzip' length 6460548 bytes (6.2 MB) == downloaded 6.2 MB*** Successfully retrieved C++ source *** Building C++ libraries rm: cannot remove 'src/*.o': No such file or directory *** Building with MAKEFLAGS= -j4 arrow with SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp BUILD_DIR=/tmp/Rtmp9loTsA/file46055b57ae53 DEST_DIR=libarrow/arrow-0.17.0 CMAKE=/usr/bin/cmake ++ pwd + : /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow + : /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp + : /tmp/Rtmp9loTsA/file46055b57ae53 + : libarrow/arrow-0.17.0 + : /usr/bin/cmake ++ cd /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp ++ pwd + SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp ++ mkdir -p libarrow/arrow-0.17.0 ++ cd libarrow/arrow-0.17.0 ++ pwd + DEST_DIR=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0 + '[' '' = '' ']' + which ninja + CMAKE_GENERATOR=Ninja + '[' false = false ']' + ARROW_JEMALLOC=ON + ARROW_WITH_BROTLI=ON + 
ARROW_WITH_BZ2=ON + ARROW_WITH_LZ4=ON + ARROW_WITH_SNAPPY=ON + ARROW_WITH_ZLIB=ON + ARROW_WITH_ZSTD=ON + mkdir -p /tmp/Rtmp9loTsA/file46055b57ae53 + pushd /tmp/Rtmp9loTsA/file46055b57ae53 /tmp/Rtmp9loTsA/file46055b57ae53 /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow + /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -DOPENSSL_USE_STATIC_LIBS=ON -G Ninja /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp -- Building using CMake version: 3.13.4 -- The C compiler identification is GNU 9.2.1 -- The CXX compiler identification is GNU 9.2.1 -- Check for working C compiler: /usr/lib/ccache/cc -- Check for working C compiler: /usr/lib/ccache/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/lib/ccache/c++ -- Check for working CXX compiler: /usr/lib/ccache/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 0.17.0 (full: '0.17.0') -- Arrow SO version: 17 (full: 17.0.0) -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") -- clang-tidy not found -- clang-format not found -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found -- Found Python3: 
/usr/bin/python3.7 (found version "3.7.5") found components: Interpreter -- Using ccache: /usr/bin/ccache -- Found cpplint executable at /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp/build-support/cpplint.py -- System processor: x86_64 -- Performing Test CXX_SUPPORTS_SSE4_2 -- Performing Test CXX_SUPPORTS_SSE4_2 - Success -- Performing Test CXX_SUPPORTS_AVX2 -- Performing Test CXX_SUPPORTS_AVX2 - Success -- Performing Test CXX_SUPPORTS_AVX512 -- Performing Test CXX_SUPPORTS_AVX512 - Success -- Arrow build warning level: PRODUCTION Using ld linker Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Using AUTO approach to find dependencies -- ARROW_AWSSDK_BUILD_VERSION: 1.7.160 -- ARROW_BOOST_BUILD_VERSION: 1.71.0 -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 --
[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091925#comment-17091925 ] Karl Dunkle Werner commented on ARROW-8556: --- {noformat} Installing package into ‘/home/karl/test_arrow’ (as ‘lib’ is unspecified) --- Please select a CRAN mirror for use in this session --- trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.17.0.tar.gz' Content type 'application/x-gzip' length 242534 bytes (236 KB) == downloaded 236 KB* installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** Generating code with data-raw/codegen.R Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory trying URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip' Error in download.file(from_url, to_file, quiet = quietly) : cannot open URL 'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip' trying URL 'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-0.17.0/apache-arrow-0.17.0.tar.gz' Content type 'application/x-gzip' length 6460548 bytes (6.2 MB) == downloaded 6.2 MB*** Successfully retrieved C++ source *** Building C++ libraries rm: cannot remove 'src/*.o': No such file or directory *** Building with MAKEFLAGS= -j4 arrow with SOURCE_DIR=/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp BUILD_DIR=/tmp/RtmptP2CaW/file476e6fba345b DEST_DIR=libarrow/arrow-0.17.0 CMAKE=/usr/bin/cmake ++ pwd + : /tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow + : /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp + : /tmp/RtmptP2CaW/file476e6fba345b + : libarrow/arrow-0.17.0 + : /usr/bin/cmake ++ cd /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp ++ pwd + SOURCE_DIR=/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp ++ mkdir -p libarrow/arrow-0.17.0 ++ cd libarrow/arrow-0.17.0 ++ pwd + DEST_DIR=/tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow/libarrow/arrow-0.17.0 + '[' 
'' = '' ']' + which ninja + CMAKE_GENERATOR=Ninja + '[' false = false ']' + ARROW_JEMALLOC=ON + ARROW_WITH_BROTLI=ON + ARROW_WITH_BZ2=ON + ARROW_WITH_LZ4=ON + ARROW_WITH_SNAPPY=ON + ARROW_WITH_ZLIB=ON + ARROW_WITH_ZSTD=ON + mkdir -p /tmp/RtmptP2CaW/file476e6fba345b + pushd /tmp/RtmptP2CaW/file476e6fba345b /tmp/RtmptP2CaW/file476e6fba345b /tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow + /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow/libarrow/arrow-0.17.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON -DOPENSSL_USE_STATIC_LIBS=ON -G Ninja /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp -- Building using CMake version: 3.13.4 -- The C compiler identification is GNU 9.2.1 -- The CXX compiler identification is GNU 9.2.1 -- Check for working C compiler: /usr/lib/ccache/cc -- Check for working C compiler: /usr/lib/ccache/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/lib/ccache/c++ -- Check for working CXX compiler: /usr/lib/ccache/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 0.17.0 (full: '0.17.0') -- Arrow SO version: 17 (full: 17.0.0) -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") -- clang-tidy not found -- clang-format not found -- 
Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found -- Found Python3: /usr/bin/python3.7 (found version "3.7.5") found components: Interpreter -- Using ccache: /usr/bin/ccache -- Found cpplint executable at /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp/build-support/cpplint.py -- System processor: x86_64 -- Performing Test CXX_SUPPORTS_SSE4_2 -- Performing Test CXX_SUPPORTS_SSE4_2 - Success -- Performing Test CXX_SUPPORTS_AVX2 -- Performing Test CXX_SUPPORTS_AVX2 - Success -- Performing Test CXX_SUPPORTS_AVX512 -- Performing Test CXX_SUPPORTS_AVX512 - Success -- Arrow build warning level: PRODUCTION Using ld linker Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Using AUTO approach to find dependencies
[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091700#comment-17091700 ] Karl Dunkle Werner commented on ARROW-8556: --- Great! If you want to get to the bottom of it, I would be happy to run commands you send me. I think most 19.10 users will be moving to 20.04 soon, so this might only be worth it if 20.04 experiences the same issue. > [R] zstd symbol not found on Ubuntu 19.10 > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090003#comment-17090003 ] Karl Dunkle Werner edited comment on ARROW-8556 at 4/22/20, 8:24 PM: - Sure! The logs are pasted below. * Setting {{LIBARROW_BINARY=ubuntu-18.04}} and reinstalling works. * I did not have zstd installed. I just installed version 1.4.3. * After installing zstd (and resetting {{LIBARROW_BINARY=true}}), installation fails again. {noformat} > install.packages("arrow") * installing *source* package ‘arrow’ ... ** package ‘arrow’ successfully unpacked and MD5 sums checked ** using staged installation *** No C++ binaries found for ubuntu-19.10 *** Successfully retrieved C++ source *** Building C++ libraries arrow PKG_CFLAGS=-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW PKG_LIBS=-L/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/lib -larrow_dataset -lparquet -larrow -larrow -larrow_dataset -lbrotlidec-static -lbrotlienc-static -lbrotlizzz-static -ljemalloc_pic -llz4 -lparquet -lsnappy -lthrift -lthriftz ** libs g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array.cpp -o array.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array_from_vector.cpp -o array_from_vector.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c array_to_vector.cpp -o array_to_vector.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c arraydata.cpp -o arraydata.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c arrowExports.cpp -o arrowExports.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c buffer.cpp -o buffer.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c chunkedarray.cpp -o chunkedarray.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c compression.cpp -o compression.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2 -fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c compute.cpp -o compute.o g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include -DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include" -fpic -g -O2
[jira] [Created] (ARROW-8556) [R] Installation fails with `LIBARROW_MINIMAL=false`
Karl Dunkle Werner created ARROW-8556:
--------------------------------------

Summary: [R] Installation fails with `LIBARROW_MINIMAL=false`
Key: ARROW-8556
URL: https://issues.apache.org/jira/browse/ARROW-8556
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 0.17.0
Environment: Ubuntu 19.10, R 3.6.1
Reporter: Karl Dunkle Werner

I would like to install the `arrow` R package on my Ubuntu 19.10 system. Prebuilt binaries are unavailable, and I want to enable compression, so I set the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks like the package is able to compile, but can't be loaded. I'm able to install correctly if I don't set the {{LIBARROW_MINIMAL}} variable.

Here's the error I get:
{code:java}
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
  ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: ZSTD_initCStream
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘~/.R/3.6/arrow’
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
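The load failure reported above hinges on a single line, the one containing `undefined symbol:`. As a rough illustration of pulling that line out of a longer install log, here is a minimal Python sketch. It assumes the error text has the shape shown in this report; the helper name `find_undefined_symbols` is hypothetical and is not part of arrow or R, and the sample text is adapted from the report with plain ASCII quotes.

```python
import re

# Error text adapted from the report above (ASCII quotes, abridged).
ERROR_TEXT = """\
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
  ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: ZSTD_initCStream
Error: loading failed
"""

# Assumed message shape: "<path>.so: undefined symbol: <name>".
_UNDEFINED = re.compile(r"(?P<lib>\S+\.so): undefined symbol: (?P<symbol>\w+)")


def find_undefined_symbols(text):
    """Return (shared_object, symbol) pairs found in a loader error message."""
    return [(m.group("lib"), m.group("symbol")) for m in _UNDEFINED.finditer(text)]


symbols = find_undefined_symbols(ERROR_TEXT)
for lib, symbol in symbols:
    print(f"{lib} is missing {symbol}")
```

The `ZSTD_` prefix of the symbol is what points at zstd here, which matches the later retitling of this issue to "zstd symbol not found".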
[jira] [Commented] (ARROW-7789) [R] Unknown error when using arrow::write_feather() in R 3.5.3
[ https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045843#comment-17045843 ] Karl Dunkle Werner commented on ARROW-7789: --- Reading hits the same issues. > [R] Unknown error when using arrow::write_feather() in R 3.5.3 > --- > > Key: ARROW-7789 > URL: https://issues.apache.org/jira/browse/ARROW-7789 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Martin >Priority: Minor > > Unknown error when using arrow::write_feather() in R 3.5.3 > pb = as.data.frame(seq(1:100)) > pbFilename <- file.path(getwd(), "reproduceBug.feather") > arrow::write_feather(x = pb, sink = pbFilename) > >Error in exists(name, envir = envir, inherits = FALSE) : > > use of NULL environment is defunct > > packageVersion('arrow') > [1] ‘0.15.1.1’ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7789) [R] Unknown error when using arrow::write_feather() in R 3.5.3
[ https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045840#comment-17045840 ]

Karl Dunkle Werner commented on ARROW-7789:
-------------------------------------------

I'm getting the same error when the R.oo package is loaded (not even attached). Here's a reprex:
{code:r}
loadNamespace("R.oo")
#>
arrow::write_parquet(mtcars, tempfile())
#> Error in exists(name, envir = envir, inherits = FALSE): use of NULL environment is defunct
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 19.10
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.7.so
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.0.0  bit_1.1-15.2      compiler_3.6.1    magrittr_1.5
#>  [5] assertthat_0.2.1  R6_2.4.1          glue_1.3.1        Rcpp_1.0.3
#>  [9] bit64_0.9-7       vctrs_0.2.3       R.methodsS3_1.8.0 arrow_0.16.0.2
#> [13] rlang_0.4.4       R.oo_1.23.0       purrr_0.3.3
{code}

> [R] Unknown error when using arrow::write_feather() in R 3.5.3
> --------------------------------------------------------------
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Martin
> Priority: Minor
>
> Unknown error when using arrow::write_feather() in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = pb, sink = pbFilename)
> >Error in exists(name, envir = envir, inherits = FALSE) :
> > use of NULL environment is defunct
> packageVersion('arrow')
> [1] ‘0.15.1.1’

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data
Karl Dunkle Werner created ARROW-7345:
--------------------------------------

Summary: [Python] Writing partitions with NaNs silently drops data
Key: ARROW-7345
URL: https://issues.apache.org/jira/browse/ARROW-7345
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Karl Dunkle Werner

When writing a partitioned table, if the partitioning column has NA values, the rows with those values are silently dropped. I think it would be helpful if there was a warning. Even better, from my perspective, would be writing out those partitions with a directory name like {{partition_col=NaN}}.

Here's a small example where only the {{b = 2}} group is written out and the {{b = NaN}} group is dropped.
{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path

# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    tf = Path(tf.name)
    tf.write_text(json_str)
    table = pyarrow.json.read_json(tf)

# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)

print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}

It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].

--
This message was sent by Atlassian Jira (v8.3.4#803005)
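The directory naming proposed in the report above can be sketched without pyarrow or pandas. The following is a plain-Python illustration of the idea only, not pyarrow's implementation: `partition_rows` and its `null_label` parameter are hypothetical names, and nulls are assumed to arrive as `None`. Rows are grouped by the partition column, and rows with a missing key go into their own `b=NaN` group instead of being skipped.

```python
from collections import defaultdict


def partition_rows(rows, partition_col, null_label="NaN"):
    """Group row dicts by partition_col, keeping rows whose key is None."""
    groups = defaultdict(list)
    for row in rows:
        key = row[partition_col]
        # Map a missing key to an explicit label rather than dropping the row.
        label = null_label if key is None else str(key)
        # Keep only the remaining columns, as a partitioned writer would.
        rest = {k: v for k, v in row.items() if k != partition_col}
        groups[f"{partition_col}={label}"].append(rest)
    return dict(groups)


rows = [{"a": 1, "b": 2}, {"a": 2, "b": None}]
partitions = partition_rows(rows, "b")
for directory, members in sorted(partitions.items()):
    print(directory, members)

# Every input row should land in exactly one partition.
written = sum(len(members) for members in partitions.values())
print(f"Wrote out {len(rows)} rows, kept {written} rows")
```

Comparing the number of rows in against the number of rows kept, as the last lines do, is also a cheap way to notice the silent drop described in the report.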
[jira] [Commented] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs
[ https://issues.apache.org/jira/browse/ARROW-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965692#comment-16965692 ] Karl Dunkle Werner commented on ARROW-7035: --- I'd be happy to write a PR! I'll have time to work on it in a few weeks. > [R] Default arguments are unclear in write_parquet docs > --- > > Key: ARROW-7035 > URL: https://issues.apache.org/jira/browse/ARROW-7035 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.15.0 > Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow > 0.15.0. >Reporter: Karl Dunkle Werner >Priority: Minor > Labels: documentation > Fix For: 1.0.0 > > > Thank you so much for adding support for reading and writing parquet files in > R! I have a few questions about the user interface and optional arguments, > but I want to highlight how great it is to have this useful filetype to pass > data back and forth. > The defaults for the optional arguments in {{arrow::write_parquet}} aren't > always clear. Here were my questions after reading the help docs from > {{write_parquet}}: > * What's the default {{version}}? Should a user prefer "2.0" for new > projects? > * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, > {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.) > * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least > some of the time. > * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}? > * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default? > As someone who works in both R and Python, I was a little surprised when > pyarrow uses snappy compression by default, but R's default is uncompressed. > My preference would be having the same default arguments, but that might be a > fringe use-case. > While I was digging into this, I was surprised that > {{ParquetReaderProperties}} is exported and documented, but > {{ParquetWriterProperties}} isn't. 
Is that intentional? > Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs
Karl Dunkle Werner created ARROW-7035:
--------------------------------------

Summary: [R] Default arguments are unclear in write_parquet docs
Key: ARROW-7035
URL: https://issues.apache.org/jira/browse/ARROW-7035
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 0.15.0
Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 0.15.0.
Reporter: Karl Dunkle Werner
Fix For: 0.15.1

Thank you so much for adding support for reading and writing parquet files in R! I have a few questions about the user interface and optional arguments, but I want to highlight how great it is to have this useful filetype to pass data back and forth.

The defaults for the optional arguments in {{arrow::write_parquet}} aren't always clear. Here were my questions after reading the help docs from {{write_parquet}}:
* What's the default {{version}}? Should a user prefer "2.0" for new projects?
* What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
* What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least some of the time.
* What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
* Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?

As someone who works in both R and Python, I was a little surprised when pyarrow uses snappy compression by default, but R's default is uncompressed. My preference would be having the same default arguments, but that might be a fringe use-case.

While I was digging into this, I was surprised that {{ParquetReaderProperties}} is exported and documented, but {{ParquetWriterProperties}} isn't. Is that intentional?

Thanks!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
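One way to make the accepted {{compression}} values discoverable is to have the error message list them. The following is a small Python sketch of that idea and nothing more: `check_compression` is a hypothetical helper, not an arrow or pyarrow function, the list of codecs is copied from the report above rather than from any arrow release, and the case-insensitive matching is an assumption made for the example.

```python
# Codec names as listed in the report above; not checked against any arrow release.
ACCEPTED_COMPRESSION = ("uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4")


def check_compression(value):
    """Return the normalized codec name, or raise ValueError listing the options."""
    # Assumption for this sketch: matching ignores case and surrounding spaces.
    normalized = value.strip().lower()
    if normalized not in ACCEPTED_COMPRESSION:
        options = ", ".join(ACCEPTED_COMPRESSION)
        raise ValueError(f"unknown compression {value!r}; expected one of: {options}")
    return normalized


print(check_compression("Snappy"))
try:
    check_compression("bzip2")
except ValueError as err:
    print(err)
```

A user who passes an unsupported value then sees the full set of choices without having to read the source.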
[jira] [Created] (ARROW-6142) [R] Install instructions on linux could be clearer
Karl Dunkle Werner created ARROW-6142:
--------------------------------------

Summary: [R] Install instructions on linux could be clearer
Key: ARROW-6142
URL: https://issues.apache.org/jira/browse/ARROW-6142
Project: Apache Arrow
Issue Type: Wish
Components: R
Affects Versions: 0.14.1
Environment: Ubuntu 19.04
Reporter: Karl Dunkle Werner
Fix For: 0.15.0

Installing R packages on Linux is almost always from source, which means Arrow needs some system dependencies. The existing help message (from arrow::install_arrow()) is very helpful in pointing that out, but it's still a heavy lift for users who install R packages from source but don't plan to develop Arrow itself.

Here are a couple of things that could make things slightly smoother:
# I would be very grateful if the install_arrow() message or installation page told me which libraries were essential to make the R package work.
# install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" instead of just "PPA" would have caused me less confusion. (Others may differ)
# A snap package would be easier than installing a new apt address, but I understand that building for snap would be more packaging work and only benefits Ubuntu users.

Thanks for making R bindings, and congratulations on the CRAN release!

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet
Karl Dunkle Werner created ARROW-5480:
--------------------------------------

Summary: [Python] Pandas categorical type doesn't survive a round-trip through parquet
Key: ARROW-5480
URL: https://issues.apache.org/jira/browse/ARROW-5480
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.13.0, 0.11.1
Environment:
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0
Reporter: Karl Dunkle Werner

A string categorical variable written from pandas to parquet is read back as string (object dtype). I expected it to be read as category. The same thing happens if the category is numeric -- a numeric category is read back as int64.

In the code below, I tried out an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file, the category type is lost.

In the scheme of things, this isn't a big deal, but it's a small surprise.
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object

# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes  # category
pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)