[jira] [Created] (ARROW-16297) [R] Improve detection of ARROW_*_URL variables for offline build

2022-04-23 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-16297:
--

 Summary: [R] Improve detection of ARROW_*_URL variables for 
offline build
 Key: ARROW-16297
 URL: https://issues.apache.org/jira/browse/ARROW-16297
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
 Environment: *nix offline builds
Reporter: Karl Dunkle Werner
Assignee: Karl Dunkle Werner


As [~npr] mentioned in 
[https://github.com/apache/arrow/pull/12849#issuecomment-1101489333], the 
current code in {{nixlibs.R}} doesn't handle components that have multiple 
words (because of the way it parses variable names from filenames). Until now, 
we've had a special case for the AWS variables, but 
{{ARROW_GOOGLE_CLOUD_CPP_URL}} and {{ARROW_NLOHMANN_JSON_URL}} also need 
handling. Instead of adding special cases, we can provide the correct 
{{ARROW_*_URL}} values with the new bash script added as part of ARROW-15092.
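
The parsing problem can be illustrated with a minimal Python sketch. It is not the logic in {{nixlibs.R}}: the function names, filenames, and version numbers are hypothetical, and it only shows why keeping the first word of a filename breaks for multi-word components.

```python
import re


def naive_var_name(filename):
    """Hypothetical parser: keep only the text before the first hyphen."""
    component = filename.split("-", 1)[0]
    return f"ARROW_{component.upper()}_URL"


def full_var_name(filename):
    """Hypothetical parser: keep everything before the version suffix."""
    # Cut at the first hyphen that is followed by an optional "v" and a digit.
    component = re.split(r"-v?\d", filename, maxsplit=1)[0]
    # Replace runs of non-alphanumeric characters with underscores.
    return "ARROW_" + re.sub(r"[^A-Za-z0-9]+", "_", component).upper() + "_URL"


# Example filenames are made up for illustration.
for name in ["zlib-1.2.11.tar.gz", "google-cloud-cpp-v1.39.0.tar.gz"]:
    print(name, naive_var_name(name), full_var_name(name))
```

Both functions agree for single-word components and differ for multi-word ones. Taking the values from the script's output instead avoids filename parsing altogether.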

 

I'll add a PR.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-14210) [C++] CMAKE_AR is not passed to bzip2 thirdparty dependency

2021-10-03 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-14210:
--

 Summary: [C++] CMAKE_AR is not passed to bzip2 thirdparty 
dependency
 Key: ARROW-14210
 URL: https://issues.apache.org/jira/browse/ARROW-14210
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 5.0.0
Reporter: Karl Dunkle Werner


It seems like the {{AR}} or {{CMAKE_AR}} variables aren't getting passed for 
the bzip2 build, which causes it to fail if we're doing a {{BUNDLED}} build and 
{{ar}} isn't available in the {{$PATH}} (e.g. in a conda environment).

To replicate:
 1. Download Arrow and start an interactive shell in a container 
 (docker should be fine if you prefer it to podman)
{code:sh}
git clone --depth 1 g...@github.com:apache/arrow.git
podman run -it --rm -v ./arrow:/arrow:Z \
  docker://ursalab/amd64-ubuntu-18.04-conda-python-3.6:worker bash
{code}
2. Build Arrow by running this in the container:
{code:sh}
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export ARROW_HOME=$CONDA_PREFIX
export PARQUET_HOME=$CONDA_PREFIX

cd /arrow
mkdir -p cpp/build
pushd cpp/build

cmake \
  -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_AR=${AR} \
  -DCMAKE_RANLIB=${RANLIB} \
  -DARROW_WITH_BZ2=ON \
  -DARROW_VERBOSE_THIRDPARTY_BUILD=ON \
  -DARROW_JEMALLOC=OFF \
  -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE \
  -DARROW_DEPENDENCY_SOURCE=BUNDLED \
  ..
make
# make[3]: ar: No such file or directory
# make[3]: *** [Makefile:48: libbz2.a] Error 127
# make[2]: *** [CMakeFiles/bzip2_ep.dir/build.make:135: bzip2_ep-prefix/src/bzip2_ep-stamp/bzip2_ep-build] Error 2
# make[1]: *** [CMakeFiles/Makefile2:726: CMakeFiles/bzip2_ep.dir/all] Error 2

{code}
In the cmake call above, {{ARROW_JEMALLOC}} and the SIMD flags are just to skip 
compiling irrelevant things.

I think this line in {{ThirdpartyToolchain.cmake}} needs to be changed to pass 
{{CMAKE_AR}}.
 
[https://github.com/apache/arrow/blob/bad8824d5cda0fd8337c7167729c49af868f93a5/cpp/cmake_modules/ThirdpartyToolchain.cmake#L2211]

Other related issues have also needed to pass {{CMAKE_RANLIB}}, in addition to 
{{CMAKE_AR}}. I'm not sure if that applies here.
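
To check the precondition for this failure, it may help to see how the archiver resolves in the current environment. The sketch below is a hypothetical Python helper, not part of Arrow's build. It assumes the toolchain exports {{AR}} and {{RANLIB}} as environment variables, prefers those, and otherwise searches {{PATH}}.

```python
import os
import shutil


def resolve_tool(name, env=None):
    """Return the program to use for `name` ("ar", "ranlib"), or None.

    Hypothetical helper: an upper-cased environment variable (AR, RANLIB)
    wins; otherwise the tool is looked up on PATH.
    """
    env = os.environ if env is None else env
    override = env.get(name.upper())
    if override:
        return override
    # shutil.which returns None when the program is not found on the path.
    return shutil.which(name, path=env.get("PATH"))


for tool in ("ar", "ranlib"):
    print(tool, "->", resolve_tool(tool))
```

If {{resolve_tool("ar")}} only succeeds because {{AR}} is set, a sub-build that calls plain {{ar}} would fail in the way shown above unless the value is passed through.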

 
 Related: ARROW-4471, ARROW-4831





[jira] [Created] (ARROW-13787) Verify third-party downloads

2021-08-27 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-13787:
--

 Summary: Verify third-party downloads
 Key: ARROW-13787
 URL: https://issues.apache.org/jira/browse/ARROW-13787
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Affects Versions: 5.0.0
Reporter: Karl Dunkle Werner
Assignee: Karl Dunkle Werner


I think it might be helpful to have cmake use an SHA256 hash to verify the 
third-party files it downloads. I can submit a PR for this.

Upsides:
 - Downloads are further verified for integrity (in addition to the 
verification from https)
 - cmake stops complaining about missing verification (when 
{{ARROW_VERBOSE_THIRDPARTY_BUILD=ON}})

Downside:
 - Slightly more work in the future to add or update a third-party dependency.

The [cmake 
docs|https://cmake.org/cmake/help/latest/module/ExternalProject.html] note:
{quote}Specifying [URL_HASH] is strongly recommended for URL downloads, as it 
ensures the integrity of the downloaded content. It is also used as a check for 
a previously downloaded file, allowing connection to the remote location to be 
avoided altogether if the local directory already has a file from an earlier 
download that matches the specified hash.
{quote}
SHA256 was introduced in [cmake 
2.8.7|https://blog.kitware.com/cmake-2-8-7-now-available/], released in late 
2011.
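
For reference, the check that a SHA256 hash performs can be sketched in a few lines of Python. This is an illustration of the idea only, not how cmake implements it; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file's digest does not match the expected one."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
    return True
```

The same digest can also decide whether a previously downloaded file can be reused, which is the second benefit the cmake docs mention.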





[jira] [Created] (ARROW-13776) [C++] Offline thirdparty versions.txt is missing extensions for some files

2021-08-26 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-13776:
--

 Summary: [C++] Offline thirdparty versions.txt is missing 
extensions for some files
 Key: ARROW-13776
 URL: https://issues.apache.org/jira/browse/ARROW-13776
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Affects Versions: 5.0.0
Reporter: Karl Dunkle Werner
Assignee: Karl Dunkle Werner


The file {{cpp/thirdparty/versions.txt}} lists third-party dependencies, and 
the filenames that {{download_dependencies.sh}} should use when saving them.
A couple of those files, for {{aws-checksums}} and {{aws-c-event-stream}}, are 
missing extensions.
When I try to use those files, e.g. with {{$ARROW_AWS_CHECKSUMS_URL}}, cmake 
has an error:

{noformat}
CMake Error at /usr/share/cmake/Modules/ExternalProject.cmake:1561 (message):
  error: do not know how to extract
  
'/tmp/RtmpuzmuVM/R.INSTALL3f194a9055a6/arrow/arrow-thirdparty/aws-c-event-stream-v0.1.5'
  -- known types are .7z, .tar, .tar.bz2, .tar.gz, .tar.xz, .tbz2, .tgz, .txz
  and .zip
{noformat}

This error is fixed if I manually add {{.tar.gz}} to {{aws-checksums-v0.1.10}} 
and {{aws-c-event-stream-v0.1.5}}.
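
A quick way to spot other entries with the same problem is to compare each filename against the archive types listed in the cmake error. The Python sketch below does that; the helper name is hypothetical and the suffix list is copied from the error message above.

```python
# Archive types listed in the ExternalProject error message.
KNOWN_ARCHIVE_SUFFIXES = (
    ".7z", ".tar", ".tar.bz2", ".tar.gz", ".tar.xz",
    ".tbz2", ".tgz", ".txz", ".zip",
)


def has_known_extension(filename):
    """Return True if `filename` ends with one of the known archive suffixes."""
    return filename.lower().endswith(KNOWN_ARCHIVE_SUFFIXES)


for name in ["aws-checksums-v0.1.10", "aws-c-event-stream-v0.1.5.tar.gz"]:
    print(name, has_known_extension(name))
```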





[jira] [Created] (ARROW-13768) [R] Allow JSON to be an optional component

2021-08-26 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-13768:
--

 Summary: [R] Allow JSON to be an optional component
 Key: ARROW-13768
 URL: https://issues.apache.org/jira/browse/ARROW-13768
 Project: Apache Arrow
  Issue Type: Task
  Components: R
Affects Versions: 5.0.0
Reporter: Karl Dunkle Werner


JSON support requires RapidJSON, a third-party dependency that might not always 
be available. Particularly for offline static builds (ARROW-12981), it would be 
nice to allow {{ARROW_JSON=OFF}}.

Here's the [relevant 
section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L290-L292]
 of {{ThirdpartyToolchain.cmake}}:
{code:none}
if(ARROW_JSON)
  set(ARROW_WITH_RAPIDJSON ON)
endif()
{code}
And the [relevant 
section|https://github.com/apache/arrow/blob/64bef2ad8d9cd2fea122cfa079f8ca3fea8cdf5d/r/inst/build_arrow_static.sh#L62]
 of the {{build_arrow_static.sh}} script.

As Neal 
[mentioned|https://github.com/apache/arrow/pull/11001#discussion_r696723923], 
there's more to do than just replacing {{-DARROW_JSON=ON}} with 
{{-DARROW_JSON=$\{ARROW_JSON:-ON}}}. "We'll have to conditionally build some of 
the bindings like we do with dataset and parquet, and we'll have to 
conditionally skip tests."
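
For readers less familiar with the shell's {{$\{VAR:-default}}} form: it uses the default when the variable is unset or empty. A minimal Python sketch of the same behaviour follows; the function name is hypothetical and it is not part of {{build_arrow_static.sh}}.

```python
import os


def cmake_json_flag(env=None):
    """Build the -DARROW_JSON flag, defaulting to ON when unset or empty."""
    env = os.environ if env is None else env
    # `or "ON"` covers both a missing key and an empty string, like ${VAR:-ON}.
    value = env.get("ARROW_JSON") or "ON"
    return f"-DARROW_JSON={value}"


print(cmake_json_flag({}))
print(cmake_json_flag({"ARROW_JSON": "OFF"}))
```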





[jira] [Created] (ARROW-12981) Wish: Install source package from CRAN alone

2021-06-04 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-12981:
--

 Summary: Wish: Install source package from CRAN alone
 Key: ARROW-12981
 URL: https://issues.apache.org/jira/browse/ARROW-12981
 Project: Apache Arrow
  Issue Type: Wish
  Components: Packaging, R
Affects Versions: 4.0.1
 Environment: Linux
Reporter: Karl Dunkle Werner


Hello,

I would like to install {{Arrow}} on Linux using only CRAN, without downloading 
additional files from Github, Apache, or Ursa Labs. I understand this is a big 
ask, and might not be a priority for you all. Feel free to close if you feel 
that this is out of scope.

Why is a CRAN-only installation useful?
 # It's common for organizations to set up firewalls that prevent arbitrary 
downloads, but allow access to their own internal CRAN mirror.
 ** Sometimes these firewalls also allow requests to Github, but often not.
 # On a broader level, my favorite thing about R is CRAN, the CRAN maintainers, 
and their 
[policy|https://cran.r-project.org/web/packages/policies.html#Source-packages] 
that "Source packages may not contain any form of binary executable code." By 
distributing most of the Arrow code separately (either as source C++ or a 
compiled library), automated code archives and other source-based tools become 
much less useful.

Of course, {{arrow}} isn't the only R package to depend on external libraries 
or distribute code separately. If a CRAN-only approach isn't viable, it would 
still be useful to have an all-offline method. I'm also having trouble getting 
an offline install to work, even with a local copy of the Arrow repo. (See the 
bottom of the script below.)

 

What does installing offline look like now?
 Here's a bash script that approximates installing behind a firewall.
{code:sh}
git clone --depth 1 g...@github.com:apache/arrow.git test_arrow

cd test_arrow
wget 'https://cran.r-project.org/src/contrib/arrow_4.0.1.tar.gz'

# Set up a temporary R library (optional)
mkdir test_r_lib
export R_LIBS_USER=test_r_lib

export ARROW_R_DEV=true
export LIBARROW_MINIMAL=false
export LIBARROW_DOWNLOAD=false
export LIBARROW_BINARY=false
export LIBARROW_BUILD=true

# These are all of the direct dependencies, including Suggests
# This isn't required if the packages are already installed
Rscript -e "install.packages(c('assertthat', 'bit64', 'purrr', 'R6', 'rlang', 
'tidyselect', 'vctrs', 'cpp11', 'decor', 'distro', 'dplyr', 'hms', 'knitr', 
'lubridate', 'pkgload', 'reticulate', 'rmarkdown', 'stringr', 'testthat', 
'tibble', 'withr'))"



# Disable your internet connection here.



# Now try to install the R package we downloaded with wget.
# This is an approximation of being behind a firewall.
Rscript -e 'install.packages("arrow_4.0.1.tar.gz", repos=NULL)'

# It successfully installs the R component, but not the C++ library, 
# even with LIBARROW_BUILD=true
Rscript -e "arrow::arrow_available()"
# [1] FALSE


# As mentioned in the installation vignette, 
# we can R CMD INSTALL in the git repo.

R CMD INSTALL r

# This will try to build the C++ library, but fails when mimalloc and 
# jemalloc can't be downloaded from Github.
# (Seems not to be affected by LIBARROW_DOWNLOAD=false).
# When C++ compilation fails, the R component still installs.

{code}





[jira] [Created] (ARROW-12853) [R] Install fails on R 4.1.0

2021-05-22 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-12853:
--

 Summary: [R] Install fails on R 4.1.0
 Key: ARROW-12853
 URL: https://issues.apache.org/jira/browse/ARROW-12853
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 4.0.0
 Environment: R 4.1.0
Reporter: Karl Dunkle Werner


Hi,

I noticed my installation failed after updating R to the newly released 4.1.0. 
Arrow compiles without error, but installation fails with a segfault when R 
tests whether the package can be loaded.
Here are the relevant environment variables I've set:
LIBARROW_MINIMAL: {{false}}
LIBARROW_BINARY: {{ubuntu-20.04}} (also fails with {{false}})
ARROW_R_DEV: {{true}}
Let me know if there are other configurations you'd like me to test.

Related: ARROW-12824

The install log:
{noformat}
install.packages("arrow")
Installing package into ‘/home/karl/.R/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/arrow_4.0.0.1.tar.gz'
Content type 'application/x-gzip' length 426715 bytes (416 KB)
==
downloaded 416 KB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Using ubuntu-18.04 binary for ubuntu-20.04
trying URL 
'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-18.04/arrow-4.0.0.1.zip'
Content type 'binary/octet-stream' length 16007220 bytes (15.3 MB)
==
downloaded 15.3 MB

*** Successfully retrieved C++ binaries for ubuntu-18.04
 Binary package requires libcurl and openssl
 If installation fails, retry after installing those system requirements
PKG_CFLAGS=-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include
  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3
PKG_LIBS=-L/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/lib
 -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies 
-larrow_dataset -lparquet -lssl -lcrypto -lcurl
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c array.cpp -o array.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c array_to_vector.cpp -o array_to_vector.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c arraydata.cpp -o arraydata.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c arrowExports.cpp -o arrowExports.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c buffer.cpp -o buffer.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpGwVvlP/R.INSTALL106e64cdb5ef7/arrow/libarrow/arrow-4.0.0.1/include  
-DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
-DARROW_R_WITH_S3 -I'/home/karl/.R/4.1/cpp11/include'-fpic  -g -O2 
-ffile-prefix-map=/build/r-base-aXXzqd/r-base-4.1.0=. -flto=auto 
-ffat-lto-objects 

[jira] [Created] (ARROW-10511) [Python] Timezone error in Table.to_pandas()

2020-11-06 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-10511:
--

 Summary: [Python] Timezone error in Table.to_pandas()
 Key: ARROW-10511
 URL: https://issues.apache.org/jira/browse/ARROW-10511
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: Ubuntu 20.04, Python 3.8.6, Pandas 1.1.4
Reporter: Karl Dunkle Werner


We're having an issue with timezones in the Table {{to_pandas}} methods. See 
example below.

{code:python}
import pyarrow as pa
import pandas as pd

print(pa.__version__)
# 2.0.0

df = pd.DataFrame({"time": pd.to_datetime([0, 0])})

time_field = pa.field("time",type=pa.timestamp("ms", tz="utc"), nullable=False)
schema = pa.schema([time_field])

tab = pa.Table.from_pandas(df, schema)

tab.to_pandas() 

# File ".../pandas_compat.py", line 777, in table_to_blockmanager
#   table = _add_any_metadata(table, pandas_metadata)
# File ".../pandas_compat.py", line 1184, in _add_any_metadata
#   tz = col_meta['metadata']['timezone']
# TypeError: 'NoneType' object is not subscriptable

{code}
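
The traceback shows {{col_meta['metadata']}} being {{None}} when the timezone is looked up. A minimal sketch of a defensive lookup, assuming a plain dict shaped like the one in the traceback (the helper name is hypothetical, and this is not necessarily the fix that belongs in pyarrow):

```python
def timezone_from_column_metadata(col_meta):
    """Return the 'timezone' entry, or None when the metadata is missing."""
    # Treat a missing key and an explicit None the same way.
    metadata = col_meta.get("metadata") or {}
    return metadata.get("timezone")


print(timezone_from_column_metadata({"metadata": None}))
print(timezone_from_column_metadata({"metadata": {"timezone": "UTC"}}))
```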


Related issues:
https://issues.apache.org/jira/browse/ARROW-9223
https://issues.apache.org/jira/browse/ARROW-9528
https://github.com/catalyst-cooperative/pudl/issues/705





[jira] [Created] (ARROW-9946) ParquetFileWriter segfaults when `sink` is a string

2020-09-08 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-9946:
-

 Summary: ParquetFileWriter segfaults when `sink` is a string
 Key: ARROW-9946
 URL: https://issues.apache.org/jira/browse/ARROW-9946
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 1.0.1
 Environment: Ubuntu 20.04
Reporter: Karl Dunkle Werner


Hello again! I have another minor R arrow issue.

 

The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a "string 
which is interpreted as a file path". However, when I try to use a string, I 
get a segfault because the memory isn't mapped.

 

Maybe this is a separate request, but it would also be helpful to have 
documentation for the methods of the writer created by 
{{ParquetFileWriter$create()}}.

Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]

 
{code:r}
library(arrow)

sch = schema(a = float32())
writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")

#> *** caught segfault ***
#> address 0x1417d, cause 'memory not mapped'
#> 
#> Traceback:
#> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
arrow_properties)
#> 2: shared_ptr_is_null(xp)
#> 3: shared_ptr(ParquetFileWriter, 
parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
arrow_properties))
#> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")


# This works as expected:
sink = FileOutputStream$create("test.parquet")
writer = ParquetFileWriter$create(schema = sch, sink = sink)
{code}





[jira] [Created] (ARROW-9557) [R] Iterating over parquet columns is slow in R

2020-07-25 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-9557:
-

 Summary: [R] Iterating over parquet columns is slow in R
 Key: ARROW-9557
 URL: https://issues.apache.org/jira/browse/ARROW-9557
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 1.0.0
Reporter: Karl Dunkle Werner


I've found that reading in a parquet file one column at a time is slow in R – 
much slower than reading the whole file all at once in R, or reading one column at a 
time in Python.

An example is below, though it's certainly possible I've done my benchmarking 
incorrectly.

 

Python setup and benchmarking:
{code:python}
import numpy as np
import pyarrow
import pyarrow.parquet as pq
from numpy.random import default_rng
from time import time

# Create a large, random array to save. ~1.5 GB.
rng = default_rng(seed = 1)
n_col = 4000
n_row = 5

mat = rng.standard_normal((n_col, n_row))
col_names = [str(nm) for nm in range(n_col)]
tab = pyarrow.Table.from_arrays(mat, names=col_names)

pq.write_table(tab, "test_tab.parquet", use_dictionary=False)

# How long does it take to read the whole thing in python?
time_start = time()
_ = pq.read_table("test_tab.parquet")
elapsed = time() - time_start
print(elapsed) # under 1 second on my computer


time_start = time()
f = pq.ParquetFile("test_tab.parquet")
for one_col in col_names:
    _ = f.read(columns=[one_col]).column(0)

elapsed = time() - time_start
print(elapsed) # about 2 seconds


{code}
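
To reduce timing noise on the Python side, the single {{time()}} measurements could be replaced with a best-of-N timer. This is a sketch only; the helper name is hypothetical and it was not used for the numbers reported here.

```python
from time import perf_counter


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        timings.append(perf_counter() - start)
    return min(timings)


# Example with a cheap stand-in workload.
print(best_of(lambda: sum(range(100_000))))
```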
R benchmarking, using the same {{test_tab.parquet}} file
{code:r}
library(arrow)

read_by_column <- function(f) {
  table = ParquetFileReader$create(f)
  cols <- as.character(0:3999)
  purrr::walk(cols, ~table$ReadTable(.)$column(0))
}

bench::mark(
  read_parquet("test_tab.parquet", as_data_frame=FALSE), #   0.6 s
  read_parquet("test_tab.parquet", as_data_frame=TRUE),  #   1 s
  read_by_column("test_tab.parquet"),                    # 100 s
  check=FALSE
)

{code}





[jira] [Commented] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd

2020-05-03 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098535#comment-17098535
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

Update: {{LIBARROW_BINARY=ubuntu-18.04}} seems to work with Ubuntu 20.04 too. 
(It compiles; I haven't run the tests.)


> [R] zstd symbol not found if there are multiple installations of zstd
> -
>
> Key: ARROW-8556
> URL: https://issues.apache.org/jira/browse/ARROW-8556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: Ubuntu 19.10
> R 3.6.1
>Reporter: Karl Dunkle Werner
>Priority: Major
>
> I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
> Prebuilt binaries are unavailable, and I want to enable compression, so I set 
> the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
> like the package is able to compile, but can't be loaded. I'm able to install 
> correctly if I don't set the {{LIBARROW_MINIMAL}} variable.
> Here's the error I get:
> {code:java}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
> ZSTD_initCStream
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing ‘~/.R/3.6/arrow’
> {code}
>  





[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-27 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094007#comment-17094007
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

Update: I remembered dev packages.

I had libzstd-dev 1.4.3 installed as a dependency of libgdal-dev. After 
uninstalling it, I was able to install arrow. Logs are below.

 

 
{noformat}
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Generating code with data-raw/codegen.R
Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
Error in download.file(from_url, to_file, quiet = quietly) : 
  cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
trying URL 
'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-0.17.0/apache-arrow-0.17.0.tar.gz'
Content type 'application/x-gzip' length 6460548 bytes (6.2 MB)
==
downloaded 6.2 MB

*** Successfully retrieved C++ source
*** Building C++ libraries
rm: cannot remove 'src/*.o': No such file or directory
*** Building with MAKEFLAGS=  -j4 
 arrow with 
SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp 
BUILD_DIR=/tmp/Rtmp9loTsA/file46055b57ae53 DEST_DIR=libarrow/arrow-0.17.0 
CMAKE=/usr/bin/cmake 
++ pwd
+ : /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow
+ : /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
+ : /tmp/Rtmp9loTsA/file46055b57ae53
+ : libarrow/arrow-0.17.0
+ : /usr/bin/cmake
++ cd /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
++ pwd
+ SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
++ mkdir -p libarrow/arrow-0.17.0
++ cd libarrow/arrow-0.17.0
++ pwd
+ DEST_DIR=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0
+ '[' '' = '' ']'
+ which ninja
+ CMAKE_GENERATOR=Ninja
+ '[' false = false ']'
+ ARROW_JEMALLOC=ON
+ ARROW_WITH_BROTLI=ON
+ ARROW_WITH_BZ2=ON
+ ARROW_WITH_LZ4=ON
+ ARROW_WITH_SNAPPY=ON
+ ARROW_WITH_ZLIB=ON
+ ARROW_WITH_ZSTD=ON
+ mkdir -p /tmp/Rtmp9loTsA/file46055b57ae53
+ pushd /tmp/Rtmp9loTsA/file46055b57ae53
/tmp/Rtmp9loTsA/file46055b57ae53 /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON 
-DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON 
-DOPENSSL_USE_STATIC_LIBS=ON -G Ninja 
/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/lib/ccache/cc
-- Check for working C compiler: /usr/lib/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/lib/ccache/c++
-- Check for working CXX compiler: /usr/lib/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 0.17.0 (full: '0.17.0')
-- Arrow SO version: 17 (full: 17.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") 
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.5") found components:  
Interpreter 
-- Using ccache: /usr/bin/ccache
-- Found cpplint executable at 
/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Using AUTO approach to find dependencies
-- ARROW_AWSSDK_BUILD_VERSION: 1.7.160
-- ARROW_BOOST_BUILD_VERSION: 1.71.0
-- ARROW_BROTLI_BUILD_VERSION: v1.0.7
-- ARROW_BZIP2_BUILD_VERSION: 1.0.8
-- 

[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-24 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091925#comment-17091925
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

{noformat}
Installing package into ‘/home/karl/test_arrow’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cloud.r-project.org/src/contrib/arrow_0.17.0.tar.gz'
Content type 'application/x-gzip' length 242534 bytes (236 KB)
==
downloaded 236 KB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Generating code with data-raw/codegen.R
Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
Error in download.file(from_url, to_file, quiet = quietly) : 
  cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
trying URL 
'https://www.apache.org/dyn/closer.lua?action=download=arrow/arrow-0.17.0/apache-arrow-0.17.0.tar.gz'
Content type 'application/x-gzip' length 6460548 bytes (6.2 MB)
==
downloaded 6.2 MB

*** Successfully retrieved C++ source
*** Building C++ libraries
rm: cannot remove 'src/*.o': No such file or directory
*** Building with MAKEFLAGS=  -j4 
 arrow with 
SOURCE_DIR=/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp 
BUILD_DIR=/tmp/RtmptP2CaW/file476e6fba345b DEST_DIR=libarrow/arrow-0.17.0 
CMAKE=/usr/bin/cmake 
++ pwd
+ : /tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow
+ : /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp
+ : /tmp/RtmptP2CaW/file476e6fba345b
+ : libarrow/arrow-0.17.0
+ : /usr/bin/cmake
++ cd /tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp
++ mkdir -p libarrow/arrow-0.17.0
++ cd libarrow/arrow-0.17.0
++ pwd
+ DEST_DIR=/tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow/libarrow/arrow-0.17.0
+ '[' '' = '' ']'
+ which ninja
+ CMAKE_GENERATOR=Ninja
+ '[' false = false ']'
+ ARROW_JEMALLOC=ON
+ ARROW_WITH_BROTLI=ON
+ ARROW_WITH_BZ2=ON
+ ARROW_WITH_LZ4=ON
+ ARROW_WITH_SNAPPY=ON
+ ARROW_WITH_ZLIB=ON
+ ARROW_WITH_ZSTD=ON
+ mkdir -p /tmp/RtmptP2CaW/file476e6fba345b
+ pushd /tmp/RtmptP2CaW/file476e6fba345b
/tmp/RtmptP2CaW/file476e6fba345b /tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON 
-DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/RtmpynJFHV/R.INSTALL474739c260b7/arrow/libarrow/arrow-0.17.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON 
-DOPENSSL_USE_STATIC_LIBS=ON -G Ninja 
/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/lib/ccache/cc
-- Check for working C compiler: /usr/lib/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/lib/ccache/c++
-- Check for working CXX compiler: /usr/lib/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 0.17.0 (full: '0.17.0')
-- Arrow SO version: 17 (full: 17.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") 
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.5") found components:  
Interpreter 
-- Using ccache: /usr/bin/ccache
-- Found cpplint executable at 
/tmp/RtmptP2CaW/file476e274f73a4/apache-arrow-0.17.0/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Using AUTO approach to find dependencies

[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-24 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091700#comment-17091700
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

Great!

If you want to get to the bottom of it, I would be happy to run commands you 
send me. I think most 19.10 users will be moving to 20.04 soon, so this might 
only be worth it if 20.04 experiences the same issue.

> [R] zstd symbol not found on Ubuntu 19.10
> -
>
> Key: ARROW-8556
> URL: https://issues.apache.org/jira/browse/ARROW-8556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: Ubuntu 19.10
> R 3.6.1
>Reporter: Karl Dunkle Werner
>Priority: Major
>
> I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
> Prebuilt binaries are unavailable, and I want to enable compression, so I set 
> the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
> like the package is able to compile, but can't be loaded. I'm able to install 
> correctly if I don't set the {{LIBARROW_MINIMAL}} variable.
> Here's the error I get:
> {code:java}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
> ZSTD_initCStream
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing ‘~/.R/3.6/arrow’
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-22 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090003#comment-17090003
 ] 

Karl Dunkle Werner edited comment on ARROW-8556 at 4/22/20, 8:24 PM:
-

Sure! The logs are pasted below.
 * Setting {{LIBARROW_BINARY=ubuntu-18.04}} and reinstalling works.
 * I did not have zstd installed. I just installed version 1.4.3.
 * After installing zstd (and resetting {{LIBARROW_BINARY=true}}), installation 
fails again.

 
{noformat}
> install.packages("arrow")

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** No C++ binaries found for ubuntu-19.10
*** Successfully retrieved C++ source
*** Building C++ libraries
 arrow  
PKG_CFLAGS=-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include
  -DARROW_R_WITH_ARROW
PKG_LIBS=-L/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/lib
 -larrow_dataset -lparquet -larrow -larrow -larrow_dataset -lbrotlidec-static 
-lbrotlienc-static -lbrotlizzz-static -ljemalloc_pic -llz4 -lparquet -lsnappy 
-lthrift -lthriftz
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array.cpp -o array.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array_from_vector.cpp -o array_from_vector.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array_to_vector.cpp -o array_to_vector.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
arraydata.cpp -o arraydata.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
arrowExports.cpp -o arrowExports.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
buffer.cpp -o buffer.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
chunkedarray.cpp -o chunkedarray.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
compression.cpp -o compression.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
compute.cpp -o compute.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 

[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-22 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090003#comment-17090003
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

Sure! The logs are pasted below.
 * Setting {{LIBARROW_BINARY=ubuntu-18.04}} and reinstalling works.
 * I did not have zstd installed. I just installed version 1.4.3.
 * After installing zstd (and resetting {{LIBARROW_BINARY=true}}), installation 
fails again.

 
{noformat}
> install.packages("arrow")

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** No C++ binaries found for ubuntu-19.10
*** Successfully retrieved C++ source
*** Building C++ libraries
 arrow  
PKG_CFLAGS=-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include
  -DARROW_R_WITH_ARROW
PKG_LIBS=-L/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/lib
 -larrow_dataset -lparquet -larrow -larrow -larrow_dataset -lbrotlidec-static 
-lbrotlienc-static -lbrotlizzz-static -ljemalloc_pic -llz4 -lparquet -lsnappy 
-lthrift -lthriftz
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array.cpp -o array.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array_from_vector.cpp -o array_from_vector.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
array_to_vector.cpp -o array_to_vector.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
arraydata.cpp -o arraydata.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
arrowExports.cpp -o arrowExports.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
buffer.cpp -o buffer.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
chunkedarray.cpp -o chunkedarray.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
compression.cpp -o compression.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. -fstack-protector-strong 
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c 
compute.cpp -o compute.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
-I/tmp/RtmpDyA8s0/R.INSTALL3f884be9fa7b/arrow/libarrow/arrow-0.17.0/include  
-DARROW_R_WITH_ARROW -I"/home/karl/test_arrow/Rcpp/include"   -fpic  -g -O2 
-fdebug-prefix-map=/build/r-base-k1TtL4/r-base-3.6.1=. 

[jira] [Created] (ARROW-8556) [R] Installation fails with `LIBARROW_MINIMAL=false`

2020-04-22 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-8556:
-

 Summary: [R] Installation fails with `LIBARROW_MINIMAL=false`
 Key: ARROW-8556
 URL: https://issues.apache.org/jira/browse/ARROW-8556
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.17.0
 Environment: Ubuntu 19.10
R 3.6.1
Reporter: Karl Dunkle Werner


I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
Prebuilt binaries are unavailable, and I want to enable compression, so I set 
the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
like the package is able to compile, but can't be loaded. I'm able to install 
correctly if I don't set the {{LIBARROW_MINIMAL}} variable.

Here's the error I get:
{code:java}
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
  ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
ZSTD_initCStream
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘~/.R/3.6/arrow’
{code}
 





[jira] [Commented] (ARROW-7789) [R] Unknown error when using arrow::write_feather()  in R 3.5.3

2020-02-26 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045843#comment-17045843
 ] 

Karl Dunkle Werner commented on ARROW-7789:
---

Reading hits the same issues.

> [R] Unknown error when using arrow::write_feather()  in R 3.5.3
> ---
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Minor
>
> Unknown error when using arrow::write_feather()  in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
>  arrow::write_feather(x = pb, sink = pbFilename)
> >Error in exists(name, envir = envir, inherits = FALSE) : 
>  > use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’





[jira] [Commented] (ARROW-7789) [R] Unknown error when using arrow::write_feather()  in R 3.5.3

2020-02-26 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045840#comment-17045840
 ] 

Karl Dunkle Werner commented on ARROW-7789:
---

I'm getting the same error when the R.oo package is loaded (not even attached). 
Here's a reprex:

 
{code:r}
loadNamespace("R.oo")
#> 
arrow::write_parquet(mtcars, tempfile())
#> Error in exists(name, envir = envir, inherits = FALSE): use of NULL environment is defunct


sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 19.10

#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.7.so

#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base

#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.0.0  bit_1.1-15.2      compiler_3.6.1    magrittr_1.5
#>  [5] assertthat_0.2.1  R6_2.4.1          glue_1.3.1        Rcpp_1.0.3
#>  [9] bit64_0.9-7       vctrs_0.2.3       R.methodsS3_1.8.0 arrow_0.16.0.2
#> [13] rlang_0.4.4       R.oo_1.23.0       purrr_0.3.3

{code}

> [R] Unknown error when using arrow::write_feather()  in R 3.5.3
> ---
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Minor
>
> Unknown error when using arrow::write_feather()  in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
>  arrow::write_feather(x = pb, sink = pbFilename)
> >Error in exists(name, envir = envir, inherits = FALSE) : 
>  > use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’





[jira] [Created] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data

2019-12-06 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-7345:
-

 Summary: [Python] Writing partitions with NaNs silently drops data
 Key: ARROW-7345
 URL: https://issues.apache.org/jira/browse/ARROW-7345
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
Reporter: Karl Dunkle Werner


When writing a partitioned table, if the partitioning column has NA values, 
the rows with those values are silently dropped. I think it would be helpful if 
there were a warning. Even better, from my perspective, would be writing out 
those partitions with a directory name like {{partition_col=NaN}}. 

Here's a small example where only the {{b = 2}} group is written out and the 
{{b = NaN}} group is dropped.

{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path

# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    tf = Path(tf.name)
    tf.write_text(json_str)
    table = pyarrow.json.read_json(tf)

# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)
    print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")

# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}
 
It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} 
here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
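
To make the suggestion concrete, here is a minimal sketch in plain Python of the two behaviors: dropping rows whose partition value is missing, versus keeping them under a {{partition_col=NaN}} directory name. This is not pyarrow's code; {{partition_rows}} and {{keep_missing}} are hypothetical names used only for illustration.

{code:python}
import math
from collections import defaultdict


def partition_rows(rows, col, keep_missing=True):
    """Group rows by the value of `col`, returning {directory_name: [rows]}.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row.get(col)
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing:
            if not keep_missing:
                continue  # mimics the silent drop described in this issue
            name = f"{col}=NaN"
        else:
            name = f"{col}={value}"
        groups[name].append(row)
    return dict(groups)


rows = [{"a": 1, "b": 2.0}, {"a": 2, "b": None}]
dropped = partition_rows(rows, "b", keep_missing=False)
kept = partition_rows(rows, "b", keep_missing=True)
print(sorted(dropped))
print(sorted(kept))
{code}

With {{keep_missing=True}} every input row ends up in some partition, so the number of rows written matches the number read back.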





[jira] [Commented] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs

2019-11-03 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965692#comment-16965692
 ] 

Karl Dunkle Werner commented on ARROW-7035:
---

I'd be happy to write a PR! I'll have time to work on it in a few weeks.

> [R] Default arguments are unclear in write_parquet docs
> ---
>
> Key: ARROW-7035
> URL: https://issues.apache.org/jira/browse/ARROW-7035
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.15.0
> Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 
> 0.15.0.
>Reporter: Karl Dunkle Werner
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0.0
>
>
> Thank you so much for adding support for reading and writing parquet files in 
> R! I have a few questions about the user interface and optional arguments, 
> but I want to highlight how great it is to have this useful filetype to pass 
> data back and forth.
> The defaults for the optional arguments in {{arrow::write_parquet}} aren't 
> always clear. Here were my questions after reading the help docs from 
> {{write_parquet}}:
>  * What's the default {{version}}? Should a user prefer "2.0" for new 
> projects?
>  * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, 
> {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
>  * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least 
> some of the time.
>  * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
>  * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?
> As someone who works in both R and Python, I was a little surprised that 
> pyarrow uses snappy compression by default, but R's default is uncompressed. 
> My preference would be having the same default arguments, but that might be a 
> fringe use-case.
> While I was digging into this, I was surprised that 
> {{ParquetReaderProperties}} is exported and documented, but 
> {{ParquetWriterProperties}} isn't. Is that intentional?
> Thanks!





[jira] [Created] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs

2019-10-30 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-7035:
-

 Summary: [R] Default arguments are unclear in write_parquet docs
 Key: ARROW-7035
 URL: https://issues.apache.org/jira/browse/ARROW-7035
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.0
 Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 
0.15.0.
Reporter: Karl Dunkle Werner
 Fix For: 0.15.1


Thank you so much for adding support for reading and writing parquet files in 
R! I have a few questions about the user interface and optional arguments, but 
I want to highlight how great it is to have this useful filetype to pass data 
back and forth.

The defaults for the optional arguments in {{arrow::write_parquet}} aren't 
always clear. Here were my questions after reading the help docs from 
{{write_parquet}}:
 * What's the default {{version}}? Should a user prefer "2.0" for new projects?
 * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, 
{{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
 * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least 
some of the time.
 * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
 * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?

As someone who works in both R and Python, I was a little surprised that 
pyarrow uses snappy compression by default, but R's default is uncompressed. My 
preference would be having the same default arguments, but that might be a 
fringe use-case.

While I was digging into this, I was surprised that {{ParquetReaderProperties}} 
is exported and documented, but {{ParquetWriterProperties}} isn't. Is that 
intentional?

Thanks!





[jira] [Created] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-05 Thread Karl Dunkle Werner (JIRA)
Karl Dunkle Werner created ARROW-6142:
-

 Summary: [R] Install instructions on linux could be clearer
 Key: ARROW-6142
 URL: https://issues.apache.org/jira/browse/ARROW-6142
 Project: Apache Arrow
  Issue Type: Wish
  Components: R
Affects Versions: 0.14.1
 Environment: Ubuntu 19.04
Reporter: Karl Dunkle Werner
 Fix For: 0.15.0


Installing R packages on Linux is almost always from source, which means Arrow 
needs some system dependencies. The existing help message (from 
arrow::install_arrow()) is very helpful in pointing that out, but it's still a 
heavy lift for users who install R packages from source but don't plan to 
develop Arrow itself.

Here are a couple of things that could make things slightly smoother:
 # I would be very grateful if the install_arrow() message or installation page 
told me which libraries were essential to make the R package work.
 # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on 
launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" 
instead of just "PPA" would have caused me less confusion. (Others may differ.)
 # A snap package would be easier than adding a new apt source, but I 
understand that building for snap would be more packaging work and would only 
benefit Ubuntu users.

 

Thanks for making R bindings, and congratulations on the CRAN release!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-06-02 Thread Karl Dunkle Werner (JIRA)
Karl Dunkle Werner created ARROW-5480:
-

 Summary: [Python] Pandas categorical type doesn't survive a 
round-trip through parquet
 Key: ARROW-5480
 URL: https://issues.apache.org/jira/browse/ARROW-5480
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.13.0, 0.11.1
 Environment: python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0

Reporter: Karl Dunkle Werner


A string categorical variable written from pandas to parquet is read back as a 
string (object dtype). I expected it to be read as category.
The same thing happens if the category is numeric -- a numeric category is read 
back as int64.

In the code below, I tried out an in-memory arrow Table, which successfully 
translates categories back to pandas. However, when I write to a parquet file, 
the category dtype is not preserved.

In the scheme of things, this isn't a big deal, but it's a small surprise.


{code:python}
import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
{code}
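
For context on what is lost, here is a rough sketch in plain Python of the idea behind a categorical column: a list of distinct categories plus integer codes. It says nothing about how pandas, pyarrow, or parquet actually store the data; {{dictionary_encode}} and {{dictionary_decode}} are hypothetical names for illustration. Decoding gives back the plain values, which is similar to what I see after the parquet round trip: the values survive, but nothing marks the column as categorical.

{code:python}
def dictionary_encode(values):
    """Split values into (categories, codes), like a categorical column."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    codes = [index[v] for v in values]
    return categories, codes


def dictionary_decode(categories, codes):
    """Rebuild the plain values; the result carries no categorical marker."""
    return [categories[c] for c in codes]


values = ["a", "a", "b", "b"]
categories, codes = dictionary_encode(values)
plain = dictionary_decode(categories, codes)
print(categories, codes, plain)
{code}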







--
This message was sent by Atlassian JIRA
(v7.6.3#76005)