Nelson Areal created ARROW-15201:
Summary: Problem counting number of records of a parquet dataset
created using Spark
Key: ARROW-15201
URL: https://issues.apache.org/jira/browse/ARROW-15201
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 6.0.1
Reporter: Nelson Areal
When I open a dataset of Parquet files created by Spark, I cannot get a count of
the number of records: the process hangs with 100% CPU usage.
If I use DuckDB (to_duckdb) to perform the count, the operation completes as
expected.
The example below reproduces the problem:
{code:r}
library(tidyverse) # v 1.3.1
library(arrow) # v 6.0.1
library(duckdb) # v 0.3.1-1
library(sparklyr) # v 1.7.3
# Using Spark: 3.0.0, but the same occurs when using Spark 2.4
sc <- spark_connect(master = "local")
# Create a simple data frame and save it to parquet using Spark
test_df <- tibble(a = 1:10e6)
test_spark_tbl <- copy_to(sc, test_df)
spark_write_parquet(test_spark_tbl, path="test")
test_arrow_ds <- open_dataset(sources = "test")
# This works as expected
system.time(
  test_arrow_ds %>%
    to_duckdb() %>%
    count()
)
# user system elapsed
# 0.039 0.040 0.065
# The following will hang the process with 100% CPU usage
test_arrow_ds %>%
  count() %>%
  collect()
{code}
The session information:
{noformat}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sparklyr_1.7.3 duckdb_0.3.1-1 DBI_1.1.2 arrow_6.0.1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[9] readr_2.1.1 tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
[13] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lubridate_1.8.0 forge_0.2.0 rprojroot_2.0.2
[5] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 R6_2.5.1
[9] cellranger_1.1.0 backports_1.4.1 reprex_2.0.1 evaluate_0.14
[13] httr_1.4.2 pillar_1.6.4 rlang_0.4.12 readxl_1.3.1
[17] rstudioapi_0.13 blob_1.2.2 rmarkdown_2.11 htmlwidgets_1.5.4
[21] r2d3_0.2.5 bit_4.0.4 munsell_0.5.0 broom_0.7.10
[25] compiler_4.1.2 modelr_0.1.8 xfun_0.29 pkgconfig_2.0.3
[29] base64enc_0.1-3 htmltools_0.5.2 tidyselect_1.1.1 fansi_0.5.0
[33] crayon_1.4.2 tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.3
[37] grid_4.1.2 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.1
[41] magrittr_2.0.1 scales_1.1.1 cli_3.1.0 stringi_1.7.6
[45] fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.1
[49] vctrs_0.3.8 tools_4.1.2 bit64_4.0.5 glue_1.6.0
[53] hms_1.1.1 fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2
[57] rvest_1.0.2 knitr_1.37 haven_2.4.3
{noformat}
I can also reproduce this on a Linux machine.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)