Lucas Mation created ARROW-18372:
------------------------------------

             Summary: [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
                 Key: ARROW-18372
                 URL: https://issues.apache.org/jira/browse/ARROW-18372
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 10.0.0
            Reporter: Lucas Mation

I have a large Parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset. Notice that the "collected" dataset is supposed to be only one row and one cell, containing the count (I've confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and it works). That is why the error below is so weird:

```
fa <- 'myparteq folder' # huge
va <- open_dataset(fa)
tic()
d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect
toc()

Error in `collect()`:
! Invalid: negative malloc size
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
 1. ... %>% collect
 3. arrow:::collect.arrow_dplyr_query(.)
Run `rlang::last_trace()` to see the full context.

> rlang::last_trace()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
    x
 1. +-... %>% collect
 2. +-dplyr::collect(.)
 3. \-arrow:::collect.arrow_dplyr_query(.)
 4.   \-base::tryCatch(...)
 5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.         \-value[[3L]](cond)
 8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
 9.             \-rlang::abort(msg, call = call)
```

I am running this on a Windows server with 512 GB of RAM.

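To illustrate the expected shape of the result, here is a minimal sketch of the same two-step count on a small in-memory tibble. The `toy` data is hypothetical and stands in for the real dataset; only dplyr (1.0 or later) is assumed, and Arrow is not involved. Whatever the input size, the output should be a single row with a single column.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  id1 = c(1, 1, 2, 2, 2),
  id2 = c("a", "a", "b", "b", "c"),
  id3 = c(10, 10, 20, 20, 20),
  id4 = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# The first count() yields one row per unique id combination;
# the second count() reduces that to a one-row, one-column result.
res <- toy %>%
  count(id1, id2, id3, id4) %>%
  count()

# Cross-check: number of distinct id combinations computed directly
n_distinct_rows <- nrow(distinct(toy, id1, id2, id3, id4))

print(dim(res))
print(n_distinct_rows)
```

The cross-check with `distinct()` is only there to make the intent explicit: the quantity of interest is the number of distinct (id1, id2, id3, id4) tuples.
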
sessionInfo()

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
 [9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0
[17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
 [8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0
[15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3
[22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
[29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
[36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3
[43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0
[50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1

arrow_info()

Arrow package version: 10.0.0

Capabilities:

dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():

arrow.use_threads FALSE

Memory:

Allocator mimalloc
Current   74.82 Gb
Max       97.75 Gb

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  10.0.0
C++ Compiler         GNU
C++ Compiler Version 10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

--
This message was sent by Atlassian Jira
(v8.20.10#820010)