Lucas Mation created ARROW-18372:
------------------------------------

             Summary: [R] "Error in `collect()`: ! Invalid: negative malloc 
size" after large computation returning one cell
                 Key: ARROW-18372
                 URL: https://issues.apache.org/jira/browse/ARROW-18372
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 10.0.0
            Reporter: Lucas Mation


I have a large parquet file 900 million rows , 40cols parquet file, subdivided 
into folders for each year. I was trying to calculate how many unique 
combinations of id1+id2+id3+id4 there are in the dataset.

 

Notice that the "collected" dataset is supposed to be only one row and one cel, 
containing the count (I've confirmed this by subseting the dataset ("%>% 
head(10^6)" ) before computing the count, and it works). That is why the error 
below is so weird

```

fa <- 'myparteq folder' #huge 

va <- open_dataset(fa)

tic()
d <- va  %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect

toc()

 

Error in `collect()`:
! Invalid: negative malloc size
Run `rlang::last_error()` to see where the error occurred.

 

> rlang::last_error()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
 1. ... %>% collect
 3. arrow:::collect.arrow_dplyr_query(.)
Run `rlang::last_trace()` to see the full context.

 

> rlang::last_trace()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
    x
 1. +-... %>% collect
 2. +-dplyr::collect(.)
 3. \-arrow:::collect.arrow_dplyr_query(.)
 4.   \-base::tryCatch(...)
 5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.         \-value[[3L]](cond)
 8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
 9.             \-rlang::abort(msg, call = call)

 

```

I am running this on a windows server, 512Gb of RAM.

 sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    
LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] arrow_10.0.0      data.table_1.14.4 forcats_0.5.2     dplyr_1.0.10      
purrr_0.3.5  readr_2.1.3       tidyr_1.2.1       tibble_3.1.8     
 [9] ggplot2_3.3.6     tidyverse_1.3.2   gt_0.7.0          xtable_1.8-4      
ggthemes_4.2.4    collapse_1.8.6    pryr_0.1.5        janitor_2.1.0    
[17] tictoc_1.1        lubridate_1.8.0   stringr_1.4.1     readxl_1.4.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          assertthat_0.2.1    digest_0.6.30       utf8_1.2.2     
     R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
 [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6    
     googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
[15] bit_4.0.4           munsell_0.5.0       broom_1.0.1         compiler_4.2.1 
     modelr_0.1.9        pkgconfig_2.0.3     htmltools_0.5.3    
[22] tidyselect_1.2.0    codetools_0.2-18    fansi_1.0.3         crayon_1.5.2   
     tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
[29] grid_4.2.1          jsonlite_1.8.3      gtable_0.3.1        
lifecycle_1.0.3     DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
[36] cli_3.4.1           stringi_1.7.8       fs_1.5.2            
snakecase_0.11.0    xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3     
[43] vctrs_0.5.0         tools_4.2.1         bit64_4.0.5         glue_1.6.2     
     hms_1.1.2           parallel_4.2.1      fastmap_1.1.0      
[50] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3         haven_2.5.1    

 

 arrow_info()
Arrow package version: 10.0.0

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current   74.82 Gb
Max       97.75 Gb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                                    10.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to