mrd0ll4r opened a new issue, #46814:
URL: https://github.com/apache/arrow/issues/46814
### Describe the bug, including details regarding any error messages, version, and platform.
Hello, it's me again with big-data R segfaults :)
I have a hive-partitioned dataset of approximately 8 GB, consisting of `8537` Parquet files. I
can probably share the dataset.
I'm executing this query:
```r
open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
group_by(uri) %>%
tally() %>%
filter(n==1) %>%
tally() %>%
collect()
```
which throws:
```
*** caught segfault ***
address 0x7f0634a5e2e8, cause 'memory not mapped'
*** caught segfault ***
address 0x7f063441a2d5, cause 'memory not mapped'
Traceback:
1: Table__from_ExecPlanReader(self)
2: x$read_table()
3: as_arrow_table.RecordBatchReader(reader)
4: as_arrow_table(reader)
5: as_arrow_table.arrow_dplyr_query(x)
6: as_arrow_table(x)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)Segmentation fault
```
Unfortunately, I didn't get a core dump this time, and I don't know why.
Another query got as far as computing the number of rows and columns, but
also segfaulted:
```r
open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
group_by(uri) %>%
tally() %>%
collect()
```
It gets as far as printing this header:
```
# A tibble: 62,642,379 × 2
```
and segfaults like so:
```
*** caught segfault ***
address 0x7ff004949d34, cause 'memory not mapped'
Traceback:
1: vec_slice(x, seq_len(n))
2: vec_head(as.data.frame(x), n)
3: df_head(x, n)
4: tbl_format_setup.tbl(x, width, ..., setup = setup, n = n, max_extra_cols
= max_extra_cols, max_footer_lines = max_footer_lines, focus = focus)
5: tbl_format_setup_dispatch(x, width, ..., setup = setup, n = n,
max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus
= focus)
6: tbl_format_setup(x, width = width, ..., setup = setup, n = n,
max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus
= attr(x, "pillar_focus"))
7: format_tbl(x, width = width, ..., n = n, max_extra_cols =
max_extra_cols, max_footer_lines = max_footer_lines, transform = writeLines)
8: print_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,
max_footer_lines = max_footer_lines)
9: print.tbl(x)
10: (function (x, ...) UseMethod("print"))(x)
```
The second crash is less surprising: the result is a tibble with roughly 62.6 million
rows, and the traceback shows the segfault happening while that tibble is being printed.
The first query, however, returns essentially a scalar, so it should not fail.
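To make the expected result shapes concrete, here is a minimal sketch that runs both pipelines on a tiny synthetic dataset. The `toy` data frame, the `toy_dir` path, and the variable names are hypothetical stand-ins and not part of the real dataset. The second pipeline uses `compute()` instead of `collect()`, so the result stays an Arrow Table and its size can be checked with `nrow()` without printing a tibble. Whether that sidesteps the crash on the real data is untested.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data (not the real dataset): seven rows, four distinct URIs.
toy <- data.frame(
  uri  = c("a", "a", "b", "c", "c", "c", "d"),
  year = c(2024L, 2024L, 2024L, 2025L, 2025L, 2025L, 2025L)
)

# Write it as a hive-partitioned Parquet dataset into a temporary directory.
toy_dir <- tempfile("toy_dataset_")
write_dataset(toy, toy_dir, format = "parquet", partitioning = "year")

# Same pipeline as the first query: count the URIs that occur exactly once.
singletons <- open_dataset(toy_dir) %>%
  group_by(uri) %>%
  tally() %>%
  filter(n == 1) %>%
  tally() %>%
  collect()
result_shape <- dim(singletons)  # one row, one column

# Second query, but with compute() so the result stays an Arrow Table
# and nothing is converted to a tibble or printed.
per_uri <- open_dataset(toy_dir) %>%
  group_by(uri) %>%
  tally() %>%
  compute()
per_uri_rows <- nrow(per_uri)  # one row per distinct URI
```

On the synthetic data the first pipeline collapses to a single cell, which is the sense in which the first query is "essentially a scalar".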
The Parquet files were originally produced by DuckDB.
This is the schema:
```r
> open_dataset("data/bluesky/labeler_logs_dirty_parquet")
FileSystemDataset with 8537 Parquet files
12 columns
dom: int64
seq: int64
ts: timestamp[us, tz=UTC]
src: string
neg: bool
val: string
uri: string
cid: string
ver: int64
labeler_host: string
year: int32
month: int32
```
#### Additional Info
Machine overview:
```
Memory: 378 GB
CPU: 64x Intel(R) Xeon(R) Gold 6154
OS: Debian 12
```
R `sessionInfo()`:
```r
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version
3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] paletteer_1.6.0 ggplot2_3.5.2 viridis_0.6.5 viridisLite_0.4.2
[5] pracma_2.4.4 xtable_1.8-4 forcats_1.0.0 readr_2.1.5
[9] arrow_20.0.0 tidyr_1.3.1 stringr_1.5.1 lubridate_1.9.4
[13] dplyr_1.1.4
loaded via a namespace (and not attached):
[1] bit_4.6.0 gtable_0.3.6 rematch2_2.1.2 compiler_4.5.0
[5] renv_1.0.3 tidyselect_1.2.1 parallel_4.5.0
assertthat_0.2.1
[9] gridExtra_2.3 scales_1.4.0 R6_2.6.1 generics_0.1.4
[13] tibble_3.3.0 RColorBrewer_1.1-3 pillar_1.10.2 tzdb_0.5.0
[17] rlang_1.1.6 stringi_1.8.7 bit64_4.6.0-1
timechange_0.3.0
[21] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 grid_4.5.0
[25] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 glue_1.8.0
[29] farver_2.1.2 purrr_1.0.4 tools_4.5.0 pkgconfig_2.0.3
```
`lsb_release -a`:
```
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 12 (bookworm)
Release: 12
Codename: bookworm
```
`uname -a`:
```
Linux <redacted> 6.1.0-34-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.135-1
(2025-04-25) x86_64 GNU/Linux
```
`cat /proc/cpuinfo` (truncated):
```
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
stepping : 4
microcode : 0x2007108
cpu MHz : 2992.968
cache size : 16384 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni
pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow
vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2
erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd
avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear
flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only
ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid
unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
mds swapgs taa mmio_stale_data retbleed gds bhi ibpb_no_ret
bogomips : 5985.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
```
### Component(s)
R