mrd0ll4r commented on issue #46814: URL: https://github.com/apache/arrow/issues/46814#issuecomment-3088137273
Sorry, didn't get to do anything yesterday.

I've reproduced this on the smaller dataset (2.7G, but many more parquet files):

```
> open_dataset("data/bluesky/labeler_logs_clean_parquet") %>% group_by(uri) %>% tally()
FileSystemDataset (query)
uri: string
n: int64

See $.data for the source Arrow object
> open_dataset("data/bluesky/labeler_logs_clean_parquet") %>% group_by(uri) %>% tally() %>% collect()
# A tibble: 48,794,031 × 2

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__first=__first@entry=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>, __last=__last@entry=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __pred=__pred@entry=...) at /usr/include/c++/12/bits/predefined_ops.h:269
269 operator()(_Iterator __it)
(gdb) bt
#0 std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__first=__first@entry=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>, __last=__last@entry=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __pred=__pred@entry=...) at /usr/include/c++/12/bits/predefined_ops.h:269
#1 0x00007fffeae3edd7 in std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__pred=..., __last=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __first=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>) at /usr/include/c++/12/bits/stl_algobase.h:2112
#2 std::find<char const*, char> (__val=@0x7fffffff4f9f: 0 '\000', __last=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __first=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>) at /usr/include/c++/12/bits/stl_algo.h:3851
#3 arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::RStringViewer::Convert (i=0, this=0x7fffed4b7940 <arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::string_viewer()::string_viewer>) at altrep.cpp:808
#4 arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::Materialize (alt=0x555577947eb8) at altrep.cpp:938
#5 0x00007fffeae3f089 in arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::Dataptr (alt=<optimized out>, writeable=<optimized out>) at altrep.cpp:917
#6 0x00007ffff7a80aac in ALTVEC_DATAPTR () from /usr/lib/R/lib/libR.so
#7 0x00007fffee90f3e8 in r_chr_cbegin (x=<optimized out>) at ./rlang/vec.h:50
#8 chr_slice (x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8, materialize=VCTRS_MATERIALIZE_false) at slice.c:161
#9 0x00007fffee90f941 in vec_slice_base (type=type@entry=VCTRS_TYPE_character, x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8, materialize=materialize@entry=VCTRS_MATERIALIZE_false) at slice.c:255
#10 0x00007fffee91068b in vec_slice_unsafe (x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8) at slice.c:345
#11 0x00007fffee91059b in df_slice (subscript=0x5555775b80d8, x=0x55557809ca88) at slice.c:194
#12 vec_slice_unsafe (x=x@entry=0x55557809ca88, subscript=0x5555775b80d8) at slice.c:359
#13 0x00007fffee9109cc in vec_slice_opts (x=0x55557809ca88, i=<optimized out>, opts=opts@entry=0x7fffffff51d0) at slice.c:421
#14 0x00007fffee910a38 in ffi_slice (x=<optimized out>, i=<optimized out>, frame=<optimized out>) at slice.c:406
#15 0x00007ffff7b02b1e in ?? () from /usr/lib/R/lib/libR.so
#16 0x00007ffff7b48c24 in ?? () from /usr/lib/R/lib/libR.so
#17 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
#18 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
#19 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
#20 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
#21 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
#22 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
#23 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
#24 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
#25 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
#26 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
#27 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
#28 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
#29 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
#30 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
#31 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
#32 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
#33 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
#34 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
#35 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
#36 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
#37 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
#38 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
#39 0x00007ffff7b508ce in Rf_eval () from /usr/lib/R/lib/libR.so
#40 0x00007ffff7bc3a28 in ?? () from /usr/lib/R/lib/libR.so
#41 0x00007ffff7bc7fe2 in ?? () from /usr/lib/R/lib/libR.so
#42 0x00007ffff7b8a675 in ?? () from /usr/lib/R/lib/libR.so
#43 0x00007ffff7b8a8d0 in ?? () from /usr/lib/R/lib/libR.so
#44 0x00007ffff7b8a988 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#45 0x000055555555507b in main ()
#46 0x00007ffff784624a in __libc_start_call_main (main=main@entry=0x555555555060 <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe1c8) at ../sysdeps/nptl/libc_start_call_main.h:58
#47 0x00007ffff7846305 in __libc_start_main_impl (main=0x555555555060 <main>, argc=1, argv=0x7fffffffe1c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe1b8) at ../csu/libc-start.c:360
#48 0x00005555555550b1 in _start ()
```

Does this one also work for you? If so, I can upload the dataset to a server of mine and send you a link.
I didn't have any time to see whether I can reduce its size, or pack it into a single parquet file, and still have it work, sorry :(

I tried messaging you on bsky, but that didn't work :(
Let me know how I can get the link to you and I'll send it!

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org