mrd0ll4r commented on issue #46814:
URL: https://github.com/apache/arrow/issues/46814#issuecomment-3088137273

   Sorry, didn't get to do anything yesterday.
   I've reproduced this on the smaller dataset (2.7G, but many more Parquet files):
   ```
   > open_dataset("data/bluesky/labeler_logs_clean_parquet") %>% group_by(uri) %>% tally()
   FileSystemDataset (query)
   uri: string
   n: int64
   
   See $.data for the source Arrow object
   > open_dataset("data/bluesky/labeler_logs_clean_parquet") %>% group_by(uri) %>% tally() %>% collect()
   # A tibble: 48,794,031 × 2
   
   Thread 1 "R" received signal SIGSEGV, Segmentation fault.
   std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__first=__first@entry=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>,
       __last=__last@entry=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __pred=__pred@entry=...) at /usr/include/c++/12/bits/predefined_ops.h:269
   269             operator()(_Iterator __it)
   (gdb) bt
   #0  std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__first=__first@entry=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>,
       __last=__last@entry=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>, __pred=__pred@entry=...) at /usr/include/c++/12/bits/predefined_ops.h:269
   #1  0x00007fffeae3edd7 in std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > (__pred=..., __last=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>,
       __first=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>) at /usr/include/c++/12/bits/stl_algobase.h:2112
   #2  std::find<char const*, char> (__val=@0x7fffffff4f9f: 0 '\000', __last=0x7ffd08477d6c <error: Cannot access memory at address 0x7ffd08477d6c>,
       __first=0x7ffd08477d26 <error: Cannot access memory at address 0x7ffd08477d26>) at /usr/include/c++/12/bits/stl_algo.h:3851
   #3  arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::RStringViewer::Convert (i=0,
       this=0x7fffed4b7940 <arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::string_viewer()::string_viewer>) at altrep.cpp:808
   #4  arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::Materialize (alt=0x555577947eb8) at altrep.cpp:938
   #5  0x00007fffeae3f089 in arrow::r::altrep::(anonymous namespace)::AltrepVectorString<arrow::StringType>::Dataptr (alt=<optimized out>, writeable=<optimized out>) at altrep.cpp:917
   #6  0x00007ffff7a80aac in ALTVEC_DATAPTR () from /usr/lib/R/lib/libR.so
   #7  0x00007fffee90f3e8 in r_chr_cbegin (x=<optimized out>) at ./rlang/vec.h:50
   #8  chr_slice (x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8, materialize=VCTRS_MATERIALIZE_false) at slice.c:161
   #9  0x00007fffee90f941 in vec_slice_base (type=type@entry=VCTRS_TYPE_character, x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8, materialize=materialize@entry=VCTRS_MATERIALIZE_false)
       at slice.c:255
   #10 0x00007fffee91068b in vec_slice_unsafe (x=x@entry=0x555577947eb8, subscript=subscript@entry=0x5555775b80d8) at slice.c:345
   #11 0x00007fffee91059b in df_slice (subscript=0x5555775b80d8, x=0x55557809ca88) at slice.c:194
   #12 vec_slice_unsafe (x=x@entry=0x55557809ca88, subscript=0x5555775b80d8) at slice.c:359
   #13 0x00007fffee9109cc in vec_slice_opts (x=0x55557809ca88, i=<optimized out>, opts=opts@entry=0x7fffffff51d0) at slice.c:421
   #14 0x00007fffee910a38 in ffi_slice (x=<optimized out>, i=<optimized out>, frame=<optimized out>) at slice.c:406
   #15 0x00007ffff7b02b1e in ?? () from /usr/lib/R/lib/libR.so
   #16 0x00007ffff7b48c24 in ?? () from /usr/lib/R/lib/libR.so
   #17 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #18 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #19 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #20 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #21 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #22 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #23 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #24 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #25 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #26 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #27 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #28 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #29 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #30 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #31 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #32 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #33 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #34 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #35 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #36 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #37 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #38 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #39 0x00007ffff7b508ce in Rf_eval () from /usr/lib/R/lib/libR.so
   #40 0x00007ffff7bc3a28 in ?? () from /usr/lib/R/lib/libR.so
   #41 0x00007ffff7bc7fe2 in ?? () from /usr/lib/R/lib/libR.so
   #42 0x00007ffff7b8a675 in ?? () from /usr/lib/R/lib/libR.so
   #43 0x00007ffff7b8a8d0 in ?? () from /usr/lib/R/lib/libR.so
   #44 0x00007ffff7b8a988 in run_Rmainloop () from /usr/lib/R/lib/libR.so
   #45 0x000055555555507b in main ()
   #46 0x00007ffff784624a in __libc_start_call_main (main=main@entry=0x555555555060 <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe1c8) at ../sysdeps/nptl/libc_start_call_main.h:58
   #47 0x00007ffff7846305 in __libc_start_main_impl (main=0x555555555060 <main>, argc=1, argv=0x7fffffffe1c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe1b8)
       at ../csu/libc-start.c:360
   #48 0x00005555555550b1 in _start ()
   ```
   
   Does this one also work for you? If so, I can upload the dataset to a server of mine and send you a link. I haven't had time to check whether I can shrink it, or pack it into a single Parquet file, and still have it reproduce the crash, sorry :(
   I tried messaging you on bsky, but that didn't work :( Let me know how I can get the link to you and I'll send it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

