mrd0ll4r commented on issue #46814:
URL: https://github.com/apache/arrow/issues/46814#issuecomment-3078813679

   I've tried to roughly bisect the dataset.
   The original has `8537` parquet files and crashes.
   I then randomly removed approximately half of the directories (whole 
directories, so as to keep complete datasets per partition) like so:
   ```
   find labeler_logs_dirty_parquet_experiment/ -mindepth 3 -type d | shuf | head -n 800 | xargs rm -r
   ```
   (the original dataset has about 1600 directories)
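   
   A variant of the command above that makes each experiment reproducible could look like the following. This is only a sketch: the function name `remove_random_dirs`, the log file, and the stash directory are hypothetical, and it assumes GNU `shuf` and that every directory at depth 3 is a leaf partition directory. It records which directories were picked and moves them aside instead of deleting them, so two runs can be compared afterwards.
   
   ```shell
   #!/usr/bin/env bash
   set -euo pipefail
   
   # Hypothetical helper: pick COUNT random partition directories under ROOT,
   # write their paths to LOG, and move them into STASH instead of deleting them.
   # Assumes directories at depth 3 are leaves (no nested matches).
   remove_random_dirs() {
     local root="$1" count="$2" log="$3" stash="$4"
     mkdir -p "$stash"
     # shuf -n avoids the early-exit pipe from `shuf | head`, which can trip pipefail.
     find "$root" -mindepth 3 -type d | shuf -n "$count" > "$log"
     while IFS= read -r dir; do
       # Recreate the parent path under the stash so the move can be undone.
       mkdir -p "$stash/$(dirname "$dir")"
       mv "$dir" "$stash/$dir"
     done < "$log"
   }
   ```
   
   With a log per run, something like `comm -3 <(sort run_a.txt) <(sort run_b.txt)` would show which directories differ between two experiments, and restoring the dataset is a matter of moving the stashed directories back.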
   
   `800` directories removed, `4463` parquet files: 
   ```
   [... lots of threads ...]
   # A tibble: 32,729,473 × 2
      uri       n
      <chr> <int>
   [... valid data ...]
   # ℹ 32,729,463 more rows
   # ℹ Use `print(n = ...)` to see more rows
   
   (no crash)
   ```
   
   I restored the dataset to full and removed `400` directories this time.
   `6565` parquet files:
   ```
   [... lots of threads ...]
   # A tibble: 45,518,144 × 2
   
   Thread 1 "R" received signal SIGSEGV, Segmentation fault.
   std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > 
(__first=__first@entry=0x7ffd6c020030 <error: Cannot access memory at address 
0x7ffd6c020030>,
       __last=__last@entry=0x7ffd6c020076 <error: Cannot access memory at 
address 0x7ffd6c020076>, __pred=__pred@entry=...) at 
/usr/include/c++/12/bits/predefined_ops.h:269
   269             operator()(_Iterator __it)
   
   (gdb) bt
   #0  std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char 
const> > (__first=__first@entry=0x7ffd6c020030 <error: Cannot access memory at 
address 0x7ffd6c020030>,
       __last=__last@entry=0x7ffd6c020076 <error: Cannot access memory at 
address 0x7ffd6c020076>, __pred=__pred@entry=...) at 
/usr/include/c++/12/bits/predefined_ops.h:269
   #1  0x00007fffeae3edd7 in std::__find_if<char const*, 
__gnu_cxx::__ops::_Iter_equals_val<char const> > (__pred=..., 
__last=0x7ffd6c020076 <error: Cannot access memory at address 0x7ffd6c020076>,
       __first=0x7ffd6c020030 <error: Cannot access memory at address 
0x7ffd6c020030>) at /usr/include/c++/12/bits/stl_algobase.h:2112
   #2  std::find<char const*, char> (__val=@0x7fffffff4f9f: 0 '\000', 
__last=0x7ffd6c020076 <error: Cannot access memory at address 0x7ffd6c020076>,
       __first=0x7ffd6c020030 <error: Cannot access memory at address 
0x7ffd6c020030>) at /usr/include/c++/12/bits/stl_algo.h:3851
   #3  arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::RStringViewer::Convert 
(i=17940,
       this=0x7fffed4b7940 <arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::string_viewer()::string_viewer>)
 at altrep.cpp:808
   #4  arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::Materialize 
(alt=0x5555584a4c98) at altrep.cpp:938
   #5  0x00007fffeae3f089 in arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::Dataptr (alt=<optimized 
out>, writeable=<optimized out>) at altrep.cpp:917
   #6  0x00007ffff7a80aac in ALTVEC_DATAPTR () from /usr/lib/R/lib/libR.so
   #7  0x00007fffee90f3e8 in r_chr_cbegin (x=<optimized out>) at 
./rlang/vec.h:50
   #8  chr_slice (x=x@entry=0x5555584a4c98, 
subscript=subscript@entry=0x5555602ee428, materialize=VCTRS_MATERIALIZE_false) 
at slice.c:161
   #9  0x00007fffee90f941 in vec_slice_base 
(type=type@entry=VCTRS_TYPE_character, x=x@entry=0x5555584a4c98, 
subscript=subscript@entry=0x5555602ee428, 
materialize=materialize@entry=VCTRS_MATERIALIZE_false)
       at slice.c:255
   #10 0x00007fffee91068b in vec_slice_unsafe (x=x@entry=0x5555584a4c98, 
subscript=subscript@entry=0x5555602ee428) at slice.c:345
   #11 0x00007fffee91059b in df_slice (subscript=0x5555602ee428, 
x=0x5555636550e8) at slice.c:194
   #12 vec_slice_unsafe (x=x@entry=0x5555636550e8, subscript=0x5555602ee428) at 
slice.c:359
   #13 0x00007fffee9109cc in vec_slice_opts (x=0x5555636550e8, i=<optimized 
out>, opts=opts@entry=0x7fffffff51d0) at slice.c:421
   #14 0x00007fffee910a38 in ffi_slice (x=<optimized out>, i=<optimized out>, 
frame=<optimized out>) at slice.c:406
   #15 0x00007ffff7b02b1e in ?? () from /usr/lib/R/lib/libR.so
   #16 0x00007ffff7b48c24 in ?? () from /usr/lib/R/lib/libR.so
   #17 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #18 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #19 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #20 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #21 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #22 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #23 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #24 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #25 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #26 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #27 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #28 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #29 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #30 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #31 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #32 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #33 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #34 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #35 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #36 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #37 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #38 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #39 0x00007ffff7b508ce in Rf_eval () from /usr/lib/R/lib/libR.so
   #40 0x00007ffff7bc3a28 in ?? () from /usr/lib/R/lib/libR.so
   #41 0x00007ffff7bc7fe2 in ?? () from /usr/lib/R/lib/libR.so
   #42 0x00007ffff7b8a675 in ?? () from /usr/lib/R/lib/libR.so
   #43 0x00007ffff7b8a8d0 in ?? () from /usr/lib/R/lib/libR.so
   #44 0x00007ffff7b8a988 in run_Rmainloop () from /usr/lib/R/lib/libR.so
   #45 0x000055555555507b in main ()
   #46 0x00007ffff784624a in __libc_start_call_main 
(main=main@entry=0x555555555060 <main>, argc=argc@entry=1, 
argv=argv@entry=0x7fffffffe1c8) at ../sysdeps/nptl/libc_start_call_main.h:58
   #47 0x00007ffff7846305 in __libc_start_main_impl (main=0x555555555060 
<main>, argc=1, argv=0x7fffffffe1c8, init=<optimized out>, fini=<optimized 
out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe1b8)
       at ../csu/libc-start.c:360
   #48 0x00005555555550b1 in _start ()
   ```
   
   
   I restored the dataset to full and removed `600` directories this time.
   `5457` parquet files:
   ```
   [... lots of threads ...]
   # A tibble: 39,279,436 × 2
   
   Thread 1 "R" received signal SIGSEGV, Segmentation fault.
   std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char const> > 
(__first=__first@entry=0x7ffd886760ea <error: Cannot access memory at address 
0x7ffd886760ea>,
       __last=__last@entry=0x7ffd88676130 <error: Cannot access memory at 
address 0x7ffd88676130>, __pred=__pred@entry=...) at 
/usr/include/c++/12/bits/predefined_ops.h:269
   269             operator()(_Iterator __it)
   (gdb) bt
   #0  std::__find_if<char const*, __gnu_cxx::__ops::_Iter_equals_val<char 
const> > (__first=__first@entry=0x7ffd886760ea <error: Cannot access memory at 
address 0x7ffd886760ea>,
       __last=__last@entry=0x7ffd88676130 <error: Cannot access memory at 
address 0x7ffd88676130>, __pred=__pred@entry=...) at 
/usr/include/c++/12/bits/predefined_ops.h:269
   #1  0x00007fffeae3edd7 in std::__find_if<char const*, 
__gnu_cxx::__ops::_Iter_equals_val<char const> > (__pred=..., 
__last=0x7ffd88676130 <error: Cannot access memory at address 0x7ffd88676130>,
       __first=0x7ffd886760ea <error: Cannot access memory at address 
0x7ffd886760ea>) at /usr/include/c++/12/bits/stl_algobase.h:2112
   #2  std::find<char const*, char> (__val=@0x7fffffff4f9f: 0 '\000', 
__last=0x7ffd88676130 <error: Cannot access memory at address 0x7ffd88676130>,
       __first=0x7ffd886760ea <error: Cannot access memory at address 
0x7ffd886760ea>) at /usr/include/c++/12/bits/stl_algo.h:3851
   #3  arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::RStringViewer::Convert (i=0,
       this=0x7fffed4b7940 <arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::string_viewer()::string_viewer>)
 at altrep.cpp:808
   #4  arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::Materialize 
(alt=0x555557418f38) at altrep.cpp:938
   #5  0x00007fffeae3f089 in arrow::r::altrep::(anonymous 
namespace)::AltrepVectorString<arrow::StringType>::Dataptr (alt=<optimized 
out>, writeable=<optimized out>) at altrep.cpp:917
   #6  0x00007ffff7a80aac in ALTVEC_DATAPTR () from /usr/lib/R/lib/libR.so
   #7  0x00007fffee90f3e8 in r_chr_cbegin (x=<optimized out>) at 
./rlang/vec.h:50
   #8  chr_slice (x=x@entry=0x555557418f38, 
subscript=subscript@entry=0x55555ff4c308, materialize=VCTRS_MATERIALIZE_false) 
at slice.c:161
   #9  0x00007fffee90f941 in vec_slice_base 
(type=type@entry=VCTRS_TYPE_character, x=x@entry=0x555557418f38, 
subscript=subscript@entry=0x55555ff4c308, 
materialize=materialize@entry=VCTRS_MATERIALIZE_false)
       at slice.c:255
   #10 0x00007fffee91068b in vec_slice_unsafe (x=x@entry=0x555557418f38, 
subscript=subscript@entry=0x55555ff4c308) at slice.c:345
   #11 0x00007fffee91059b in df_slice (subscript=0x55555ff4c308, 
x=0x555562f38428) at slice.c:194
   #12 vec_slice_unsafe (x=x@entry=0x555562f38428, subscript=0x55555ff4c308) at 
slice.c:359
   #13 0x00007fffee9109cc in vec_slice_opts (x=0x555562f38428, i=<optimized 
out>, opts=opts@entry=0x7fffffff51d0) at slice.c:421
   #14 0x00007fffee910a38 in ffi_slice (x=<optimized out>, i=<optimized out>, 
frame=<optimized out>) at slice.c:406
   #15 0x00007ffff7b02b1e in ?? () from /usr/lib/R/lib/libR.so
   #16 0x00007ffff7b48c24 in ?? () from /usr/lib/R/lib/libR.so
   #17 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #18 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #19 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #20 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #21 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #22 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #23 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #24 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #25 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #26 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #27 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #28 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #29 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #30 0x00007ffff7b54139 in ?? () from /usr/lib/R/lib/libR.so
   #31 0x00007ffff7b9f4e0 in ?? () from /usr/lib/R/lib/libR.so
   #32 0x00007ffff7b9fb5b in ?? () from /usr/lib/R/lib/libR.so
   #33 0x00007ffff7b9fe1f in ?? () from /usr/lib/R/lib/libR.so
   #34 0x00007ffff7b43214 in ?? () from /usr/lib/R/lib/libR.so
   #35 0x00007ffff7b503ca in ?? () from /usr/lib/R/lib/libR.so
   #36 0x00007ffff7b50793 in Rf_eval () from /usr/lib/R/lib/libR.so
   #37 0x00007ffff7b529bf in ?? () from /usr/lib/R/lib/libR.so
   #38 0x00007ffff7b53717 in ?? () from /usr/lib/R/lib/libR.so
   #39 0x00007ffff7b508ce in Rf_eval () from /usr/lib/R/lib/libR.so
   #40 0x00007ffff7bc3a28 in ?? () from /usr/lib/R/lib/libR.so
   #41 0x00007ffff7bc7fe2 in ?? () from /usr/lib/R/lib/libR.so
   #42 0x00007ffff7b8a675 in ?? () from /usr/lib/R/lib/libR.so
   #43 0x00007ffff7b8a8d0 in ?? () from /usr/lib/R/lib/libR.so
   #44 0x00007ffff7b8a988 in run_Rmainloop () from /usr/lib/R/lib/libR.so
   #45 0x000055555555507b in main ()
   #46 0x00007ffff784624a in __libc_start_call_main 
(main=main@entry=0x555555555060 <main>, argc=argc@entry=1, 
argv=argv@entry=0x7fffffffe1c8) at ../sysdeps/nptl/libc_start_call_main.h:58
   #47 0x00007ffff7846305 in __libc_start_main_impl (main=0x555555555060 
<main>, argc=1, argv=0x7fffffffe1c8, init=<optimized out>, fini=<optimized 
out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe1b8)
       at ../csu/libc-start.c:360
   #48 0x00005555555550b1 in _start ()
   ```
   
   
   So it crashes with `5457` and `6565` parquet files, but not with `4463`. 
However, I'm not sure whether it's just the size of the dataset or whether 
there are particular parts of the dataset it doesn't like. I don't have an easy 
way to get the diff between the experiments, as I removed directories randomly 
each time, sorry.
   Let me know what you think. I could try digging and removing individual 
directories one by one, but this would take a while...
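   
   Rather than removing directories one at a time, a binary search over a fixed, ordered list of directories would need only about 11 runs for 1600 directories. The sketch below is an assumption-laden illustration, not a tested reproduction: `smallest_crashing_prefix` is a hypothetical helper, and the command passed to it is a hypothetical wrapper that builds a dataset from the listed directories, runs the query, and exits with status 0 when the crash reproduces. It also assumes the crash is monotonic, i.e. that adding directories to a crashing prefix never makes the crash go away.
   
   ```shell
   #!/usr/bin/env bash
   set -euo pipefail
   
   # Hypothetical helper: LIST is a file with one directory per line in a fixed order.
   # The remaining arguments form a command that receives a file of directories
   # and exits 0 if the crash reproduces with exactly those directories.
   # Prints the length of the smallest prefix of LIST that still crashes.
   smallest_crashing_prefix() {
     local list="$1"; shift
     local total lo hi mid subset
     total=$(wc -l < "$list")
     lo=0            # largest prefix assumed not to crash (empty dataset)
     hi=$((total))   # smallest prefix assumed to crash (full dataset)
     subset=$(mktemp)
     while (( hi - lo > 1 )); do
       mid=$(( (lo + hi) / 2 ))
       head -n "$mid" "$list" > "$subset"
       if "$@" "$subset"; then
         hi=$mid     # crash reproduced: the boundary is at or before mid
       else
         lo=$mid     # no crash: the boundary is after mid
       fi
     done
     rm -f "$subset"
     echo "$hi"
   }
   ```
   
   If the reported prefix length changes when the list is reordered, that would point towards total size rather than a specific directory; if the same directory always sits at the boundary, that directory is the one to inspect.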


