[GitHub] [arrow-rs] alamb commented on issue #47: [Parquet] Too many open files (os error 24)

GitBox Mon, 26 Apr 2021 04:24:09 -0700


alamb commented on issue #47:
URL: https://github.com/apache/arrow-rs/issues/47#issuecomment-826757214



   Comment from Chao Sun(csun) @ 2019-08-07T06:02:08.709+0000:
   <pre>Thanks for reporting. Do you have a rough idea how deep the nested data 
type is? Is there any error message? It would be great if we could reproduce 
this.</pre>
   
   Comment from Yesh(madras) @ 2019-08-07T11:35:10.840+0000:
   <pre>Thanks for the ack. Below is the error message. An additional data point: 
the schema can still be dumped via parquet-schema.
   {code:java}
   thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: 
General("underlying IO error: Too many open files (os error 24)")', 
src/libcore/result.rs:1084:5{code}</pre>
   
   Comment from Ahmed Riza(dr.r...@gmail.com) @ 2021-02-12T22:52:01.045+0000:
   <pre>I've come across the same error. In my case it appears to be due to the 
`try_clone` calls in 
[https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. 
I have a Parquet file with 3000 columns (see attached example), and the 
`try_clone` calls here eventually fail because they end up creating too many 
open file descriptors.
   
   Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This 
can be reproduced by using the attached Parquet file.
   
   One could increase the `ulimit -n` on Linux to get around this, but that is 
not really a solution, since the code path ends up creating a potentially very 
large number of open file descriptors (one for each column in the Parquet file).
   
   This is the initial stack trace when the footer is first read. 
`FileSource<std::fs::File>::new` (in io.rs) subsequently gets called for every 
column as well when reading the columns (see `fn reader_tree` in 
`parquet/record/reader.rs`).
   
    
   {code:java}
   #0  parquet::util::io::FileSource<std::fs::File>::new<std::fs::File> (fd=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
   #1  0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
   #2  0x000055555590a3fc in parquet::file::footer::parse_metadata<std::fs::File> (chunk_reader=0x7ffff7c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
   #3  0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader<std::fs::File>::new<std::fs::File> (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
   #4  0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
   #5  0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
   #6  0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
   #7  0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
   
    {code}</pre>
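
   To illustrate the descriptor growth described above, here is a minimal sketch of one 
possible way to avoid it: share a single open file between per-column readers 
and use positioned reads instead of calling `try_clone` once per column. This is 
not the parquet crate's API. `SharedChunkReader` and `demo` are hypothetical 
names made up for this example, the three 4-byte "chunks" stand in for column 
chunks, and the code assumes a Unix target because it relies on 
`std::os::unix::fs::FileExt`.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::fs::FileExt; // positioned reads (Unix only)
use std::sync::Arc;

/// Hypothetical per-column reader (not part of the parquet crate).
/// It holds a shared handle to one open file rather than a `try_clone`d
/// copy, so creating many readers does not create many file descriptors.
struct SharedChunkReader {
    file: Arc<File>,
    start: u64,
    length: usize,
}

impl SharedChunkReader {
    /// Reads this reader's byte range with a positioned read.
    /// `read_exact_at` does not move the file cursor, so readers that
    /// share the descriptor do not interfere with each other.
    fn read_all(&self) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; self.length];
        self.file.read_exact_at(&mut buf, self.start)?;
        Ok(buf)
    }
}

/// Writes a small temporary file with three 4-byte "chunks", then reads
/// each chunk through its own reader while only one descriptor is open.
fn demo() -> io::Result<Vec<Vec<u8>>> {
    let path = std::env::temp_dir()
        .join(format!("shared_fd_sketch_{}.bin", std::process::id()));
    {
        let mut f = File::create(&path)?;
        f.write_all(b"aaaabbbbcccc")?;
    }

    // One descriptor, shared by reference counting.
    let file = Arc::new(File::open(&path)?);
    let readers: Vec<SharedChunkReader> = (0..3u64)
        .map(|i| SharedChunkReader {
            file: Arc::clone(&file),
            start: i * 4,
            length: 4,
        })
        .collect();

    let chunks = readers
        .iter()
        .map(|r| r.read_all())
        .collect::<io::Result<Vec<_>>>()?;

    std::fs::remove_file(&path)?;
    Ok(chunks)
}
```

   By contrast, each `File::try_clone` call creates a new descriptor that counts 
against the `ulimit -n` limit, which is why a reader per column on a 
3000-column file can hit os error 24.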


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
