[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284027#comment-17284027 ]
Ahmed Riza edited comment on ARROW-6154 at 2/12/21, 11:46 PM:
--------------------------------------------------------------

I've come across the same error. In my case it is caused by the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls eventually fail because they create too many open file descriptors.

Below is a stack trace from `gdb` that leads to the call in `io.rs`; it can be reproduced with the attached Parquet file. One could raise `ulimit -n` on Linux to get around this, but that is not really a solution, since this code path can still end up creating a very large number of open file descriptors.

This is the initial stack trace, taken when the footer is first read. The code in `io.rs` is then called again for every column when the columns are read (see `fn reader_tree` in `parquet/record/reader.rs`):

{code:java}
#0  parquet::util::io::FileSource<std::fs::File>::new<std::fs::File> (fd=0x7ffff7c3fafc, start=807191, length=65536)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
#1  0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
#2  0x000055555590a3fc in parquet::file::footer::parse_metadata<std::fs::File> (chunk_reader=0x7ffff7c3fafc)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
#3  0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader<std::fs::File>::new<std::fs::File> (chunk_reader=...)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
#4  0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
#5  0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
#6  0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
#7  0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files ()
    at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
{code}

> [Rust] [Parquet] Too many open files (os error 24)
> --------------------------------------------------
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Yesh
> Priority: Major
>
> Attachments: part-00009-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet
>
> Used the Rust *parquet-read* binary to read a deeply nested Parquet file and got the stack trace below. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource<R>::new
>    8: <parquet::file::reader::SerializedRowGroupReader<R> as parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9: <parquet::file::reader::SerializedRowGroupReader<R> as parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16: <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
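To make the failure mode in the comment concrete: each `try_clone` in `io.rs` duplicates the underlying OS file descriptor, so a file with thousands of columns can exhaust `ulimit -n` no matter how high it is set. A minimal sketch of the descriptor-sharing alternative follows — `ColumnSlice` and `read_columns` are hypothetical names for illustration, not the parquet crate's actual API, and the real fix in the crate may differ:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::sync::{Arc, Mutex};

// Hypothetical per-column reader: instead of calling `file.try_clone()`
// once per column chunk (a fresh OS file descriptor each time), every
// column shares a single handle behind a mutex and seeks to its own
// byte range before reading.
struct ColumnSlice {
    file: Arc<Mutex<File>>,
    start: u64,
    length: usize,
}

impl ColumnSlice {
    fn read_all(&self) -> io::Result<Vec<u8>> {
        let mut f = self.file.lock().unwrap();
        f.seek(SeekFrom::Start(self.start))?;
        let mut buf = vec![0u8; self.length];
        f.read_exact(&mut buf)?;
        Ok(buf)
    }
}

// Read `n` fixed-width 4-byte "columns" through one shared descriptor.
fn read_columns(path: &Path, n: u64) -> io::Result<Vec<String>> {
    let file = Arc::new(Mutex::new(File::open(path)?)); // one open(2), one fd
    (0..n)
        .map(|i| {
            let slice = ColumnSlice {
                file: Arc::clone(&file),
                start: i * 4,
                length: 4,
            };
            slice
                .read_all()
                .map(|b| String::from_utf8_lossy(&b).into_owned())
        })
        .collect()
}

fn main() -> io::Result<()> {
    // Scratch file standing in for a Parquet file with three "columns".
    let path = std::env::temp_dir().join("arrow6154_demo.bin");
    std::fs::write(&path, b"col0col1col2")?;
    for (i, col) in read_columns(&path, 3)?.iter().enumerate() {
        println!("column {}: {}", i, col);
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The trade-off is that the mutex serializes reads across columns, whereas `try_clone` gives each reader an independent cursor; a production design might instead use positioned reads (`read_at` on Unix) to keep concurrency without extra descriptors.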