> Is there some configuration parameter I can tweak? Is this a known issue? Has 
> it been addressed in newer versions?

Yes/Yes/Yes

There has been work on this in recent releases.  I think the biggest
changes arrived in 4.0.0, and a few bug fixes have landed since then.

If you are reading single files then what you will want to look for is
parquet::ParquetFileReader::PreBuffer.  This function must be called
to tell the reader which data (row groups and columns) you plan to
read.  Once you've done that, the reader will combine small reads into
larger reads, which should reduce the total number of requests.  This
should give a pretty significant boost to S3 performance.
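
To make the "combine small reads into larger reads" idea concrete, here
is a minimal standalone sketch of range coalescing.  It is not Arrow's
actual code.  ReadRange and CoalesceRanges are hypothetical names, and
hole_size_limit / range_size_limit are illustrative parameters I am
assuming for the sketch (how large a gap to read through, and how big a
merged read may grow).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for this sketch: bytes [offset, offset + length).
struct ReadRange {
  std::int64_t offset;
  std::int64_t length;
};

// Hypothetical helper, not an Arrow API. Merges ranges that are separated
// by at most `hole_size_limit` bytes, provided the merged range stays
// within `range_size_limit` bytes.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      std::int64_t hole_size_limit,
                                      std::int64_t range_size_limit) {
  assert(hole_size_limit >= 0);
  assert(range_size_limit > 0);

  // Process ranges in file order so only neighbours need comparing.
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });

  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const std::int64_t last_end = last.offset + last.length;
      const std::int64_t new_end = std::max(last_end, r.offset + r.length);
      const bool close_enough = r.offset - last_end <= hole_size_limit;
      const bool small_enough = new_end - last.offset <= range_size_limit;
      if (close_enough && small_enough) {
        // Extend the previous range instead of issuing a new request.
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

The six ReadAt calls in your trace are back to back (each one starts
where the previous one ends), so with any generous size limit this
sketch would turn them into a single request for 84651 bytes starting
at position 84273.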

You may also want to look into the datasets API.  The datasets logic
not only prebuffers for you but also reads multiple files (and
multiple batches within a file) concurrently.

On Thu, Nov 4, 2021 at 12:05 PM Bipin Mathew <[email protected]> wrote:
>
> Hello Apache Arrow Team,
>
> I am using apache-arrow-3.0.0 and encountering a significant performance 
> issue reading parquet files over s3. I believe I have traced down the issue 
> to a very large number of curl requests being made apparently on an as-needed 
> basis ( see gdb trace below ). That is, there does not appear to be any 
> obvious buffering going on to amortize the over-the-wire latency. Am I doing 
> something wrong here? Is there some configuration parameter I can tweak? Is 
> this a known issue? Has it been addressed in newer versions? Any guidance 
> will be greatly appreciated.
>
> Kind Regards,
>
> Bipin
>
> Trace of program using gdb stopped at "arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt" Notice how small (nbytes) each request 
> is.
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=84273, nbytes=7846)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> (gdb) cont
> Continuing.
> [New Thread 0x7fffde68b700 (LWP 750147)]
> [Thread 0x7fffde68b700 (LWP 750147) exited]
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=92119, nbytes=6974)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> (gdb) cont
> Continuing.
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=99093, nbytes=7040)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> (gdb) cont
> Continuing.
> [New Thread 0x7fffde68b700 (LWP 750164)]
> [Thread 0x7fffde68b700 (LWP 750164) exited]
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=106133, nbytes=6875)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> (gdb) cont
> Continuing.
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=113008, nbytes=29380)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> (gdb) cont
> Continuing.
> [New Thread 0x7fffde68b700 (LWP 750181)]
> [Thread 0x7fffde68b700 (LWP 750181) exited]
>
> Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous 
> namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
>     position=142388, nbytes=26536)
>     at 
> /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
