Those files are a bit smaller than ideal, but not so small that I would expect anything like the performance you are getting.
I can't speak for the Java implementation; I know very little about it. However, I have generated some test data that matches your description and loaded it with Python. It took ~2 seconds to read all 4096 files. This is on a pretty standard desktop with 8 cores (16 threads). In fact, even if I restrict things to a single core, it only takes about 9 seconds.

So no, this is not the performance I would expect, and I don't think the fix is simply a matter of tuning certain parameters. We are missing something.

Have you tried the "Query Data Content For Directory" example?[1] Would you be able to generate some kind of profiling output or a flame chart?

[1] https://arrow.apache.org/cookbook/java/dataset.html#id11

On Sun, Jul 2, 2023 at 1:13 PM Paulo Motta <[email protected]> wrote:

> Each file has 1-2 MB with one row group each, around 2000 rows per file,
> and 61 columns - a total of 7,697,842 rows. Is this performance expected
> for this dataset, or are there any suggestions to optimize?
>
> Thanks!
>
> On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <[email protected]> wrote:
>
>> What size are the row groups in your parquet files? How many columns
>> and rows are in the files?
>>
>> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read 4096 Parquet files with a total size of 6 GB using
>>> this cookbook:
>>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>>
>>> I'm using 100 threads, each thread processing one file at a time, on a
>>> 72-core machine with a 32 GB heap. The files are pre-loaded in memory.
>>>
>>> However, it's taking about 10 minutes to process these 4096 files with
>>> a total size of only 6 GB, and the process appears to be CPU-bound.
>>>
>>> Is this the expected read performance for Parquet files, or am I doing
>>> something wrong? Any help or tips would be appreciated.
>>>
>>> Thanks,
>>>
>>> Paulo
>>>
>>
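For reference, a minimal sketch of that kind of Python read test (assuming pyarrow.dataset and all files under a single local directory; the path is a placeholder, not the exact script I ran):

import time

# Sketch of a bulk Parquet read test with pyarrow.dataset.
# Assumes pyarrow is installed and the 4096 files live under one
# local directory ("/path/to/parquet" is a placeholder).
import pyarrow.dataset as ds

start = time.time()

# Discover every Parquet file in the directory as a single dataset.
dataset = ds.dataset("/path/to/parquet", format="parquet")

# Read everything into memory; pyarrow parallelizes this across CPU
# threads by default.
table = dataset.to_table()

elapsed = time.time() - start
print(f"read {table.num_rows} rows, {len(table.column_names)} columns "
      f"in {elapsed:.1f}s")

Passing use_threads=False to to_table() restricts the read to a single thread, which is one way to get a single-core comparison like the one above.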
