Those files are a bit smaller than ideal, but not so small that I would expect anything like the performance you are getting.
I can't speak for the Java implementation; I know very little about it. However, I have generated some test data that matches your description and loaded it with Python. It took ~2 seconds to read all 4096 files. This is on a pretty standard desktop with 8 cores (16 threads). In fact, even if I restrict things to a single core, it only takes about 9 seconds.

So no, this is not the performance I would expect, and I don't think the fix is simply a matter of tuning certain parameters. We are missing something.

Have you tried the "Query Data Content For Directory" example?[1] Would you be able to generate some kind of profiling output or a flame chart?

[1] https://arrow.apache.org/cookbook/java/dataset.html#id11

On Sun, Jul 2, 2023 at 1:13 PM Paulo Motta <[email protected]> wrote:

> Each file has 1-2 MB with one row group each, around 2000 rows per file,
> and 61 columns - a total of 7,697,842 rows. Is this performance expected
> for this dataset, or are there any suggestions to optimize?
>
> Thanks!
>
> On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <[email protected]> wrote:
>
>> What size are the row groups in your parquet files? How many columns
>> and rows are in the files?
>>
>> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read 4096 Parquet files with a total size of 6 GB using
>>> this cookbook:
>>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>>
>>> I'm using 100 threads, each thread processing one file at a time, on a
>>> 72-core machine with a 32 GB heap. The files are pre-loaded in memory.
>>>
>>> However, it's taking about 10 minutes to process these 4096 files with
>>> a total size of only 6 GB, and the process appears to be CPU-bound.
>>>
>>> Is this the expected read performance for Parquet files, or am I doing
>>> something wrong? Any help or tips would be appreciated.
>>>
>>> Thanks,
>>>
>>> Paulo
>>>
>>
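For reference, a minimal sketch of that kind of Python read test (assuming pyarrow.dataset and all files under a single local directory; the path is a placeholder, not the exact script I ran):

import time

# Sketch of a bulk Parquet read test with pyarrow.dataset.
# Assumes pyarrow is installed and the 4096 files live under one
# local directory ("/path/to/parquet" is a placeholder).
import pyarrow.dataset as ds

start = time.time()

# Discover every Parquet file in the directory as a single dataset.
dataset = ds.dataset("/path/to/parquet", format="parquet")

# Read everything into memory; pyarrow parallelizes this across CPU
# threads by default.
table = dataset.to_table()

elapsed = time.time() - start
print(f"read {table.num_rows} rows, {len(table.column_names)} columns "
      f"in {elapsed:.1f}s")

Passing use_threads=False to to_table() restricts the read to a single thread, which is one way to get a single-core comparison like the one above.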
