For reference, these are the scripts I used.

## To generate data
```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

NUM_COLUMNS = 61
NUM_ROWS = 2048
NUM_FILES = 4096
DATA_DIR = "/home/pace/dev/data/small_files"

rng = np.random.default_rng()

for file_index in range(NUM_FILES):
    arrays = []
    names = []
    for column_index in range(NUM_COLUMNS):
        arrays.append(rng.random(NUM_ROWS))
        names.append(f"col{column_index}")
    table = pa.Table.from_arrays(arrays, names=names)
    pq.write_table(table, f"{DATA_DIR}/file_{file_index}.parquet")
```
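
As a sanity check, the metadata of one generated file can be inspected to confirm it matches the description in the thread (one row group, ~2000 rows, 61 columns). This is a minimal sketch, not one of the scripts above; the filename assumes the same DATA_DIR:
```
import pyarrow.parquet as pq

# Inspect one generated file (path assumed from the generation script above)
pf = pq.ParquetFile("/home/pace/dev/data/small_files/file_0.parquet")
meta = pf.metadata
print(f"row groups: {meta.num_row_groups}, "
      f"rows: {meta.num_rows}, columns: {meta.num_columns}")
```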

## To then read the data
```
import time

import pyarrow.dataset as ds

DATA_DIR = "/home/pace/dev/data/small_files"

my_dataset = ds.dataset(DATA_DIR)
start = time.time()
my_table = my_dataset.to_table()
end = time.time()

print(f"Loaded a table with {my_table.num_rows} rows and
{my_table.num_columns} columns in {end - start} seconds")
```
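
For the single-core measurement mentioned in the reply below, the thread pools can be restricted before the scan; this is a minimal sketch assuming `pyarrow.set_cpu_count` and `pyarrow.set_io_thread_count` are called before `to_table()`:
```
import pyarrow as pa

# Limit both the CPU and I/O thread pools to a single thread
# before calling my_dataset.to_table() in the script above.
pa.set_cpu_count(1)
pa.set_io_thread_count(1)
```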

On Mon, Jul 3, 2023 at 7:54 AM Weston Pace <weston.p...@gmail.com> wrote:

> Those files are a bit smaller than ideal but not small enough that I would
> expect anything like the performance you are getting.
>
> I can't speak for the Java implementation, I know very little about it.
> However, I have generated some test data that matches your description and
> loaded it with python.  It took ~2 seconds to read all 4096 files.  This is
> on a pretty standard desktop with 8 cores (16 threads).  In fact, even if I
> restrict things to a single core, it only takes about 9 seconds.
>
> So no, this is not the performance I would expect.  I don't think the fix
> is simply a matter of optimizing certain parameters.  We are missing
> something.
>
> Have you tried the "Query Data Content For Directory" example?[1]  Would
> you be able to generate some kind of profiling or flame chart?
>
> [1] https://arrow.apache.org/cookbook/java/dataset.html#id11
>
> On Sun, Jul 2, 2023 at 1:13 PM Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
>> Each file is 1-2 MB with a single row group, around 2000 rows per file,
>> and 61 columns - a total of 7,697,842 rows. Is this performance expected
>> for this dataset, or is there any suggestion to optimize?
>>
>> Thanks!
>>
>> On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <weston.p...@gmail.com>
>> wrote:
>>
>>> What size are the row groups in your parquet files?  How many columns
>>> and rows in the files?
>>>
>>> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <pauloricard...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to read 4096 parquet files with a total size of 6GB using
>>>> this cookbook:
>>>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>>>
>>>> I'm using 100 threads, each thread processing one file at a time on a
>>>> 72 core machine with 32GB heap. The files are pre-loaded in memory.
>>>>
>>>> However, it's taking about 10 minutes to process these 4096 files with a
>>>> total size of only 6GB, and the process seems to be CPU-bound.
>>>>
>>>> Is this expected read performance for Parquet files, or am I
>>>> doing something wrong? Any help or tips would be appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Paulo
>>>>
>>>
