For reference, these are the scripts I used.

## To generate the data

```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

NUM_COLUMNS = 61
NUM_ROWS = 2048
NUM_FILES = 4096
DATA_DIR = "/home/pace/dev/data/small_files"

rng = np.random.default_rng()
for file_index in range(NUM_FILES):
    arrays = []
    names = []
    for column_index in range(NUM_COLUMNS):
        arrays.append(rng.random(NUM_ROWS))
        names.append(f"col{column_index}")
    table = pa.Table.from_arrays(arrays, names=names)
    pq.write_table(table, f"{DATA_DIR}/file_{file_index}.parquet")
```

## To then read the data

```
import time

import pyarrow.dataset as ds

DATA_DIR = "/home/pace/dev/data/small_files"

my_dataset = ds.dataset(DATA_DIR)
start = time.time()
my_table = my_dataset.to_table()
end = time.time()
print(f"Loaded a table with {my_table.num_rows} rows and "
      f"{my_table.num_columns} columns in {end - start} seconds")
```

On Mon, Jul 3, 2023 at 7:54 AM Weston Pace <weston.p...@gmail.com> wrote:

> Those files are a bit smaller than ideal, but not small enough that I would
> expect anything like the performance you are getting.
>
> I can't speak for the Java implementation; I know very little about it.
> However, I have generated some test data that matches your description and
> loaded it with Python. It took ~2 seconds to read all 4096 files. This is
> on a fairly standard desktop with 8 cores (16 threads). In fact, even if I
> restrict things to a single core, it only takes about 9 seconds.
>
> So no, this is not the performance I would expect. I don't think the fix
> is simply a matter of tuning certain parameters. We are missing
> something.
>
> Have you tried the "Query Data Content For Directory" example? [1] Would
> you be able to generate some kind of profiling output or flame chart?
>
> [1] https://arrow.apache.org/cookbook/java/dataset.html#id11
>
> On Sun, Jul 2, 2023 at 1:13 PM Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
>> Each file is 1-2 MB with one row group each, around 2000 rows per file,
>> and 61 columns, for a total of 7697842 rows. Is this performance expected
>> for this dataset, or is there any suggestion to optimize?
>>
>> Thanks!
>>
>> On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <weston.p...@gmail.com>
>> wrote:
>>
>>> What size are the row groups in your parquet files? How many columns
>>> and rows are in the files?
>>>
>>> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <pauloricard...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to read 4096 parquet files with a total size of 6 GB using
>>>> this cookbook:
>>>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>>>
>>>> I'm using 100 threads, each thread processing one file at a time, on a
>>>> 72-core machine with a 32 GB heap. The files are pre-loaded in memory.
>>>>
>>>> However, it's taking about 10 minutes to process these 4096 files with
>>>> a total size of only 6 GB, and the process seems to be CPU-bound.
>>>>
>>>> Is this expected read performance for parquet files, or am I
>>>> doing something wrong? Any help or tips would be appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Paulo