For reference, these are the scripts I used.
## To generate data
```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

NUM_COLUMNS = 61
NUM_ROWS = 2048
NUM_FILES = 4096
DATA_DIR = "/home/pace/dev/data/small_files"

rng = np.random.default_rng()
# Loop body reconstructed from context: write one file of random
# float64 data per iteration (column names are assumed)
for file_index in range(NUM_FILES):
    arrays = [pa.array(rng.random(NUM_ROWS)) for _ in range(NUM_COLUMNS)]
    table = pa.table(arrays, names=[f"col{i}" for i in range(NUM_COLUMNS)])
    pq.write_table(table, f"{DATA_DIR}/file_{file_index}.parquet")
```
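## To read data
A minimal sketch, assuming the load used pyarrow's dataset API (the original read script may have differed):
```
import time

import pyarrow.dataset as ds

DATA_DIR = "/home/pace/dev/data/small_files"

start = time.time()
# Discover every parquet file under DATA_DIR and read them into one table
dataset = ds.dataset(DATA_DIR, format="parquet")
table = dataset.to_table()
elapsed = time.time() - start
print(f"read {table.num_rows} rows x {table.num_columns} columns in {elapsed:.2f}s")
```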
Those files are a bit smaller than ideal but not small enough that I would
expect anything like the performance you are getting.
I can't speak for the Java implementation; I know very little about it.
However, I have generated some test data that matches your description and
loaded it with pyarrow.
Each file is 1-2 MB with a single row group, around 2000 rows per file, and
61 columns, for a total of 7,697,842 rows. Is this performance expected for
this dataset, or is there anything I can do to optimize it?
Thanks!
On Sat, Jul 1, 2023 at 10:44 PM Weston Pace wrote:
What size are the row groups in your parquet files? How many columns and
rows in the files?
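One way to check, sketched with pyarrow (the file path here is hypothetical):
```
import pyarrow.parquet as pq

# hypothetical path to one of the files in question
pf = pq.ParquetFile("/data/small_files/file_0.parquet")
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows, meta.num_columns)
# per-row-group row counts and byte sizes
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)
```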
On Sat, Jul 1, 2023, 6:08 PM Paulo Motta wrote:
Hi,
I'm trying to read 4096 parquet files with a total size of 6GB using this
cookbook:
https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
I'm using 100 threads, with each thread processing one file at a time, on a
72-core machine with a 32 GB heap. The files are pre-loaded in memory.
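For comparison, a rough Python analogue of this access pattern (one file per
task over a fixed worker pool; the paths and pool size are assumptions, not
the Java setup itself):
```
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

# hypothetical paths to the 4096 files
files = [f"/data/small_files/file_{i}.parquet" for i in range(4096)]

def read_one(path):
    # pyarrow generally releases the GIL while decoding, so a thread
    # pool can overlap file reads even in Python
    return pq.read_table(path)

with ThreadPoolExecutor(max_workers=100) as pool:
    tables = list(pool.map(read_one, files))
print(sum(t.num_rows for t in tables), "rows read")
```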