Re: [Java][Parquet] Bulk Read Performance

2023-07-03 Weston Pace
For reference, these are the scripts I used.

## To generate data

```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

NUM_COLUMNS = 61
NUM_ROWS = 2048
NUM_FILES = 4096
DATA_DIR = "/home/pace/dev/data/small_files"

rng = np.random.default_rng()

for file_index in range(NUM_FILES):
    # The archived message is truncated at this line; the loop body below
    # is a plausible reconstruction: write one file of NUM_ROWS rows and
    # NUM_COLUMNS random float64 columns, matching the files described
    # in this thread.
    table = pa.table({f"f{i}": rng.random(NUM_ROWS) for i in range(NUM_COLUMNS)})
    pq.write_table(table, f"{DATA_DIR}/{file_index}.parquet")
```

Re: [Java][Parquet] Bulk Read Performance

2023-07-03 Weston Pace
Those files are a bit smaller than ideal, but not so small that I would expect anything like the performance you are getting. I can't speak for the Java implementation; I know very little about it. However, I have generated some test data that matches your description and loaded it with

Re: [Java][Parquet] Bulk Read Performance

2023-07-02 Paulo Motta
Each file is 1-2MB with a single row group, around 2000 rows per file, and 61 columns - a total of 7697842 rows. Is this performance expected for this dataset, or is there any suggestion to optimize? Thanks! On Sat, Jul 1, 2023 at 10:44 PM Weston Pace wrote: > What size are the row groups in your
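(For scale: 7697842 rows across 4096 files is about 1880 rows per file, and the 6GB total reported in the original message works out to roughly 1.5MB per file, consistent with the figures above.)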

Re: [Java][Parquet] Bulk Read Performance

2023-07-01 Weston Pace
What size are the row groups in your parquet files? How many columns and rows are in the files? On Sat, Jul 1, 2023, 6:08 PM Paulo Motta wrote: > Hi, > > I'm trying to read 4096 parquet files with a total size of 6GB using this > cookbook: >

[Java][Parquet] Bulk Read Performance

2023-07-01 Paulo Motta
Hi, I'm trying to read 4096 parquet files with a total size of 6GB using this cookbook: https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file I'm using 100 threads, each thread processing one file at a time, on a 72-core machine with a 32GB heap. The files are pre-loaded in memory.
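For readers following along, the cookbook recipe linked above reduces to roughly the sketch below for a single file. This is a minimal sketch of the Java Dataset API usage, not the exact code from the original message: the URI and batch size are placeholders, and the 100-thread driver, timing, and error handling are omitted.

```
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public class ReadOneParquetFile {
    public static void main(String[] args) throws Exception {
        // Placeholder path; in the setup described above, each of the 100
        // worker threads would run this once per file.
        String uri = "file:///path/to/one/of/the/files.parquet";
        ScanOptions options = new ScanOptions(/*batchSize=*/ 32768);
        try (BufferAllocator allocator = new RootAllocator();
             DatasetFactory factory = new FileSystemDatasetFactory(
                     allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
             Dataset dataset = factory.finish();
             Scanner scanner = dataset.newScan(options);
             ArrowReader reader = scanner.scanBatches()) {
            long rows = 0;
            // Pull Arrow record batches until the file is exhausted.
            while (reader.loadNextBatch()) {
                rows += reader.getVectorSchemaRoot().getRowCount();
            }
            System.out.println("rows: " + rows);
        }
    }
}
```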