[I] dir/*.parquet UNION ALL dir/*.json reading slow (drill)

via GitHub Tue, 04 Jul 2023 23:08:36 -0700


pandalanax opened a new issue, #2814:
URL: https://github.com/apache/drill/issues/2814

**Describe the bug**
We have a 20GB Parquet data set which we store in HDFS and currently
regenerate monthly. It is stored in 128MB parts, so ~160 .parquet files. We
call this our `history_data`.
We UNION these files with .json files that hold newer data. For example:

The Parquet files range from 2020-01-31 to 2023-06-29.
The JSON files range from 2023-07-01 to the current date.

When issuing a query like:
```sql
SELECT * FROM dfs.co2meter.sensor_data_v3 order by `timestamp` limit 10
```
we observed that Major Fragment 03-xx-xx, which contains:
- JSON_SUB_SCAN
- UNION_ALL
- PROJECT
- PROJECT
- PARQUET_ROW_GROUP_SCAN

only runs with min(# Parquet files, # JSON files) threads/processes.
E.g.:
- 160 Parquet files & 9 JSON files == 9/9 Minor Fragments Reporting (super slow)
- 160 Parquet files & 160 JSON files == 160/160 Minor Fragments Reporting (fastest)
- 160 Parquet files & 320 JSON files == 160/160 Minor Fragments Reporting (also acceptably fast)
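
To make the pattern explicit, here is a minimal sketch of the behavior we see. It only models our three observations; it is not taken from Drill's planner code, and the function name `observed_minor_fragments` is made up for illustration.

```python
def observed_minor_fragments(parquet_files: int, json_files: int) -> int:
    """Model of our observation only (not Drill's actual planner logic):
    the scan fragment appears to run as wide as the smaller of the two inputs."""
    return min(parquet_files, json_files)


# The three (Parquet files, JSON files) combinations reported above.
cases = [(160, 9), (160, 160), (160, 320)]
widths = {case: observed_minor_fragments(*case) for case in cases}

for (parquet_files, json_files), width in widths.items():
    print(f"{parquet_files} parquet, {json_files} json -> {width} minor fragments")
```

Under this model the 9-file JSON side caps the whole fragment at 9 minor fragments, even though the Parquet side alone could be read with 160.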

Is there a config option to control how many minor fragments this fragment uses?

**Drill version**
1.17.0

**Additional context**
4 Drillbits.

160/160 Minor Fragments Reporting

![image](https://github.com/apache/drill/assets/38329801/0f3cb9fb-2c00-4c6c-80df-fc3d4a1777ae)

9/9 Minor Fragments Reporting

![image](https://github.com/apache/drill/assets/38329801/00d784e7-05f1-4146-8d1f-ca0c5f6629ed)
