alchemist51 opened a new issue, #18863:
URL: https://github.com/apache/datafusion/issues/18863
### Describe the bug
We ran a ClickBench query on the partitioned `hits` data with different
values of `datafusion.execution.target_partitions` and observed that the results differ.
### To Reproduce
Download the partitioned `hits` data using the command below:
```
seq 0 99 | xargs -P100 -I{} bash -c 'wget --directory-prefix partitioned --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'
```
Register the data folder as an external table from `datafusion-cli`:
```
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION '~/hits/partitioned';
```
```
SET datafusion.execution.target_partitions = 3;
```
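The report does not include the query text. As a hedged sketch, the statement below assumes a query whose shape is inferred from the result columns (`WatchID`, `ClientIP`, `c`, `sum(hits.IsRefresh)`, `avg(hits.ResolutionWidth)`). The `GROUP BY`, `ORDER BY`, and `LIMIT` clauses are assumptions, and the original query may also have included a filter.

```sql
-- Assumed query: inferred from the result columns, not taken from the report.
-- Column names are double-quoted because they are mixed-case.
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth")
FROM hits
GROUP BY "WatchID", "ClientIP"
ORDER BY c DESC   -- assumption: sorted on the count only
LIMIT 10;         -- assumption: both result sets contain 10 rows
```

If the query does sort only on `c`, rows with equal counts are tied. Appending `"WatchID", "ClientIP"` to the `ORDER BY` is one way to check whether the differences persist under a total ordering.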
With `target_partitions` set to 3, the results are:
```
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID             | ClientIP    | c | sum(hits.IsRefresh) | avg(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
| 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
| 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
| 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
| 8449123891155589752 | 558210368   | 1 | 0                   | 1368.0                    |
| 7249929090756875277 | 558210368   | 1 | 0                   | 1368.0                    |
| 8346088010501248028 | -2091064649 | 1 | 0                   | 1638.0                    |
| 8797199898703927977 | -2140306534 | 1 | 0                   | 2038.0                    |
| 7860441087193910310 | 1154898388  | 1 | 0                   | 1638.0                    |
| 6255708766253389085 | 1154898388  | 1 | 0                   | 1638.0                    |
+---------------------+-------------+---+---------------------+---------------------------+
```
When we change `target_partitions` to 10, the results are:
```
+---------------------+-------------+---+---------------------+---------------------------+
| WatchID             | ClientIP    | c | sum(hits.IsRefresh) | avg(hits.ResolutionWidth) |
+---------------------+-------------+---+---------------------+---------------------------+
| 7224410078130478461 | -776509581  | 2 | 0                   | 1368.0                    |
| 8566928176839891583 | -1402644643 | 2 | 0                   | 1368.0                    |
| 6655575552203051303 | 1611957945  | 2 | 0                   | 1638.0                    |
| 7904046282518428963 | 1509330109  | 2 | 0                   | 1368.0                    |
| 9074542984305678345 | 53624704    | 1 | 0                   | 1750.0                    |
| 8657316443585142993 | 1566983451  | 1 | 0                   | 362.0                     |
| 7644914580732725077 | 1328088797  | 1 | 0                   | 1638.0                    |
| 6952801458493199229 | 1328088797  | 1 | 0                   | 1638.0                    |
| 7412006209029830543 | -51933821   | 1 | 0                   | 1250.0                    |
| 7393014425039729645 | -51933821   | 1 | 0                   | 1250.0                    |
+---------------------+-------------+---+---------------------+---------------------------+
```
### Expected behavior
Consistent results for both cases.
### Additional context
_No response_