Hello Stamatis,
We used a recent commit (at or close to the latest) on the master branch and
ran Hive on Tez 0.10.2.
For query 22, the slow execution seems to be related to the split size used in
IcebergInputFormat.getSplits(). We will try to create a JIRA when we make more
progress.
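To illustrate why the split size matters, here is a minimal sketch of how the
number of splits (and so roughly the number of map tasks) falls as the split
size grows. This is not the actual IcebergInputFormat logic. The class name,
the method estimateSplits, and all sizes below are hypothetical and chosen only
for illustration.

```java
public class SplitCountSketch {

    // Hypothetical helper: number of splits when totalBytes of input is cut
    // into chunks of at most splitSizeBytes (ceiling division).
    static long estimateSplits(long totalBytes, long splitSizeBytes) {
        if (splitSizeBytes <= 0) {
            throw new IllegalArgumentException("splitSizeBytes must be positive");
        }
        if (totalBytes <= 0) {
            return 0;
        }
        return (totalBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        // Made-up input size, not measured from the TPC-DS tables.
        long totalBytes = 4L * 1024 * 1024 * 1024; // 4 GiB

        // Made-up candidate split sizes: 1 GiB, 128 MiB, 16 MiB.
        long[] splitSizes = {
            1024L * 1024 * 1024,
            128L * 1024 * 1024,
            16L * 1024 * 1024
        };

        for (long splitSize : splitSizes) {
            System.out.println("split size " + splitSize + " bytes: "
                + estimateSplits(totalBytes, splitSize) + " splits");
        }
    }
}
```

With these made-up numbers, an oversized split leaves only a handful of
splits, so most of the cluster sits idle while a few tasks read the whole
input.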
For query 64, the result is wrong (0 rows returned) on 1TB TPC-DS, but a
separate report says the result is correct on 100GB TPC-DS. We are not sure
why this happens, so we are going to run more experiments.
Best,
Sungwoo
On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:
Hi Sungwoo,
Many thanks for sharing your findings; interesting observations.
If you can, please also share the project versions that you used for running
the experiments.
Best,
Stamatis
On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park <c...@pl.postech.ac.kr> wrote:
Hello,
I ran the TPC-DS benchmark using Metastore (in the traditional way) and
Iceberg, and would like to share the results with those interested in Hive
using Iceberg.
The experiment used a 1TB TPC-DS dataset stored as ORC.
Here are a few findings.
1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.
2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds
3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for
the first Map vertex.)
4. Out of 99 queries, 98 queries return correct results, but query 64 returns
wrong results (returning 0 rows) due to an exception:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002
--- Sungwoo