Hello Stamatis,

We use a recent commit (or the latest one) from the master branch and run Hive on Tez 0.10.2.

For query 22, the slow execution seems to be related to the split size used in IcebergInputFormat.getSplits(). We will try to create a JIRA when we make more progress.
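
To illustrate why the split size matters for parallelism, here is a minimal sketch in Python. It is not the actual logic of IcebergInputFormat.getSplits(): it assumes a simplified model in which every data file is cut into chunks of at most a target size and each chunk becomes one task. The function name, the file sizes, and the target sizes are all hypothetical.

```python
import math

MIB = 1024 * 1024


def estimate_split_count(file_sizes_bytes, target_split_bytes):
    """Estimate the number of splits under a simplified model.

    Assumption: each file is cut independently into chunks of at most
    target_split_bytes, and small files are not combined into one split.
    """
    return sum(
        math.ceil(size / target_split_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


# Hypothetical input: eight data files of 512 MiB each.
files = [512 * MIB] * 8

small_target = estimate_split_count(files, 128 * MIB)    # 4 chunks per file
large_target = estimate_split_count(files, 1024 * MIB)   # 1 chunk per file

print(small_target, large_target)  # 32 8
```

Under this model, a larger target split size yields fewer splits and therefore fewer tasks for the first Map vertex, which is the kind of effect we suspect for query 22.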

For query 64, the result is wrong (returning 0 rows) on the 1TB TPC-DS dataset, but there is a separate report that the result is correct on the 100GB TPC-DS dataset. We are not sure why this happens, so we are going to run more experiments.

Best,

Sungwoo

On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:

Hi Sungwoo,

Many thanks for sharing your findings; interesting observations.

If you can, please also share the project versions that you used for running
the experiments.

Best,
Stamatis

On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park <c...@pl.postech.ac.kr> wrote:

Hello,

I ran the TPC-DS benchmark using Metastore (in the traditional way) and
Iceberg, and would like to share the results for those interested in using
Hive with Iceberg. The experiment used the 1TB TPC-DS dataset stored as ORC.

Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for
the first Map vertex.)

4. Out of 99 queries, 98 queries return correct results, but query 64
returns wrong results (returning 0 rows) due to an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002

--- Sungwoo




