Is input-pruning broken in Hive3 (over Tez/MR3) for tablesample?

Pau Tallada Thu, 16 Jun 2022 02:12:51 -0700

Hi,

TL;DR: I'm unable to get input pruning working in Hive 3 for the TABLESAMPLE
clause when sampling by bucket.


In Hive 2, I had bucketed tables clustered into 256 buckets. Whenever I ran
SELECT ... FROM whatever WHERE ...
I could see 256 mappers processing the entire table.
When I ran
SELECT ... FROM whatever TABLESAMPLE(BUCKET 1 OUT OF 256) WHERE ...
I could see just *one* mapper, as the rest of the buckets were pruned.

However, in Hive 3, also with bucketed tables, I'm unable to reproduce this
behaviour. With or without TABLESAMPLE, there are always 256 mappers
processing the data, and the value of the HDFS_BYTES_READ counter is
even slightly higher when using TABLESAMPLE.
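
For reference, the comparison described above could be reproduced with
something like the following sketch. The table name `whatever` is the one
used in the queries above; the column names `id` and `value` and the filter
are placeholders for illustration only, not the real schema.

```sql
-- Hypothetical bucketed table; column names are placeholders.
CREATE TABLE whatever (id BIGINT, value DOUBLE)
CLUSTERED BY (id) INTO 256 BUCKETS;

-- Full scan: 256 mappers observed in both Hive 2 and Hive 3.
SELECT COUNT(*) FROM whatever WHERE value > 0;

-- Bucket sample: one mapper observed in Hive 2, still 256 in Hive 3.
SELECT COUNT(*) FROM whatever TABLESAMPLE(BUCKET 1 OUT OF 256) WHERE value > 0;

-- Compare the two plans to see where the sampling is applied.
EXPLAIN SELECT COUNT(*) FROM whatever TABLESAMPLE(BUCKET 1 OUT OF 256) WHERE value > 0;
```

Comparing the number of mappers and the HDFS_BYTES_READ counter between the
second and third statements shows whether the sample was pruned at the input
or only applied as a filter after reading every bucket.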

As a final note, input pruning DOES work when using TABLESAMPLE(1000 ROWS),
according to the HDFS_BYTES_READ counter.

I also tested with the Hive MR3 Docker image, and it does not appear to
work there either :(

Am I doing something wrong? This feature/behaviour is greatly needed when
working with statistical analysis of huuuuuge datasets...
Any help would be greatly appreciated 🤔

Cheers,

Pau.
-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
