Hi,

TL;DR: I'm unable to get input pruning working in Hive 3 for the TABLESAMPLE clause using buckets.
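For concreteness, the queries being compared have roughly the shape sketched below. This is only an illustration: the columns `id` and `value`, the `ON id` sampling column, and the `WHERE` predicate are placeholders and not the real schema or filter.

```sql
-- Hypothetical bucketed table; column names are placeholders.
CREATE TABLE whatever (id BIGINT, value DOUBLE)
CLUSTERED BY (id) INTO 256 BUCKETS
STORED AS ORC;

-- Full scan: 256 mappers expected.
SELECT id, value
FROM whatever
WHERE value > 0;  -- placeholder predicate

-- Bucket sampling: expected to read a single bucket file.
SELECT id, value
FROM whatever TABLESAMPLE(BUCKET 1 OUT OF 256 ON id)
WHERE value > 0;  -- placeholder predicate

-- Row-count sampling: this one does reduce HDFS_BYTES_READ.
SELECT id, value
FROM whatever TABLESAMPLE(1000 ROWS)
WHERE value > 0;  -- placeholder predicate
```

Comparing the mapper count and the HDFS_BYTES_READ counter across these three queries is how the difference shows up.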
In Hive 2, I had bucketed tables clustered into 256 buckets. Whenever I ran a SELECT ... FROM whatever WHERE ... I could see 256 mappers processing the entire table data. When I ran a SELECT ... FROM whatever TABLESAMPLE(BUCKET 1 OUT OF 256) WHERE ... I could see just *one* mapper, as the rest of the buckets got pruned out.

However, in Hive 3, also with bucketed tables, I'm unable to reproduce this behaviour. With or without TABLESAMPLE, there are always 256 mappers processing the data, and the amount reported in the HDFS_BYTES_READ counter is even a little higher when using TABLESAMPLE.

As a final note, input pruning DOES WORK when using TABLESAMPLE(1000 ROWS), as reported in the HDFS_BYTES_READ counter.

I also tested with the Hive MR3 docker image, and it does not appear to work there either :(

Am I doing something wrong? This feature/behaviour is greatly needed when doing statistical analysis of huge datasets...

Any help would be greatly appreciated 🤔

Cheers,

Pau.

--
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------