Hi all,

We have a web platform in production[1] that uses Hive to facilitate access
to massive cosmological datasets.
When launched in 2016 over Hive 2.1.2 we used the TABLESAMPLE clause on
clustered tables to allow quick subsampling of the data.
However, we have been unable to get the same behaviour using Hive 3.1.2.

Tables are clustered following indications[2], but the queries always read
all the data in the table.
In fact, they read even more data (HDFS_BYTES_READ counter) when the
tablesample clause is used.

Example query on a very large table:

SELECT SUM(float_column) FROM huge_clustered_table
=>
23552 tasks
HDFS_BYTES_READ  = 44475888168 (44G)

SELECT SUM(float_column) FROM huge_clustered_table
TABLESAMPLE(BUCKET1 OUT OF 1024)
=>
*23552 tasks*

*HDFS_BYTES_READ  = 58372075670 (58G) (????)*

However, using block sampling:

SELECT SUM(float_column) FROM huge_clustered_table
TABLESAMPLE(0.1 PERCENT)
=>

*25 tasks*
* HDFS_BYTES_READ  = 45484944 (45M)*

Please, any hint would be greatly appreciated!

[1] https://cosmohub.pic.es
[2] https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling
-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------

Reply via email to