Hi all, We have a web platform in production[1] that uses Hive to facilitate access to massive cosmological datasets. When launched in 2016 over Hive 2.1.2 we used the TABLESAMPLE clause on clustered tables to allow quick subsampling of the data. However, we have been unable to get the same behaviour using Hive 3.1.2.
Tables are clustered following indications[2], but the queries always read all the data in the table. In fact, they read even more data (HDFS_BYTES_READ counter) when the tablesample clause is used. Example query on a very large table: SELECT SUM(float_column) FROM huge_clustered_table => 23552 tasks HDFS_BYTES_READ = 44475888168 (44G) SELECT SUM(float_column) FROM huge_clustered_table TABLESAMPLE(BUCKET1 OUT OF 1024) => *23552 tasks* *HDFS_BYTES_READ = 58372075670 (58G) (????)* However, using block sampling: SELECT SUM(float_column) FROM huge_clustered_table TABLESAMPLE(0.1 PERCENT) => *25 tasks* * HDFS_BYTES_READ = 45484944 (45M)* Please, any hint would be greatly appreciated! [1] https://cosmohub.pic.es [2] https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling -- ---------------------------------- Pau Tallada Crespí Departament de Serveis Port d'Informació Científica (PIC) Tel: +34 93 170 2729 ----------------------------------