Is action_id can be created as a numeric column: CREATE TABLE X ( action_id bigint, ..)
Bucketing or hash partitioning best works on numeric columns with high cardinality (say a primary key). >From my old notes: Bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) = 1. In this case, the created partitions/ files will be as evenly sized as possible. In a nutshell bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets by a power of two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency in Hive. It may even allow a partition wise join i.e. a join between two tables that are bucketed on the same column with the same number of buckets (anyone has tried this?) One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation. can you do show create table X and send the output. please. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 14 May 2016 at 12:23, no jihun <jees...@gmail.com> wrote: > Hello. > > I want to ask the correct bucketing and tablesample way. > > There is a table X which I created by > > CREATE TABLE `X`(`action_id` string,`classifier` string) > CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS > STORED AS ORC > > Then I inserted 500M of rows into X by > > set hive.enforce.bucketing=true; > INSERT OVERWRITE INTO X SELECT * FROM X_RAW > > Then I want to count or search some rows with condition. roughly, > > SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb' > > But I'd better to USE tablesample as I clustered X (action_id, > classifier). So, the better query will be > > SELECT COUNT(*) FROM X > TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) > WHERE action_id='aaa' AND classifier='bbb' > > Is there any wrong above? But I can't not find any performance gain > between these two query. > > query1 and RESULT( with no tablesample.) > > SELECT COUNT(*)) from X > WHERE action_id='aaa' and classifier='bbb' > > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 256 256 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 > s > > -------------------------------------------------------------------------------- > It scans full data. > > query 2 and RESULT > > SELECT COUNT(*)) from X > TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) > WHERE action_id='aaa' and classifier='bbb' > > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 256 256 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: > 15.82 s > > -------------------------------------------------------------------------------- > It ALSO scans full data. > > query 2 RESULT WHAT I EXPECTED. > > Result what I expected is something like... > (use 1 map and relatively faster than without tabmesample) > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 1 1 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: > 3.xx s > > -------------------------------------------------------------------------------- > > Values of action_id and classifier are well distributed and there is no > skewed data. > > So I want to ask you what will be a correct query that prune and target > specific bucket by multiple column? >