This is your code:

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
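A minimal Python model of where a given (action_id, classifier) pair could land, under two assumptions that should be verified against your Hive version: a STRING column hashes like Java's String.hashCode, and multi-column bucket hashes are folded together as h = 31*h + hash(col).

```python
def java_string_hashcode(s: str) -> int:
    """Java String.hashCode: h = 31*h + char code, in signed 32-bit arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def bucket_for(values, num_buckets: int = 256) -> int:
    """0-based bucket index for a row's bucketing-column values.

    Assumptions (not confirmed by this thread): strings hash like Java's
    String.hashCode, multi-column hashes fold as h = 31*h + hash(col), and the
    final index is (h & Integer.MAX_VALUE) % num_buckets.
    """
    h = 0
    for v in values:
        h = (31 * h + java_string_hashcode(v)) & 0xFFFFFFFF
    h = h - (1 << 32) if h >= (1 << 31) else h
    return (h & 2147483647) % num_buckets

# Under these assumptions, ('aaa', 'bbb') maps to 0-based bucket 1, so the
# matching clause would be BUCKET 2 OUT OF 256, not BUCKET 1.
print(bucket_for(["aaa", "bbb"]))  # 1
```

Note that BUCKET 1 OUT OF 256 always targets the same fixed bucket; if the values of interest hash to a different bucket, the sampled bucket will not contain them.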
Well, I have a table dummy with 1 billion rows imported from Oracle in ORC format:

hive> show create table dummy;
OK
CREATE TABLE `dummy`(
  `id` int,
  `clustered` int,
  `scattered` int,
  `randomised` int,
  `random_string` varchar(50),
  `small_vc` varchar(10),
  `padding` varchar(10))
CLUSTERED BY (id)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='1',
  'numRows'='100000000',
  'orc.bloom.filter.columns'='ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='16777216',
  'rawDataSize'='0',
  'totalSize'='5662644579',
  'transient_lastDdlTime'='1463245925')

Here is the plan for each of the following two cases.
First, a simple case:

hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > WHERE id = 20;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (id = 20) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.064 seconds, Fetched: 51 row(s)

Now we try TABLESAMPLE, but assigning the predicate values inside the brackets, because you are looking in bucket 1 for those values and you want the optimizer to know that.
hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10);
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Otherwise I don't see much happening:

hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
    > WHERE ID = 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (false and (id = 10)) (type: boolean)
                    Statistics: Num rows: 25000000 Data size: 1415661184 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 25000000 Data size: 1415661184 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

In general, in my experience, bucketing is the one area where ORC tables really come in handy.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 14 May 2016 at 13:38, no jihun <jees...@gmail.com> wrote:

> ah, as I mentioned, both field types of action_id and classifier are
> STRING, and I cannot change the type.
>
> CREATE TABLE `X`(`action_id` string,`classifier` string)
> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
> STORED AS ORC
>
> I use two fields for hashing and then bucketing because neither field on
> its own is well distributed.
>
> My concern is not about the strength of the hash source, but about how I
> can TABLESAMPLE down to the one bucket implied by the values in the WHERE
> clause.
>
> When I cluster by string fields, which form is right for TABLESAMPLE?
> 1. provide the fields:
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>
> 2. provide the values:
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb')
>
> On 14 May 2016 at 20:48, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>
>> Can action_id be created as a numeric column?
>>
>> CREATE TABLE X ( action_id bigint, ..)
>>
>> Bucketing or hash partitioning works best on numeric columns with high
>> cardinality (say a primary key).
>>
>> From my old notes:
>>
>> Bucketing in Hive refers to hash partitioning, where a hashing function
>> is applied. As in an RDBMS like Oracle, Hive will apply a linear hashing
>> algorithm to prevent data from clustering within specific partitions.
>> Hashing is very effective if the column selected for bucketing has very
>> high selectivity, like an ID column where selectivity (select
>> count(distinct(column))/count(column)) = 1. In this case, the created
>> partitions/files will be as evenly sized as possible. In a nutshell,
>> bucketing is a method to get data evenly distributed over many
>> partitions/files. One should set the number of buckets to a power of
>> two -- 2^n, like 2, 4, 8, 16 etc. -- to achieve best results. Bucketing
>> will also help concurrency in Hive. It may even allow a partition-wise
>> join, i.e. a join between two tables that are bucketed on the same column
>> with the same number of buckets (has anyone tried this?)
>>
>> One more thing.
When one defines the number of buckets at table creation
>> time in Hive, the number of partitions/files is fixed. In contrast,
>> with partitioning you do not have this limitation.
>>
>> Can you do
>>
>> show create table X
>>
>> and send the output, please?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> On 14 May 2016 at 12:23, no jihun <jees...@gmail.com> wrote:
>>
>>> Hello.
>>>
>>> I want to ask about the correct way to bucket and TABLESAMPLE.
>>>
>>> There is a table X which I created by:
>>>
>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>> STORED AS ORC
>>>
>>> Then I inserted 500M rows into X by:
>>>
>>> set hive.enforce.bucketing=true;
>>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
>>>
>>> Then I want to count or search some rows with a condition. Roughly:
>>>
>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> But I'd better use TABLESAMPLE, as I clustered X on (action_id,
>>> classifier). So the better query should be:
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> Is there anything wrong above? I cannot find any performance gain
>>> between these two queries.
>>>
>>> Query 1 and RESULT (with no tablesample):
>>>
>>> SELECT COUNT(*) FROM X
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> --------------------------------------------------------------------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>> --------------------------------------------------------------------------------
>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>> --------------------------------------------------------------------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s
>>> --------------------------------------------------------------------------------
>>>
>>> It scans the full data.
>>>
>>> Query 2 and RESULT:
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> --------------------------------------------------------------------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>> --------------------------------------------------------------------------------
>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>> --------------------------------------------------------------------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82 s
>>> --------------------------------------------------------------------------------
>>>
>>> It ALSO scans the full data.
>>>
>>> Query 2 RESULT THAT I EXPECTED:
>>>
>>> The result I expected is something like this
>>> (using 1 map task, and relatively faster than without tablesample):
>>>
>>> --------------------------------------------------------------------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>> --------------------------------------------------------------------------------
>>> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>> --------------------------------------------------------------------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx s
>>> --------------------------------------------------------------------------------
>>>
>>> Values of action_id and classifier are well distributed, and there is no
>>> skewed data.
>>>
>>> So I want to ask: what is the correct query to prune down to the one
>>> specific bucket keyed by multiple columns?
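As a footnote to the thread: the EXPLAIN plans in Mich's reply expose the form of Hive's bucket-pruning predicate, ((hash(col) & 2147483647) % 256) = (bucket - 1). A minimal sketch of that arithmetic for a single INT bucketing column, assuming (as with Java's Integer.hashCode) that an int hashes to itself:

```python
# Sketch of the bucket predicate visible in the EXPLAIN plans above:
#   ((hash(col) & 2147483647) % 256) = (bucket_number - 1)
# Assumption (not confirmed by this thread): Hive's hash of an INT value is
# the value itself, matching Java's Integer.hashCode.

NUM_BUCKETS = 256

def bucket_for_int(value: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Mask with Integer.MAX_VALUE to force a non-negative result, then take
    # the remainder; this yields a 0-based bucket index.
    return (value & 2147483647) % num_buckets

# Under this assumption, a row with id = 10 sits in 0-based bucket 10, so the
# clause targeting it would be BUCKET 11 OUT OF 256 (TABLESAMPLE is 1-based).
print(bucket_for_int(10))  # 10
```

Note that TABLESAMPLE's bucket numbering is 1-based while the modulo result is 0-based, which is easy to trip over when picking the bucket to sample.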