Re: clustered bucket and tablesample

2016-05-15 Thread no jihun
Thank you so much Talebzadeh. I have filed an issue . https://issues.apache.org/jira/browse/HIVE-13766 And I changed clustering column to int, based on string column hash. something like String actionClassifier = ""my_action,cl_900"; MessageDigest md = MessageDigest.getInstance(MD_5); byte

Re: clustered bucket and tablesample

2016-05-15 Thread Mich Talebzadeh
Hi, OK I tried your table. 0: jdbc:hive2://rhes564:10010/default> describe formatted bucket_x; OK +---+---+-+--+ | col_name| data_type |

Re: clustered bucket and tablesample

2016-05-15 Thread no jihun
OK Talebzadeh thanks. Have you ever tried tablesample with string value hash? something like 'tablesample( bucket 1 out of 256 on some_field='somedata')' I wrote a full scenario. # table creating Time taken: 0.155 seconds, Fetched: 36 row(s) hive> CREATE TABLE `bucket_x` ( > `classifi

Re: clustered bucket and tablesample

2016-05-14 Thread Mich Talebzadeh
This is your code SELECT COUNT(*) FROM X TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) WHERE action_id='aaa' AND classifier='bbb' Well I have a table dummy with 1 billion rows imported from Oracle as ORC format hive> show create table dummy; OK CREATE TABLE `dummy`( `id` int, `c

Re: clustered bucket and tablesample

2016-05-14 Thread no jihun
ah, as i mentioned both field type of action_id and classifier is STRING. and I can not change the type. CREATE TABLE `X`(`action_id` string,`classifier` string) CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS STORED AS ORC I use two fields for hash then bucketing because each one field is n

Re: clustered bucket and tablesample

2016-05-14 Thread Mich Talebzadeh
Is action_id can be created as a numeric column: CREATE TABLE X ( action_id bigint, ..) Bucketing or hash partitioning best works on numeric columns with high cardinality (say a primary key). >From my old notes: Bucketing in Hive refers to hash partitioning where a hashing function is applied.

clustered bucket and tablesample

2016-05-14 Thread no jihun
Hello. I want to ask the correct bucketing and tablesample way. There is a table X which I created by CREATE TABLE `X`(`action_id` string,`classifier` string) CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS STORED AS ORC Then I inserted 500M of rows into X by set hive.enforce.bucketing=tr