Re: confused on different behavior of Bucketized tables do not support INSERT INTO

2012-05-31 Thread Bruce Bian
I'm using hive 0.9.0

On Thursday, May 31, 2012, Bruce Bian wrote:

 Hi,
 I've got a table vt_new_data which is defined as follows:
 CREATE TABLE VT_NEW_DATA
 (
  V_ACCOUNT_NUM string
 ,V_ACCOUNT_MODIFIER_NUM string
 ,V_DEPOSIT_TYPE_CD string
 ,V_DEPOSIT_TERM int
 ,V_LEDGER_SUBJECT_ID string
 ,V_ACCOUNTING_ORG_CD string
 ,V_OPEN_DT string
 ,V_CLOSE_DT string
 ,V_CURRENCY_CD string
 ,V_ACCOUNT_BAL float
 ,V_INNER_MONTH_DELAY_ACCUM float
 ) CLUSTERED BY (V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM) SORTED BY
 (V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM ASC) INTO 256 BUCKETS STORED AS
 RCFile;

 when I execute the following query (this is just a test):
 explain insert into table vt_new_data select * from vt_new_data limit 1;
 a "FAILED: Error in semantic analysis: Bucketized tables do not support
 INSERT INTO: Table: vt_new_data" error occurs.

 but when I execute the query:
 explain insert into table vt_new_data
 select /*+ MAPJOIN(T4) */
  t1.account_num as v_account_num
 ,t1.account_modifier_num as v_account_modifier_num
 ,'3006' as v_deposit_type_cd
 ,0 as  v_deposit_term
 ,'23201000' v_ledger_subject_id
 ,coalesce(t2.party_id,'')  as v_accounting_org_cd
 ,coalesce(t3.card_begin_dt,'19000101') as v_open_dt
 ,coalesce(t3.card_live_dt,'19000101') as v_close_dt
 ,coalesce(t4.currency_cd,substr(t1.account_modifier_num,3,3)) as
 v_currency_cd
 ,coalesce(t4.agt_amt,0) as v_account_bal
 ,0 as v_inner_month_delay_accum
 from t03_e_cash_bucket t1
 left outer join t03_agt_amount_h_bucket t4
 on t1.account_num=t4.account_num
 and t1.account_modifier_num=t4.account_modifier_num
 and t4.agt_amt_type_cd = '001'
 and t4.start_date='$TXNDATE'
 and t4.end_date>'$TXNDATE'
 left outer join t01_party_card_rela_h_bucket t2
 on  t1.card_no = t2.card_no
 and t2.party_card_rela_type_cd = '01'
 and t2.start_date='$TXNDATE'
 and t2.end_date>'$TXNDATE'
 left outer join t03_card_bucket t3
 on t1.card_no = t3.card_no;

 the execution plan is generated successfully and an SMB Map Join is
 triggered, which is great.

 But I don't see the difference here, as both are inserting into a
 bucketized and sorted table?



Re: confused on different behavior of Bucketized tables do not support INSERT INTO

2012-05-31 Thread Bruce Bian
So I did another test on this.
hive> create table test(foo int,bar string) clustered by(foo) sorted by
(foo asc) into 2 buckets;
OK
Time taken: 0.097 seconds
hive> create table test2 (foo int,bar string) clustered by(foo) sorted by
(foo asc) into 2 buckets;
OK
hive> LOAD DATA LOCAL INPATH 'hive/examples/files/kv1.txt' OVERWRITE INTO
TABLE test;
hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.sorting=true;
hive> insert into table test2 select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
……
hive> insert into table test2 select * from test2;
FAILED: Error in semantic analysis: Bucketized tables do not support INSERT
INTO: Table: test2

Seems like the error "FAILED: Error in semantic analysis: Bucketized tables
do not support INSERT INTO: Table: vt_new_data" is only thrown when
inserting into a bucketized table from the same table? And when inserting
into a bucketized table multiple times, it creates an original_file_copy_n
under the same bucket.

-rw-r--r--   3 wbian supergroup   2856 2012-05-31 22:03
/user/hive/warehouse/test2/00_0
-rw-r--r--   3 wbian supergroup   2856 2012-05-31 22:04
/user/hive/warehouse/test2/00_0_copy_1
-rw-r--r--   3 wbian supergroup   2956 2012-05-31 22:03
/user/hive/warehouse/test2/01_0
-rw-r--r--   3 wbian supergroup   2956 2012-05-31 22:04
/user/hive/warehouse/test2/01_0_copy_1

And since what I want to do is an SMB Map Join, the following triggered the
SMB Map Join successfully:
set hive.optimize.bucketmapjoin= true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select /*+mapjoin(test)*/ * from pokes join test on pokes.foo=test.foo;

So what's the reason for throwing that error (I mean, why not support
inserting into a bucketized table from the same table)? And isn't that error
message kind of misleading?





Re: confused on different behavior of Bucketized tables do not support INSERT INTO

2012-05-31 Thread Mark Grover
Hi Bruce,
It seems to me that your issue may be two-fold.

1) The JIRA that introduced insert into table
(https://issues.apache.org/jira/browse/HIVE-306) in Hive 0.8 only added support
for non-bucketed partitions or tables. This is because appending to a table (or
its partition, if it's partitioned) works by adding a new file within the
folder on HDFS. Since bucketing in a partition is implemented as files within a
folder, this would break bucketing. To allow insert into support for bucketed
tables, we will probably have to regenerate the entire bucket(s) within the
partition. Not sure if this is being done or has already been done in a newer
release.
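
For illustration, a rough sketch of that kind of regeneration (this is not
something Hive does for you automatically; new_rows here is just an assumed
staging table with the same schema as test2) would be to rewrite all the
buckets with insert overwrite instead of appending files:

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

insert overwrite table test2
select u.foo, u.bar
from (
  select foo, bar from test2      -- existing rows
  union all
  select foo, bar from new_rows   -- rows you want to append
) u;

Because the whole table (or partition) goes back through a reduce stage, the
clustering and sort order stay intact, at the cost of rewriting all the data.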

2) It seems like there was a minor bug in the implementation. As mentioned in
this ticket (https://issues.apache.org/jira/browse/HIVE-3064), insert into
table works as insert overwrite table if the table name in the insert
statement has upper case characters (black magic, eh?). Regardless of whether
your table was created using upper or lower case letters, this bug manifests
itself if you use upper case characters for the table name in your insert
into table query.
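
To illustrate the behavior that ticket describes (table names here are just
placeholders, and I'm going by the ticket rather than re-verifying it):

insert into table dummy select * from staging;   -- appends, as expected
insert into table DUMMY select * from staging;   -- per HIVE-3064, silently behaves like insert overwrite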

To summarize, you shouldn't be calling insert into on bucketed tables for the
reasons stated above. Also, if you are using the insert into command, try to
restrict your table names to lower case letters.

I have also verified that insert into table dummy select * from dummy works 
when dummy is a non-bucketed table. As a matter of fact, insert overwrite 
table dummy select * from dummy works too.
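
Spelled out, those checks look like this (dummy is assumed to be a
non-bucketed table that already contains some rows):

insert into table dummy select * from dummy;       -- works: appends a copy of the existing rows
insert overwrite table dummy select * from dummy;  -- works too: rewrites the table with its own contents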

I would have expected your query insert into table test2 select * from test; 
to fail since test2 is bucketed as well. However, it doesn't. This seems to be 
a bug and I have created a JIRA for this 
(https://issues.apache.org/jira/browse/HIVE-3077).

Hope that helps,
Mark
