[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

2017-06-05 Thread Niklaus Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklaus Xiao updated HIVE-12412:

Affects Version/s: 2.3.0

> Multi insert queries fail to run properly in hive 1.1.x or later.
> -
>
> Key: HIVE-12412
> URL: https://issues.apache.org/jira/browse/HIVE-12412
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.1.0, 2.3.0
>Reporter: John P. Petrakis
>  Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data in one table and manipulate it by 
> inserting it into a results table.  Queries are of this form:
> from (select * from data_table lateral view explode(data_table.f2) f2 as 
> explode_f2) as explode_data_table  
>insert overwrite table results_table partition (q_id='C.P1',rl='1') 
>select 
>array(cast(if(explode_data_table.f1 is null or 
> explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as 
> String),cast(explode_f2.s1 as String)) as dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
>null as rownm 
>where (explode_data_table.date_id between 20151016 and 20151016)
>group by 
>if(explode_data_table.f1 is null or explode_data_table.f1='', 
> 'UNKNOWN',explode_data_table.f1),
>explode_f2.s1 
>INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
>SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS 
> dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>null AS rownm 
>WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
>INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
>SELECT 
>ARRAY(CAST('Total' as String)) AS dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>null AS rownm 
>WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> This query is meant to total a given field of a struct that is potentially a 
> list of structs.  For our test data set, which consists of a single row, the 
> summation yields "Null",  with messages in the hive log of the nature:
> Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
> or "Extra fields detected..."
> For significantly more data, this query will eventually cause a run time 
> error while processing a column (caused by array index out of bounds 
> exception in one of the lazy binary classes such as LazyBinaryString or 
> LazyBinaryStruct).
> Using the query above from the hive command line, the following data was used:
> (note there are tabs in the data below)
> string oneone:1.0:1.00:10.0,eon:1.0:1.00:100.0
> string twotwo:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
> string thrthree:3.0:3.00:30.0
> string foufour:4.0:4.00:40.0
> There are two fields, a string, (eg. 'string one') and a list of structs.  
> The following is used to create the table:
> create table if not exists t1 (
>  f1 string, 
>   f2 
> array>
>  )
>   partitioned by (clid string, date_id string) 
>   row format delimited fields 
>  terminated by '09' 
>  collection items terminated by ',' 
>  map keys terminated by ':'
>  lines terminated by '10' 
>  location '/user/hive/warehouse/t1';
> And the following is used to load the data:
> load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into 
> table t1  partition(client_id='987654321',date_id='20151016');
> The resulting table should yield the following:
> ["string fou","four"] [4.0]   nullC.P11   
> ["string one","eon"]  [1.0]   nullC.P11   
> ["string one","one"]  [1.0]   nullC.P11   
> ["string thr","three"][3.0]   nullC.P11   
> ["string two","otw"]  [2.0]   nullC.P11   
> ["string two","two"]  [2.0]   nullC.P11   
> ["string two","wott"] [2.0]   nullC.P11   
> ["Total","Total"] [15.0]  nullC.P20   
> ["Total"] [15.0]  nullC.P50   
> However what we get is:
> Hive Runtime Error while processing row 
> {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array 
> index out of bounds exception)
> If we reduce the above data to a SINGLE row, the we don't get an exception 
> but the total fields come out as NULL.
> The ONLY way this query would work is 
> 1) if I added a group by (date_id) or even group by ('') as the last line in 
> the query... or removed the last where clause for the final insert.  (The 
> reason why w

[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

2016-10-19 Thread Carl Steinbach (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-12412:
--
Labels: Correctness CorrectnessBug  (was: )

> Multi insert queries fail to run properly in hive 1.1.x or later.
> -
>
> Key: HIVE-12412
> URL: https://issues.apache.org/jira/browse/HIVE-12412
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.1.0
>Reporter: John P. Petrakis
>  Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data in one table and manipulate it by 
> inserting it into a results table.  Queries are of this form:
> from (select * from data_table lateral view explode(data_table.f2) f2 as 
> explode_f2) as explode_data_table  
>insert overwrite table results_table partition (q_id='C.P1',rl='1') 
>select 
>array(cast(if(explode_data_table.f1 is null or 
> explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as 
> String),cast(explode_f2.s1 as String)) as dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
>null as rownm 
>where (explode_data_table.date_id between 20151016 and 20151016)
>group by 
>if(explode_data_table.f1 is null or explode_data_table.f1='', 
> 'UNKNOWN',explode_data_table.f1),
>explode_f2.s1 
>INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
>SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS 
> dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>null AS rownm 
>WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
>INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
>SELECT 
>ARRAY(CAST('Total' as String)) AS dimensions, 
>ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>null AS rownm 
>WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> This query is meant to total a given field of a struct that is potentially a 
> list of structs.  For our test data set, which consists of a single row, the 
> summation yields "Null",  with messages in the hive log of the nature:
> Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
> or "Extra fields detected..."
> For significantly more data, this query will eventually cause a run time 
> error while processing a column (caused by array index out of bounds 
> exception in one of the lazy binary classes such as LazyBinaryString or 
> LazyBinaryStruct).
> Using the query above from the hive command line, the following data was used:
> (note there are tabs in the data below)
> string oneone:1.0:1.00:10.0,eon:1.0:1.00:100.0
> string twotwo:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
> string thrthree:3.0:3.00:30.0
> string foufour:4.0:4.00:40.0
> There are two fields, a string, (eg. 'string one') and a list of structs.  
> The following is used to create the table:
> create table if not exists t1 (
>  f1 string, 
>   f2 
> array>
>  )
>   partitioned by (clid string, date_id string) 
>   row format delimited fields 
>  terminated by '09' 
>  collection items terminated by ',' 
>  map keys terminated by ':'
>  lines terminated by '10' 
>  location '/user/hive/warehouse/t1';
> And the following is used to load the data:
> load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into 
> table t1  partition(client_id='987654321',date_id='20151016');
> The resulting table should yield the following:
> ["string fou","four"] [4.0]   nullC.P11   
> ["string one","eon"]  [1.0]   nullC.P11   
> ["string one","one"]  [1.0]   nullC.P11   
> ["string thr","three"][3.0]   nullC.P11   
> ["string two","otw"]  [2.0]   nullC.P11   
> ["string two","two"]  [2.0]   nullC.P11   
> ["string two","wott"] [2.0]   nullC.P11   
> ["Total","Total"] [15.0]  nullC.P20   
> ["Total"] [15.0]  nullC.P50   
> However what we get is:
> Hive Runtime Error while processing row 
> {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array 
> index out of bounds exception)
> If we reduce the above data to a SINGLE row, the we don't get an exception 
> but the total fields come out as NULL.
> The ONLY way this query would work is 
> 1) if I added a group by (date_id) or even group by ('') as the last line in 
> the query... or removed the last where clause for the final insert.  (The