[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.
[ https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niklaus Xiao updated HIVE-12412:
--------------------------------
    Affects Version/s: 2.3.0

> Multi insert queries fail to run properly in hive 1.1.x or later.
> -----------------------------------------------------------------
>
>                 Key: HIVE-12412
>                 URL: https://issues.apache.org/jira/browse/HIVE-12412
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.0, 2.3.0
>            Reporter: John P. Petrakis
>              Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data in one table and manipulate it by
> inserting it into a results table. Queries are of this form:
>
>     from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table
>     insert overwrite table results_table partition (q_id='C.P1',rl='1')
>     select
>         array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN', explode_data_table.f1) as String),
>               cast(explode_f2.s1 as String)) as dimensions,
>         ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics,
>         null as rownm
>     where (explode_data_table.date_id between 20151016 and 20151016)
>     group by
>         if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN', explode_data_table.f1),
>         explode_f2.s1
>     INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0')
>     SELECT
>         ARRAY(CAST('Total' as String), CAST('Total' as String)) AS dimensions,
>         ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
>         null AS rownm
>     WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
>     INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
>     SELECT
>         ARRAY(CAST('Total' as String)) AS dimensions,
>         ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
>         null AS rownm
>     WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
>
> This query is meant to total a given field of a struct that is potentially a list of structs.
> For our test data set, which consists of a single row, the summation yields "Null",
> with messages in the hive log of the nature:
>
>     Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
>
> or "Extra fields detected..."
>
> For significantly more data, this query will eventually cause a run time error
> while processing a column (caused by an array index out of bounds exception in
> one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).
>
> Using the query above from the hive command line, the following data was used
> (note: the two fields in each row below are separated by a tab):
>
>     string one	one:1.0:1.00:10.0,eon:1.0:1.00:100.0
>     string two	two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
>     string thr	three:3.0:3.00:30.0
>     string fou	four:4.0:4.00:40.0
>
> There are two fields: a string (e.g. 'string one') and a list of structs.
> The following is used to create the table:
>
>     create table if not exists t1 (
>         f1 string,
>         f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
>     )
>     partitioned by (clid string, date_id string)
>     row format delimited
>         fields terminated by '09'
>         collection items terminated by ','
>         map keys terminated by ':'
>         lines terminated by '10'
>     location '/user/hive/warehouse/t1';
>
> And the following is used to load the data:
>
>     load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE into
>     table t1 partition(client_id='987654321',date_id='20151016');
>
> The resulting table should yield the following:
>
>     ["string fou","four"]     [4.0]   null   C.P1   1
>     ["string one","eon"]      [1.0]   null   C.P1   1
>     ["string one","one"]      [1.0]   null   C.P1   1
>     ["string thr","three"]    [3.0]   null   C.P1   1
>     ["string two","otw"]      [2.0]   null   C.P1   1
>     ["string two","two"]      [2.0]   null   C.P1   1
>     ["string two","wott"]     [2.0]   null   C.P1   1
>     ["Total","Total"]         [15.0]  null   C.P2   0
>     ["Total"]                 [15.0]  null   C.P5   0
>
> However what we get is:
>
>     Hive Runtime Error while processing row
>     {"_col2":2.5306499719322744E-258,"_col3":""}
>
> (ultimately due to an array index out of bounds exception)
>
> If we reduce the above data to a SINGLE
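As a sanity check, the exploded rows can be inspected on their own before any aggregation. The standalone query below is a sketch assuming the t1 table and sample data above; the view alias `ev` is illustrative and not taken from the original report:

```sql
-- Sketch: materialize the lateral view so each struct element in f2
-- becomes its own row. With the 4-row sample data this should produce
-- seven rows, one per struct element (matching the C.P1 results above).
SELECT t1.f1, explode_f2.s1, explode_f2.d1
FROM t1
LATERAL VIEW explode(t1.f2) ev AS explode_f2
WHERE t1.date_id = '20151016';
```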
> row, then we don't get an exception but the total fields come out as NULL.
>
> The ONLY way this query would work is
> 1) if I added a group by (date_id) or even group by ('') as the last line in
>    the query... or removed the last where clause for the final insert. (The
>    reason why w
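Per the reporter, the failure disappears when the final insert is given an explicit, even constant, GROUP BY. A sketch of that workaround applied to the last insert clause of the query above (the exact form the reporter used is not shown, since the message is truncated):

```sql
-- Workaround sketch: add a constant GROUP BY to the final insert so the
-- global sum() is planned as a grouped aggregation, which the reporter
-- says avoids the NULL totals / ArrayIndexOutOfBoundsException.
INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
SELECT
    ARRAY(CAST('Total' as String)) AS dimensions,
    ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
    null AS rownm
WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
GROUP BY ''
```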
[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.
[ https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-12412:
----------------------------------
    Labels: Correctness CorrectnessBug  (was: )