[ 
https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723276#comment-13723276
 ] 

Chaoyu Tang commented on HIVE-4223:
-----------------------------------

[~java8964] I was not able to reproduce this problem in hive-0.9.0, and I 
wonder whether it might be related to the data. Here is my test case:
1. create table bcd (col1 array<struct<col1:string,col2:string,col3:string,col4:string,col5:string,col6:string,col7:string,col8:array<struct<col1:string,col2:string,col3:string,col4:string,col5:string,col6:string,col7:string,col8:string,col9:string>>>>)
   row format delimited fields terminated by '\001'
   collection items terminated by '\002'
   lines terminated by '\n'
   stored as textfile;
** this should be the same table definition as you described
2. load data local inpath '/root/nest_struct.data' overwrite into table bcd;
** see attached nest_struct.data
3. select col1 from bcd;
** got:
[{"col1":"c1v","col2":"c2v","col3":"c3v","col4":"c4v","col5":"c5v","col6":"c6v","col7":"c7v","col8":[{"col1":"c11v","col2":"c22v","col3":"c33v","col4":"c44v","col5":"c55v","col6":"c66v","col7":"c77v","col8":"c88v","col9":"c99v"}]}]
....

Did you see anything different in your case?
Could you please update the issue with your case so that I can give it a try?
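
For anyone who wants to build a comparable input row, here is a minimal Java sketch. The actual contents of nest_struct.data are not shown in this thread, so this is not that file. The class name NestStructRow and the helper buildRow are hypothetical, and the separator bytes per nesting level (\003 for the outer struct fields, \004 for the inner array elements, \005 for the inner struct fields) are my assumption about how LazySimpleSerDe assigns delimiters for the table above; please verify them against your Hive version.

```java
import java.util.ArrayList;
import java.util.List;

public class NestStructRow {
    // ASSUMED separators for nesting levels 0..4; not confirmed in this thread.
    static final char[] SEP = {'\u0001', '\u0002', '\u0003', '\u0004', '\u0005'};

    static String join(List<String> parts, char sep) {
        return String.join(String.valueOf(sep), parts);
    }

    // Build one row for table bcd: a single outer array element whose col8
    // holds `innerElements` inner structs of 9 string fields each.
    static String buildRow(int innerElements) {
        List<String> inner = new ArrayList<>();
        for (int e = 0; e < innerElements; e++) {
            List<String> fields = new ArrayList<>();
            for (int i = 1; i <= 9; i++) {
                fields.add("c" + i + i + "v"); // c11v .. c99v
            }
            inner.add(join(fields, SEP[4]));
        }
        List<String> outer = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            outer.add("c" + i + "v"); // c1v .. c7v
        }
        outer.add(join(inner, SEP[3])); // col8: the inner array
        // Only one outer array element here; several would be joined with SEP[1].
        return join(outer, SEP[2]);
    }

    public static void main(String[] args) {
        // Print the row with control bytes made visible as \00N.
        StringBuilder visible = new StringBuilder();
        for (char c : buildRow(2).toCharArray()) {
            if (c < 32) {
                visible.append("\\00").append((int) c);
            } else {
                visible.append(c);
            }
        }
        System.out.println(visible);
    }
}
```

Calling buildRow(2) gives two elements in the inner array, which matches the row shape described in the issue. The raw string, followed by a newline, would be one line of the data file.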

 
                
> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of 
> hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>         Attachments: nest_struct.data
>
>
> LazySimpleSerDe throws an IndexOutOfBoundsException if the column type is 
> a struct containing an array of structs. 
> I have a table with one column defined like this:
> columnA
> array <
>     struct<
>        col1:primiType,
>        col2:primiType,
>        col3:primiType,
>        col4:primiType,
>        col5:primiType,
>        col6:primiType,
>        col7:primiType,
>        col8:array<
>             struct<
>               col1:primiType,
>               col2:primiType,
>               col3:primiType,
>               col4:primiType,
>               col5:primiType,
>               col6:primiType,
>               col7:primiType,
>               col8:primiType,
>               col9:primiType
>             >
>        >
>     >
> >
> In this example, the outer struct has 8 columns (including the array), and 
> the inner struct has 9 columns. As long as the outer struct has FEWER 
> columns than the inner struct, I think we will get the following exception 
> (see the stack trace) in LazySimpleSerDe when it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>         ... 9 more
> I am not sure of the exact cause of this problem. I believe that 
> public static void serialize(ByteStream.Output out, Object obj, 
> ObjectInspector objInspector, byte[] separators, int level, Text 
> nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape) 
> invokes itself recursively when it encounters a nested structure. But for 
> a nested struct, the list reference gets messed up, and size() returns 
> the wrong value.
> In the example above, for these 2 lines:
>       List<? extends StructField> fields = soi.getAllStructFieldRefs();
>       list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data from 
> getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, 
> for one row, for the outer 8-column struct, I have 2 elements in the 
> inner array of structs, and each element has 9 columns (as there are 9 
> columns in the inner struct). At runtime, after I added more logging to 
> LazySimpleSerDe, I see the following behavior in the log:
> for 8 outside column, loop
>     for 9 inside columns, loop for serialize
>     for 9 inside columns, loop for serialize
> The code breaks here: the outer loop tries to access a 9th element, which 
> does not exist in the outer struct, as you can see from the stack trace, 
> where it tried to access index 8 of a list of size 8.
> I changed the following line of code, which appears to fix the problem. I 
> don't know if it is the right fix, but it did resolve the problem for me, 
> and I made the change against the hive 0.9.0 code:
> 481c481,482
> <         for (int i = 0; i < list.size(); i++) {
> ---
> >         int listSize = list.size();
> >         for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that with the current code,
>         for (int i = 0; i < list.size(); i++)
> list.size() is invoked on every iteration. But with a nested 
> structure, list.size() returns a different result after the recursive 
> call, and that causes the problem I am facing.
> Thanks
> Yong Zhang
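
The hypothesis in the quoted description can be illustrated with a small standalone Java sketch. This is a toy model, not Hive code: the class LoopBoundSketch, the shared buffer SHARED, and the methods fill and walk are all hypothetical, and the sketch simply assumes that the data list seen by the outer loop is refilled by a nested call. Under that assumption, re-reading size() in the loop condition lets the index run past the 8 outer field references, while hoisting the size into a local (as in the quoted diff) does not.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopBoundSketch {
    // Hypothetical reusable buffer standing in for a data list that a
    // nested call refills. This is an assumption, not Hive's actual code.
    private static final List<Object> SHARED = new ArrayList<>();

    static final int OUTER_FIELDS = 8;
    static final int INNER_FIELDS = 9;

    // Refill the shared buffer with n placeholder values and return it.
    static List<Object> fill(int n) {
        SHARED.clear();
        for (int i = 0; i < n; i++) {
            SHARED.add("v" + i);
        }
        return SHARED;
    }

    // Walk the outer struct; handling its last field refills the buffer
    // with INNER_FIELDS values, as a nested serialize call might.
    static int walk(boolean hoistSize) {
        List<String> fieldRefs = new ArrayList<>();
        for (int i = 0; i < OUTER_FIELDS; i++) {
            fieldRefs.add("col" + (i + 1));
        }
        List<Object> data = fill(OUTER_FIELDS);
        int fixedSize = data.size(); // the hoisted bound from the diff
        int visited = 0;
        for (int i = 0; i < (hoistSize ? fixedSize : data.size()); i++) {
            fieldRefs.get(i); // fails at i == 8 when the bound has grown to 9
            if (i == OUTER_FIELDS - 1) {
                fill(INNER_FIELDS); // nested call changes the shared list
            }
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("hoisted bound visited " + walk(true) + " fields");
        try {
            walk(false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("re-read bound failed: " + e.getMessage());
        }
    }
}
```

Whether Hive's list really is shared in this way is exactly what would need to be confirmed with a reproducible data file.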

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
