[PR] HIVE-28262:Single column use MultiDelimitSerDe parse column error [hive]

via GitHub Thu, 16 May 2024 00:55:13 -0700


LaughingVzr opened a new pull request, #5252:
URL: https://github.com/apache/hive/pull/5252


   ### What changes were proposed in this pull request?
   Modify the `LazyStruct#findIndexes` and `LazyStruct#parseMultiDelimit` functions, changing the `fields.length` conditions and the loop bound:
   ```java
   public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
       ...
   -   if (fields.length > 1 && delimitIndexes[i - 1] != -1) {
   +   if (delimitIndexes[i - 1] != -1) {
       ...
   }

   private int[] findIndexes(byte[] array, byte[] target) {
   -   if (fields.length <= 1) {
   +   if (fields.length < 1) {
       ...
   -   for (int i = 1; i < indexes.length; i++) {
   +   for (int i = 1; i <= indexes.length; i++) {
           ...
       }
       return indexes;
   }
   ```
   
   I added a test for this fix:
   ```java
   @Test
   public void testParseMultiDelimit() throws Throwable {
       try {
           // Single column named "id"
           List<String> columns = new ArrayList<>();
           columns.add("id");
           // Column type is string
           List<TypeInfo> columnTypes = new ArrayList<>();
           PrimitiveTypeInfo primitiveTypeInfo = new PrimitiveTypeInfo();
           primitiveTypeInfo.setTypeName("string");
           columnTypes.add(primitiveTypeInfo);

           // Separators; the first one (124) is "|"
           byte[] separators = new byte[]{124, 2, 3, 4, 5, 6, 7, 8};

           // Null sequence => "\N"
           Text sequence = new Text();
           sequence.set(new byte[]{92, 78});

           // Create a lazy struct inspector
           ObjectInspector objectInspector = LazyFactory.createLazyStructInspector(
                   columns, columnTypes, separators, sequence, false, false, (byte) '0');
           LazyStruct lazyStruct = (LazyStruct) LazyFactory.createLazyObject(objectInspector);

           // Original row data
           String rowData = "1|@|";
           // Multi-character field delimiter
           String fieldDelimiter = "|@|";

           // Parse the row using the multi-character delimiter
           lazyStruct.parseMultiDelimit(rowData.getBytes(StandardCharsets.UTF_8),
                   fieldDelimiter.getBytes(StandardCharsets.UTF_8));

           // Check the start positions recorded for the first and second entries
           // Result before the fix: 0,1
           // Result after the fix: 0,2
           Assert.assertArrayEquals(new int[]{0, 2}, lazyStruct.startPosition);
       } catch (Throwable e) {
           e.printStackTrace();
           throw e;
       }
   }
   ```
   
   ### Why are the changes needed?
   If a table has only one column and uses MultiDelimitSerDe, querying that column returns incorrect data.
   When this data is passed to another operation (e.g. a cast using the UDFToLong function), the result is NULL.
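   
   To make the expected offsets easier to follow, here is a minimal standalone sketch of the idea. It is not the `LazyStruct` code: the class `MultiDelimitSketch` and its methods `indexOf` and `startPositions` are hypothetical names, and the conventions (offsets computed as if each multi-byte delimiter were collapsed to one byte, a missing delimiter mapped to the collapsed row length plus one, a non-empty delimiter) are assumptions chosen to match the `{0, 2}` result asserted in the test above.
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;
   
   // Hypothetical standalone sketch; not the Hive LazyStruct implementation.
   public class MultiDelimitSketch {
   
       // Index of the first occurrence of target in array at or after from, or -1.
       static int indexOf(byte[] array, byte[] target, int from) {
           for (int i = from; i + target.length <= array.length; i++) {
               int j = 0;
               while (j < target.length && array[i + j] == target[j]) {
                   j++;
               }
               if (j == target.length) {
                   return i;
               }
           }
           return -1;
       }
   
       // Returns fieldCount + 1 offsets, computed as if every multi-byte delimiter
       // had been collapsed to a single byte. Field i then spans
       // [starts[i], starts[i + 1] - 1) in the collapsed row (assumed convention).
       static int[] startPositions(byte[] row, byte[] delimiter, int fieldCount) {
           int shrink = delimiter.length - 1; // bytes saved per collapsed delimiter
           int[] starts = new int[fieldCount + 1]; // starts[0] is always 0
           int searchFrom = 0;
           int found = 0; // delimiters located so far
           for (int i = 1; i <= fieldCount; i++) {
               int idx = indexOf(row, delimiter, searchFrom);
               if (idx >= 0) {
                   found++;
                   starts[i] = idx + delimiter.length - i * shrink;
                   searchFrom = idx + delimiter.length;
               } else {
                   // No further delimiter: point one past the collapsed row end.
                   starts[i] = row.length - found * shrink + 1;
                   searchFrom = row.length; // nothing left to search
               }
           }
           return starts;
       }
   
       public static void main(String[] args) {
           byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
           byte[] delimiter = "|@|".getBytes(StandardCharsets.UTF_8);
           // Single column: the trailing delimiter must still be located.
           System.out.println(Arrays.toString(startPositions(row, delimiter, 1))); // prints [0, 2]
       }
   }
   ```
   
   In this sketch the single field spans offsets 0 to 1, i.e. the value `1` without the trailing `|@|`. If the delimiter search is skipped whenever there is only one field, the trailing delimiter is never accounted for, which is the situation this PR addresses.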
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### Is the change a dependency upgrade?
   No
   
   ### How was this patch tested?
   test class: serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyStruct.java
   test function: org.apache.hadoop.hive.serde2.lazy.TestLazyStruct#testParseMultiDelimit
   test command: mvn test -Dtest=TestLazyStruct --pl serde
   


