John Omernik created DRILL-4284:
-----------------------------------

             Summary: Complex Data Causing Index Out of Bounds with UNION Type 
                 Key: DRILL-4284
                 URL: https://issues.apache.org/jira/browse/DRILL-4284
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - JSON
    Affects Versions: 1.4.0
            Reporter: John Omernik


Working with complex JSON data and the UNION type has shown an Index Out of 
Bounds error when trying to read data.  Here are the posts from the Drill User 
Group:

After getting some pointers on the new experimental UNION type with JSON, I 
started getting a different error related to index out of bounds. I thought I'd 
post here to determine what it could be and, if it is a bug, I can then open a 
JIRA. 

So first, I did:

-- So I could get full errors
ALTER SESSION SET `exec.errors.verbose` = true;
-- So I could use the experimental UNION type
ALTER SESSION SET `exec.enable_union_type` = true;

Now, my first query, select * from `/data/prod/src/`, gave me the errors below. 
The files change and, ironically, if I select directly from any specific file 
(even the ones in the error), the query often works fine.  It's going through 
a directory of files that causes the error. Sometimes I can do multiple files, 
but often I come to one file and it seems to break.  The file that breaks 
things doesn't look different from the others and, at the same time, I can 
select directly from that file and it works... weird.  Let me know if I can do 
anything to help troubleshoot more. 

Data Notes (see example below): 
- The ... represents LOTS of other fields, some simple, some complex/nested. 
This data is NOT pretty. 
- The files are goofy in that each file has one top level field of "count" then 
a huge array of events
- The field that is ALWAYS named in the error (as far as I've seen) is the 
"features" field
- This field will sometimes be an array and sometimes be an empty object, {}.
- The size of the array for the features field (when not an empty object) does 
change from event to event.  (My hunch is that the issue is there.)
- This occurs even if I don't reference the features field, say I am trying to 
flatten a different field at the same level as features. 
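
The shape described in the notes above can be sketched as a small generator 
of test files, which may help when trying to build a shareable repro. This is 
only a sketch under assumptions: the helper names, the "other_field" 
placeholder (standing in for the elided fields), and the file names are all 
hypothetical, and it is not known whether data of this shape alone is enough 
to trigger the error.

```python
import json
import os
import random


def make_event(rng, empty_features):
    """Build one event; "features" is either {} or a variable-length array."""
    if empty_features:
        features = {}
    else:
        size = rng.randint(1, 5)  # array size changes from event to event
        features = [{"count": rng.randint(1, 30), "name": "feature%d" % (i + 1)}
                    for i in range(size)]
    # "other_field" is a hypothetical stand-in for the many elided fields.
    return {"other_field": rng.randint(0, 10**6), "features": features}


def write_repro_files(directory, num_files=3, events_per_file=4, seed=0):
    """Write files that each hold one top-level "count" and an "events" array."""
    rng = random.Random(seed)
    paths = []
    for n in range(num_files):
        # Alternate between an array and an empty object for "features".
        events = [make_event(rng, empty_features=(i % 2 == 1))
                  for i in range(events_per_file)]
        doc = {"count": len(events), "events": events}
        path = os.path.join(directory, "file%d.json" % (n + 1))
        with open(path, "w") as handle:
            json.dump(doc, handle, indent=2)
        paths.append(path)
    return paths
```

Calling write_repro_files on an empty directory and then pointing the same 
select * at that directory, with exec.enable_union_type set, would show 
whether the mixed array/empty-object field is sufficient on its own.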

Error:

Error: DATA_READ ERROR: index: 0, length: 4 (expected: range(0, 0))
 
File  /data/prod/src/file1.json
Record  1
Line  193
Column  34
Field  feature
Fragment 0:0
 
[Error Id: 25a2c963-86db-40e9-b5cc-2674887de2fe on node7:31010]
 
  (java.lang.IndexOutOfBoundsException) index: 0, length: 4 (expected: range(0, 0))
    io.netty.buffer.DrillBuf.checkIndexD():175
    io.netty.buffer.DrillBuf.chk():197
    io.netty.buffer.DrillBuf.getInt():477
    org.apache.drill.exec.vector.UInt4Vector$Accessor.get():356
    org.apache.drill.exec.vector.complex.ListVector$Mutator.startNewValue():305
    org.apache.drill.exec.vector.complex.impl.UnionListWriter.startList():563
    org.apache.drill.exec.vector.complex.impl.AbstractPromotableFieldWriter.startList():126
    org.apache.drill.exec.vector.complex.impl.PromotableWriter.startList():42
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():461
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():470
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch():240
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector():178
    org.apache.drill.exec.vector.complex.fn.JsonReader.write():144
    org.apache.drill.exec.store.easy.json.JSONRecordReader.next():191
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)



Example Data:

{
  "count": 241,
  "events": [
    {
      ...
      ...
      ...
      "features": [
        {
          "count": 3,
          "name": "feature1"
        },
        {
          "count": 30,
          "name": "feature2"
        },
        {
          "count": 2,
          "name": "feature3"
        },
        {
          "count": 3,
          "name": "feature4"
        }
      ],
      ...
      ...
    },
    {
      ...
      ...
      ...
      "features": {},
      ...
    },
    {
      ...
      ...
      ...
      "features": [
        {
          "count": 3,
          "name": "feature1"
        },
        {
          "count": 30,
          "name": "feature2"
        },
        {
          "count": 2,
          "name": "feature3"
        }
      ],
      ...
      ...
    }
  ]
}
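
To help narrow down which file in the directory differs from the others, a 
small script can tally the shapes the "features" field takes in each file. 
This is a rough sketch, assuming every file is a single JSON document with a 
top-level "events" array as in the example; the function names are 
hypothetical and nothing here is part of Drill.

```python
import glob
import json
import os
from collections import Counter


def features_shapes(path, field="features"):
    """Count the JSON shapes that `field` takes across the events of one file."""
    with open(path) as handle:
        doc = json.load(handle)
    shapes = Counter()
    for event in doc.get("events", []):
        if field not in event:
            shapes["missing"] += 1
            continue
        value = event[field]
        if isinstance(value, list):
            # Record the array length, since it varies from event to event.
            shapes["array[%d]" % len(value)] += 1
        elif isinstance(value, dict):
            shapes["empty object" if not value else "object"] += 1
        else:
            shapes[type(value).__name__] += 1
    return shapes


def scan_directory(directory, field="features"):
    """Return {file name: Counter of shapes} for every *.json file found."""
    pattern = os.path.join(directory, "*.json")
    return {os.path.basename(path): features_shapes(path, field)
            for path in sorted(glob.glob(pattern))}
```

Comparing the per-file counters for a file that breaks against one that does 
not might show whether, for instance, the failing file is the first one where 
"features" starts out as an empty object.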



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
