[ https://issues.apache.org/jira/browse/DRILL-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872917#comment-15872917 ]
ASF GitHub Bot commented on DRILL-4842:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/594

The bug here is fundamental to the way Drill works with JSON. We already had an extensive discussion around this area in another PR.

The problem is that JSON supports a null type which is independent of all other types. In JSON, a null is not a "null int" or a "null string" -- it is just null. Drill must infer a type for a field. This leads to all kinds of grief when a file contains a run of nulls before the real value:

{code}
{ id: 1, b: null }
...
{ id: 80000, b: "gee, I'm a string!" }
{code}

Drill must do something with the leading values. "b" is a null... what? Int? String?

We've had many bugs in this area. The bugs are not just code bugs, they represent a basic incompatibility between Drill and JSON. This fix is yet another attempt to work around the limitation, but cannot overcome the basic incompatibility.

What we are doing, it seems, is building a list of fields that have seen only null values, deferring action on those fields until later. That works fine if "later" occurs in the same record batch. It is not clear what happens if we get to the end of the batch (as in the example above), but have never seen the type of the field: what type of vector do we create?

There are several solutions.

One is to have a "null" type in Drill. When we see the initial run of nulls, we simply create a field of the "null" type. We have type conversion rules that say that a "null" vector can be coerced into any other type when we ultimately see the type. (And, if we don't see a type in one batch, we can pass the null vector along upstream for later reconciliation.) This is a big change; too big for a bug fix.

Another solution, used here, is to keep track of "null only" fields, to defer the decision for later. That has a performance impact.
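
As a rough illustration of the deferral approach just described, the sketch below infers one type per field within a single batch and reports the fields that are still "null only" when the batch ends. It is not Drill's code: the function name `infer_batch_types`, the use of Python type names in place of vector types, and the dict-per-record input are all assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple


def infer_batch_types(
    records: List[Dict[str, Any]],
) -> Tuple[Dict[str, str], List[str]]:
    """Infer a type name per field for one batch (illustrative only).

    Fields that have only produced nulls are parked in a "null only"
    set and resolved as soon as a non-null value shows up. Whatever is
    still parked at the end of the batch has no known type.
    """
    resolved: Dict[str, str] = {}
    null_only = set()  # hypothetical bookkeeping for deferred fields

    for record in records:
        for name, value in record.items():
            if name in resolved:
                continue  # type already decided for this batch
            if value is None:
                null_only.add(name)  # defer the decision
            else:
                resolved[name] = type(value).__name__
                null_only.discard(name)  # deferral resolved in time

    return resolved, sorted(null_only)


# The type arrives inside the batch, so the deferral works.
same_batch = [{"id": 1, "b": None}, {"id": 2, "b": "gee, I'm a string!"}]
types_ok, pending_ok = infer_batch_types(same_batch)

# The batch ends while "b" has only ever been null.
nulls_only = [{"id": 1, "b": None}, {"id": 2, "b": None}]
types_open, pending_open = infer_batch_types(nulls_only)
```

In the second call, `b` is still pending when the batch ends, which is the open question raised above: some vector type has to be chosen before the batch is handed on.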
A third solution is to go ahead and create a vector of any type, keep setting its values to null (as if we had already seen the field type), but be ready to discard that vector and convert it to the proper type once we see that type. In this way, we treat null fields just as any other up to the point where we realize we have a type conflict. Only then do we check the "null only" map and decide we can quietly convert the vector type to the proper type.

These are the initial thoughts. I'll add more nuanced comments as I review the code in more detail.

> SELECT * on JSON data results in NumberFormatException
> ------------------------------------------------------
>
>                 Key: DRILL-4842
>                 URL: https://issues.apache.org/jira/browse/DRILL-4842
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.2.0
>            Reporter: Khurram Faraaz
>            Assignee: Serhii Harnyk
>              Labels: ready-to-commit
>         Attachments: tooManyNulls.json
>
>
> Note that doing SELECT c1 returns correct results, the failure is seen when we do SELECT star. json.all_text_mode was set to true.
> JSON file tooManyNulls.json has one key c1 with 4096 nulls as its value and the 4097th key c1 has the value "Hello World"
> git commit ID : aaf220ff
> MapR Drill 1.8.0 RPM
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.all_text_mode`=true;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | store.json.all_text_mode updated.  |
> +-------+------------------------------------+
> 1 row selected (0.27 seconds)
> 0: jdbc:drill:schema=dfs.tmp> SELECT c1 FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> +--------------+
> |      c1      |
> +--------------+
> | Hello World  |
> +--------------+
> 1 row selected (0.243 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select * FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> Error: SYSTEM ERROR: NumberFormatException: Hello World
> Fragment 0:0
> [Error Id: 9cafb3f9-3d5c-478a-b55c-900602b8765e on centos-01.qa.lab:31010]
>   (java.lang.NumberFormatException) Hello World
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt():120
>     org.apache.drill.exec.test.generated.FiltererGen1169.doSetup():45
>     org.apache.drill.exec.test.generated.FiltererGen1169.setup():54
>     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():195
>     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():107
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():415
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745 (state=,code=0)
> 0: jdbc:drill:schema=dfs.tmp>
> {noformat}
> Stack trace from drillbit.log
> {noformat}
> Caused by: java.lang.NumberFormatException: Hello World
>         at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI(StringFunctionHelpers.java:95) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt(StringFunctionHelpers.java:120) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.test.generated.FiltererGen1169.doSetup(FilterTemplate2.java:45) ~[na:na]
>         at org.apache.drill.exec.test.generated.FiltererGen1169.setup(FilterTemplate2.java:54) ~[na:na]
>         at org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer(FilterRecordBatch.java:195) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema(FilterRecordBatch.java:107) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:251) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)