[ 
https://issues.apache.org/jira/browse/DRILL-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875144#comment-15875144
 ] 

ASF GitHub Bot commented on DRILL-4842:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/594
  
    Three general rules to keep in mind in the current JSON reader 
implementation:
    
    * Drill can remember the past. (Once a type has been seen for a column, 
Drill will remember that type.)
    * Drill cannot predict the future. (If a type has not been seen for a 
column by the end of a record batch, Drill cannot predict what type will appear 
in some later batch.)
    * Drill can amend the past within a single record batch. (If a batch starts 
with nulls, but later a type is seen, the earlier positions are automatically 
filled with nulls of that type.)
    
    The actual implementation of the JSON reader, and of the value writers it 
relies on, is complex. As we read JSON values, we ask a type-specific 
writer to set each value into the value vector. Each writer marks the column 
position as non-null, then adds the value. Any values not so set default to null.
    
    Consider a file with five null "c1" values followed by a string value "foo" 
for that field. The five nulls are ignored. When we see the non-null c1, the 
code creates a VarChar vector and sets the 6th value to the string "foo". Doing 
so automatically marks the previous five column values as null.
    
    Suppose we have a file with a single string value "foo" for column "c1", 
followed by five nulls. In this case, the first value creates and sets the 
VarChar vector as before. Later, at the end of reading the record batch, the 
reader sets the record count for the vectors. This action, on the VarChar 
vector, has the effect of setting the trailing five column values to null.
    
    Since values default to null, we get both of these behaviors for 
free. The result is that if a record batch contains even a single non-null 
value for a field, that column will be fully populated with nulls for all other 
records in the same batch.
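
    The two cases above can be sketched with a toy model. This is not Drill's 
vector or writer code: `NullableColumn`, `set`, `set_value_count` and 
`read_batch` are hypothetical names, and a Python list stands in for a 
nullable VarChar vector. It only illustrates that slots default to null, 
whether they come before or after the first real value.

```python
class NullableColumn:
    """Toy stand-in for a nullable value vector; unset slots are null (None)."""

    def __init__(self):
        self.values = []

    def set(self, index, value):
        # Writing at `index` implicitly back-fills every skipped slot with null.
        while len(self.values) <= index:
            self.values.append(None)
        self.values[index] = value

    def set_value_count(self, count):
        # Setting the record count pads the trailing slots with null.
        while len(self.values) < count:
            self.values.append(None)


def read_batch(records, field):
    """Read one batch; the column is created only when a non-null value appears."""
    column = None
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None:
            continue  # nulls are ignored, nothing is written
        if column is None:
            column = NullableColumn()
        column.set(index, value)
    if column is not None:
        column.set_value_count(len(records))
    return column


leading = read_batch([{"c1": None}] * 5 + [{"c1": "foo"}], "c1")
trailing = read_batch([{"c1": "foo"}] + [{"c1": None}] * 5, "c1")
all_null = read_batch([{"c1": None}] * 3, "c1")
```

    In this sketch `leading` and `trailing` both end up with six slots, five 
of them null, while `all_null` is `None`: with nothing but nulls, no column 
is ever created.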
    
    This gets us back to the same old problem in Drill: if all we see are 
nulls, Drill needs to know "null of what type?", while in JSON the value is 
just null. The JIRA tickets linked to this ticket all relate to that same 
underlying issue.
    
    There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479, 
DRILL-3806 and more.
    
    This fix affects only "all text mode." In that mode, Drill creates a 
VarChar column regardless of the JSON type, which allows a very simple fix: 
since all columns are VarChar, when we see a new column with a null value, we 
just create a VarChar column. (No need to set the column to null.)
    
    That is, we can "predict the future" for nulls because *all* columns are 
VarChar -- so there is not much to predict.
    
    Otherwise, we have to stick with Jacques' design decision in DRILL-1256: 
"Drill's perspective is a non-existent column and a column with no value are 
equivalent." A record batch of all nulls, followed by a record batch with a 
non-null value, will cause a schema change.
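
    As a rough illustration of the difference between the two modes, here is 
a toy reader that tracks the schema it would report after each batch. The 
function name `read_batches`, the `all_text_mode` flag and the `TOY_TYPES` 
mapping are assumptions made for this sketch, not Drill APIs.

```python
# Toy mapping from Python value types to column type names (illustrative only).
TOY_TYPES = {str: "VARCHAR", int: "BIGINT", float: "FLOAT8"}


def read_batches(batches, all_text_mode=False):
    """Return the schema known at the end of each batch."""
    known = {}    # types seen so far: the reader can remember the past
    schemas = []
    for records in batches:
        for record in records:
            for name, value in record.items():
                if name in known:
                    continue
                if all_text_mode:
                    # Every column is VarChar, so even a null fixes the type.
                    known[name] = "VARCHAR"
                elif value is not None:
                    # Otherwise a type is only known once a real value shows up.
                    known[name] = TOY_TYPES[type(value)]
        schemas.append(dict(known))
    return schemas


batches = [[{"c1": None}] * 3, [{"c1": "Hello World"}]]
default_schemas = read_batches(batches)
text_schemas = read_batches(batches, all_text_mode=True)
```

    With the default settings the first batch reports no column at all and 
the second reports `c1` as VARCHAR, which is the schema change described 
above. In all text mode both batches report the same schema.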
    
    Again, Drill needs a "null" type that is compatible with all other types in 
order to support JSON semantics. (It also needs to differentiate between 
value-exists-and-is-null and value-does-not-exist.)
    
    Yet another solution is to have the user tell us their intent. The [JSON 
Schema](http://jsonschema.net) project provides a way to express the expected 
schema so that Drill would know up front the type of each column (and whether 
the column is really nullable).
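
    To make the idea concrete, here is a sketch of how a schema hint could be 
turned into column types before any data is read. It is purely hypothetical: 
this is not something Drill does, and `columns_from_schema` and `TOY_TYPES` 
are invented names. The only JSON Schema features assumed are the 
`properties` and `type` keywords, where `type` may be a single name or a list 
that includes "null".

```python
import json

# A small JSON Schema document declaring c1 as a nullable string.
SCHEMA_TEXT = """
{
  "type": "object",
  "properties": {
    "c1": {"type": ["string", "null"]}
  }
}
"""

# Toy mapping from JSON Schema type names to column types (illustrative only).
TOY_TYPES = {
    "string": "VARCHAR",
    "integer": "BIGINT",
    "number": "FLOAT8",
    "boolean": "BIT",
}


def columns_from_schema(schema):
    """Map each declared property to a (column type, nullable) pair."""
    columns = {}
    for name, spec in schema.get("properties", {}).items():
        declared = spec["type"]
        types = declared if isinstance(declared, list) else [declared]
        nullable = "null" in types
        concrete = [t for t in types if t != "null"]
        columns[name] = (TOY_TYPES[concrete[0]], nullable)
    return columns


columns = columns_from_schema(json.loads(SCHEMA_TEXT))
```

    Here `columns` maps `c1` to a nullable VARCHAR, so a reader given this 
hint would not have to wait for the first non-null value to learn the type.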
    



> SELECT * on JSON data results in NumberFormatException
> ------------------------------------------------------
>
>                 Key: DRILL-4842
>                 URL: https://issues.apache.org/jira/browse/DRILL-4842
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.2.0
>            Reporter: Khurram Faraaz
>            Assignee: Serhii Harnyk
>         Attachments: tooManyNulls.json
>
>
> Note that doing SELECT c1 returns correct results, the failure is seen when 
> we do SELECT star. json.all_text_mode was set to true.
> JSON file tooManyNulls.json has one key c1 with 4096 nulls as its value and 
> the 4097th key c1 has the value "Hello World"
> git commit ID : aaf220ff
> MapR Drill 1.8.0 RPM
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.all_text_mode`=true;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | store.json.all_text_mode updated.  |
> +-------+------------------------------------+
> 1 row selected (0.27 seconds)
> 0: jdbc:drill:schema=dfs.tmp> SELECT c1 FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> +--------------+
> |      c1      |
> +--------------+
> | Hello World  |
> +--------------+
> 1 row selected (0.243 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select * FROM `tooManyNulls.json` WHERE c1 IN ('Hello World');
> Error: SYSTEM ERROR: NumberFormatException: Hello World
> Fragment 0:0
> [Error Id: 9cafb3f9-3d5c-478a-b55c-900602b8765e on centos-01.qa.lab:31010]
>  (java.lang.NumberFormatException) Hello World
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt():120
>     org.apache.drill.exec.test.generated.FiltererGen1169.doSetup():45
>     org.apache.drill.exec.test.generated.FiltererGen1169.setup():54
>     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():195
>     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():107
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():415
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745 (state=,code=0)
> 0: jdbc:drill:schema=dfs.tmp>
> {noformat}
> Stack trace from drillbit.log
> {noformat}
> Caused by: java.lang.NumberFormatException: Hello World
>         at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI(StringFunctionHelpers.java:95) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varTypesToInt(StringFunctionHelpers.java:120) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.test.generated.FiltererGen1169.doSetup(FilterTemplate2.java:45) ~[na:na]
>         at org.apache.drill.exec.test.generated.FiltererGen1169.setup(FilterTemplate2.java:54) ~[na:na]
>         at org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer(FilterRecordBatch.java:195) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema(FilterRecordBatch.java:107) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:251) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
