Hi Marc,
Thanks for this. Here's the thing: let's say you have JSON that looks like
this:
{
  "foo": null
},
{
  "foo": 3.5
}
If you take the approach that `null` is treated like a string, you will get a
schema change exception when you read the next row. Our current approach is
basically to ignore fields whose data type Drill cannot figure out. Once Drill
encounters a value with a concrete data type, it assigns that type to the
column. See the example below, which is from DRILL-5033. I added a second row
to demonstrate what happens once Drill is able to determine a data type. Note
that for the columns that have a defined value in the second row, Drill returns
'null' in the first row, while the columns that never get a value come back as
empty lists ('[]').
[{
"intKey" : null,
"bgintKey": null,
"strKey": null,
"boolKey": null,
"fltKey": null,
"dblKey": null,
"timKey": null,
"dtKey": null,
"tmstmpKey": null,
"intrvldyKey": null,
"intrvlyrKey": null
},
{
"intKey" : 1,
"bgintKey": 3666565464,
"strKey": "hithere",
"boolKey": true,
"fltKey": 3.5,
"dblKey": 4.2,
"timKey": null,
"dtKey": null,
"tmstmpKey": null,
"intrvldyKey": null,
"intrvlyrKey": null
}]
select * from dfs.test.`nulls.json`;
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
| intKey | bgintKey      | strKey  | boolKey | fltKey | dblKey | timKey | dtKey | tmstmpKey | intrvldyKey | intrvlyrKey |
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
| null   | null          | null    | null    | null   | null   | []     | []    | []        | []          | []          |
| 1.0    | 3.666565464E9 | hithere | true    | 3.5    | 4.2    | []     | []    | []        | []          | []          |
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
2 rows selected (0.232 seconds)
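If you want to confirm what type Drill actually assigned to each column, the
typeof() function is handy. A quick sketch (the exact labels it prints vary a
bit between Drill versions):

-- sketch: inspect the per-value type Drill assigned in each row
select typeof(intKey) as intKey_type,
       typeof(strKey) as strKey_type,
       typeof(timKey) as timKey_type
from dfs.test.`nulls.json`;

On the first row you should see the null/untyped placeholder, and on the second
row the concrete types Drill picked.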
You are definitely welcome to submit a pull request; however, this area is
extremely complex, and I suspect that what you propose would break other unit
tests. Another option, which you might not be aware of, is providing a schema.
If Drill has a schema from the beginning, it knows what data types to expect
and never has to guess them from the first non-null value.
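For reference, schema provisioning looks roughly like this. This is only a
sketch: CREATE SCHEMA expects the table to be a directory it can write the
schema file into, and whether the JSON reader honors it depends on your Drill
version, so treat the table name below (a directory holding nulls.json) as
hypothetical and check the docs for your release:

-- sketch: declare column types up front (hypothetical directory-based table)
create or replace schema
  (intKey int, bgintKey bigint, strKey varchar, boolKey boolean,
   fltKey double, dblKey double, timKey time, dtKey date, tmstmpKey timestamp)
for table dfs.test.`nulls`;

With the types declared up front, the reader no longer has to wait for the
first non-null value to decide what a column is.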
Best,
-- C
> On Dec 28, 2022, at 8:57 AM, marc nicole <[email protected]> wrote:
>
> Hello Drillers :)
>
> I came across the aforementioned bug (DRILL-5033) and wanted to contribute.
> My attempt is to treat a *null* token as a *string* and print "null" as the
> column value instead of omitting the key from the output result set. Details
> of the attempted fix are below:
>
>
> *1)* In JsonReader.java (java-exec/drill-exec/vector/complex/fn/), at line
> 283 I add the following:
>
>> ...
>> case VALUE_NULL:
>>   // handle null as string
>>   handleString(parser, map, fieldName);
>>   break;
>> ...
>
>
> *2)* Then, at line 415, handleString() becomes:
>
>> private void handleString(JsonParser parser, MapWriter writer, String fieldName)
>>     throws IOException {
>>   try {
>>     // added the following if: when this is called from the VALUE_NULL case,
>>     // the current token is the JSON null, so write the literal string "null"
>>     if (parser.getCurrentToken() == JsonToken.VALUE_NULL) {
>>       writer.varChar(fieldName)
>>           .writeVarChar(0, workingBuffer.prepareVarCharHolder("null"),
>>               workingBuffer.getBuf());
>>     } else {
>>       writer.varChar(fieldName)
>>           .writeVarChar(0, workingBuffer.prepareVarCharHolder(parser.getText()),
>>               workingBuffer.getBuf());
>>     }
>>   } catch (IllegalArgumentException e) {
>>     if (parser.getText() == null || parser.getText().isEmpty()) {
>>       // return;
>>     }
>>     throw e;
>>   }
>> }
>
>
>
> Is this a possible fix for the mentioned bug?
> If so, should I open a pull request?
>
> Thanks.