Hi Marc,
Thanks for this. Here's the thing: let's say you have JSON that looks like
this:
{
  "foo": null
},
{
  "foo": 3.5
}
If you take the approach that `null` is treated like a string, you will get a
schema change exception when you read the next row. Our current approach is
basically to ignore fields whose data type Drill cannot figure out. Once Drill
encounters a value with a concrete data type, it assigns that type to the
column. See the example below, which is from DRILL-5033. I added a second row
to demonstrate what happens once Drill is able to determine a data type. Note
that for the columns that have a defined value in the second row, Drill returns
'null' in the first row, while the columns that never get a value come back as
empty lists ('[]').
[{
"intKey" : null,
"bgintKey": null,
"strKey": null,
"boolKey": null,
"fltKey": null,
"dblKey": null,
"timKey": null,
"dtKey": null,
"tmstmpKey": null,
"intrvldyKey": null,
"intrvlyrKey": null
},
{
"intKey" : 1,
"bgintKey": 3666565464,
"strKey": "hithere",
"boolKey": true,
"fltKey": 3.5,
"dblKey": 4.2,
"timKey": null,
"dtKey": null,
"tmstmpKey": null,
"intrvldyKey": null,
"intrvlyrKey": null
}]
select * from dfs.test.`nulls.json`;
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
| intKey | bgintKey      | strKey  | boolKey | fltKey | dblKey | timKey | dtKey | tmstmpKey | intrvldyKey | intrvlyrKey |
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
| null   | null          | null    | null    | null   | null   | []     | []    | []        | []          | []          |
| 1.0    | 3.666565464E9 | hithere | true    | 3.5    | 4.2    | []     | []    | []        | []          | []          |
+--------+---------------+---------+---------+--------+--------+--------+-------+-----------+-------------+-------------+
2 rows selected (0.232 seconds)
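If you want to confirm what type Drill actually assigned to each column, the
typeof() function is handy. A quick sketch (the exact labels it prints vary a
bit between Drill versions):

-- sketch: inspect the per-value type Drill assigned in each row
select typeof(intKey) as intKey_type,
       typeof(strKey) as strKey_type,
       typeof(timKey) as timKey_type
from dfs.test.`nulls.json`;

On the first row you should see the null/untyped placeholder, and on the second
row the concrete types Drill picked.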
You are definitely welcome to submit a pull request; however, this area is
extremely complex, and I suspect that what you propose would break other unit
tests. Another option, which you might not be aware of, is providing a schema.
If Drill has a schema from the beginning, it knows what data types to expect
and never has to guess them from the first non-null value.
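For reference, schema provisioning looks roughly like this. This is only a
sketch: CREATE SCHEMA expects the table to be a directory it can write the
schema file into, and whether the JSON reader honors it depends on your Drill
version, so treat the table name below (a directory holding nulls.json) as
hypothetical and check the docs for your release:

-- sketch: declare column types up front (hypothetical directory-based table)
create or replace schema
  (intKey int, bgintKey bigint, strKey varchar, boolKey boolean,
   fltKey double, dblKey double, timKey time, dtKey date, tmstmpKey timestamp)
for table dfs.test.`nulls`;

With the types declared up front, the reader no longer has to wait for the
first non-null value to decide what a column is.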
Best,
-- C
> On Dec 28, 2022, at 8:57 AM, marc nicole <[email protected]> wrote:
>
> Hello Drillers :)
>
> I came across the aforementioned bug (DRILL-5033) and wanted to contribute.
> My attempt is to treat a *null* token as a *string* and print "null" as the
> column value instead of omitting the key from the output result set. Details
> of the attempted fix are below:
>
>
> *1)* In JsonReader.java (java-exec/drill-exec/vector/complex/fn/), at line
> 283 I add the following:
>
>> ...
>> case VALUE_NULL:
>>   // handle null as string
>>   handleString(parser, map, fieldName);
>>   break;
>> ...
>
>
> *2)* Then, at line 415, handleString() becomes:
>
>> private void handleString(JsonParser parser, MapWriter writer, String fieldName)
>>     throws IOException {
>>   try {
>>     // added the following if: when this is called from the VALUE_NULL case,
>>     // the current token is the JSON null, so write the literal string "null"
>>     if (parser.getCurrentToken() == JsonToken.VALUE_NULL) {
>>       writer.varChar(fieldName)
>>           .writeVarChar(0, workingBuffer.prepareVarCharHolder("null"),
>>               workingBuffer.getBuf());
>>     } else {
>>       writer.varChar(fieldName)
>>           .writeVarChar(0, workingBuffer.prepareVarCharHolder(parser.getText()),
>>               workingBuffer.getBuf());
>>     }
>>   } catch (IllegalArgumentException e) {
>>     if (parser.getText() == null || parser.getText().isEmpty()) {
>>       // return;
>>     }
>>     throw e;
>>   }
>> }
>
>
>
> Is this a possible fix for the mentioned bug?
> If so, should I open a pull request?
>
> Thanks.