Hanumath Rao Maduri created DRILL-7113:
------------------------------------------

             Summary: Issue with filtering null values from MapRDB-JSON
                 Key: DRILL-7113
                 URL: https://issues.apache.org/jira/browse/DRILL-7113
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 1.15.0
            Reporter: Hanumath Rao Maduri
            Assignee: Aman Sinha
             Fix For: 1.16.0, 1.17.0


When the Drill is querying documents from MapRDBJSON that contain fields with 
null value, it returns the wrong result.
 The issue is locally reproduced.

Please find the repro steps:
 [1] Create a MaprDBJSON table. Say '/tmp/dmdb2/'.

[2] Insert the following sample records to table:
{code:java}
insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", 
"confidence": 0.24}'
insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}'
insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", 
"confidence": 0.54}'
insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", 
"confidence": null}'
{code}
We can see that for field 'confidence' document 1 has value 0.24, document 3 
has value 0.54, document 2 does not have the field and document 4 has the field 
with value null.

[3] Query the table from DRILL.
 *Query 1:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person   | 0.24        |
| person2  | null        |
| person3  | 0.54        |
| person4  | null        |
+----------+-------------+
4 rows selected (0.2 seconds)

{code}
*Query 2:*
{code:java}
0: jdbc:drill:> select * from dfs.tmp.dmdb2;
+------+-------------+----------+
| _id  | confidence  |  label   |
+------+-------------+----------+
| 1    | 0.24        | person   |
| 2    | null        | person2  |
| 3    | 0.54        | person3  |
| 4    | null        | person4  |
+------+-------------+----------+
4 rows selected (0.174 seconds)

{code}
*Query 3:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence is 
not null;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person   | 0.24        |
| person3  | 0.54        |
| person4  | null        |
+----------+-------------+
3 rows selected (0.192 seconds)

{code}
*Query 4:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence is  
null;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person2  | null        |
+----------+-------------+
1 row selected (0.262 seconds)

{code}
As you can see, Query 3 which queries for all documents with confidence value 
'is not null', returns a document with null value.

*Other observation:*
 Querying the same data using DRILL without MapRDB provides the correct result.
 For example, create 4 different JSON files with following data:

{"label": "person", "confidence": 0.24} \{"label": "person2"} \{"label": 
"person3", "confidence": 0.54} \{"label": "person4", "confidence": null}

Query it directly using DRILL:

*Query 5:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.t2;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person4  | null        |
| person3  | 0.54        |
| person2  | null        |
| person   | 0.24        |
+----------+-------------+
4 rows selected (0.203 seconds)

{code}
*Query 6:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is 
null;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person4  | null        |
| person2  | null        |
+----------+-------------+
2 rows selected (0.352 seconds)

{code}
*Query 7:*
{code:java}
0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is not 
null;
+----------+-------------+
|  label   | confidence  |
+----------+-------------+
| person3  | 0.54        |
| person   | 0.24        |
+----------+-------------+
2 rows selected (0.265 seconds)

{code}
As seen in query 6 & 7, it returns the correct result.

I believe the issue is at the MapRDB layer where it is fetching the results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to