Hello all,

As I understand, Drill by design is case-insensitive, w.r.t column names
within a table or file [1]. While this provides great flexibility and works
well with many data-sources, there are issues when working with
case-sensitive data-sources such as HBase / MapR-DB.

Consider the following JSON file:

{"_id": "ID1",
 *"Name"* : "ABC",
 "Age" : "25",
 "Phone" : null
}
{"_id": "ID2",
 *"name"* : "PQR",
 "Age" : "30",
 "Phone" : "408-123-456"
}
{"_id": "ID3",
 *"NAME"* : "XYZ",
 "Phone" : ""
}

Note that the case of the name field within the JSON file is of mixed-case.

>From Drill, while querying the JSON file directly (or corresponding content
in Parquet or Text formats), we get results which we as Drill users have
come to expect:

> select NAME from mfs.`/tmp/json/a.json`;
+-------+
| NAME  |
+-------+
| ABC   |
| PQR   |
| XYZ   |
+-------+


However, while querying a case-sensitive datasource (*with pushdown enabled*)
the following results are returned. The case provided in the query text is
honored and would determine the results. This could come as a *slight
surprise to certain Drill users* exploring/migrating to new Databases
(using new Storage / Format plugins within Drill)

> select *Name* from mfs.`/tmp/json/a`;
+-------+
| Name  |
+-------+
| ABC   |
+-------+

> select *name* from mfs.`/tmp/json/a`;
+-------+
| name  |
+-------+
| PQR   |
+-------+

> select *NAME* from mfs.`/tmp/json/a`;
+-------+
| NAME  |
+-------+
| XYZ   |
+-------+


> select *nAME* from mfs.`/tmp/json/a`;
+-------+
| nAME  |
+-------+
+-------+
No rows selected

There is no easy way to get all matching rows (irrespective of the case of
the column name). In the above example, the first row matching the provided
case is returned.


> select *Name, name, NAME* from mfs.`/tmp/json/a`;
+-------+--------+--------+
| Name  | name0  | NAME1  |
+-------+--------+--------+
| ABC   | ABC    | ABC    |
+-------+--------+--------+

> select *NAME, Name, name* from mfs.`/tmp/json/a`;
+-------+--------+--------+
| NAME  | Name0  | name1  |
+-------+--------+--------+
| XYZ   | XYZ    | XYZ    |
+-------+--------+--------+


If Pushdown features are disabled, the behavior seen above would indeed
match JSON files. However, this could come at a cost of not fully utilizing
the power of the underlying data-source, and could lead to performance
issues.

*In-consistent Results can happen when:*

(1) Dataset has mixed-cases for fields. Example seen above. While this
might not be very common, the concerns are still valid*, *since substantial
Drill users are exploring Drill for ETL cases where Data is not completely
sanitized.

(2) Data is consistent w.r.t case, but the query text has non-matching
case. While some could term this as user error, it could still cause issues
when users, applications or the underlying datasources change.

In both the above cases, Drill would silently perform the query and return
results which could be either *none, partial, complete/correct or entirely*
*wrong*.

Some specific questions:

(1) *Supporting Case-In-sensitive Behavior for Case-Sensitive Data-sources.
*For users who prefer the flexibility, how can Drill ensure that the
underlying data-source can return case-insensitive results.

(2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY support
case-sensitive behavior for data-sources. Users coming from case-sensitive
databases might want results matching the provided case. Example using the
above data:

> select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
+------+-------+--------+--------+

| _id  | Name  | name   | NAME   |
+------+-------+--------+--------+
| ID1  | ABC   | null   | null   |
+------+-------+--------+--------+
| ID2  | null  | PQR    | null   |
+------+-------+--------+--------+
| ID3  | null  | null   | XYZ    |
+------+-------+--------+--------+


(3) How does Drill currently work with *MongoDB*, which i guess is a
case-sensitive database? Have these issues ever been discussed previously?


Thanks in advance. I'd appreciate any helpful response.

Regards,
Abhishek


[1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity

Reply via email to