Re: Working with Case-Sensitive Data-sources

Jinfeng Ni Mon, 14 Mar 2016 14:44:22 -0700

Abhishek

Great question. Here is what I understand regarding the case sensitive policy.


Drill's case sensitivity policy (case insensitive and case preserving)
applies to the execution engine in Drill; it does not enforce the case
sensitivity policy to all the storage plugin. A storage plugin could
decide and implement it's own policy.

Why would the pushdown impact the case sensitivity when query HBase?
Without project pushdown, HBase storage plugin will return all the
data, and it's up to Drill's execution Project operator to apply the
case insensitive policy.  With the project pushdown, Drill will pass
the list of column names to HBase storage plugin, and HBase decides to
apply it's case sensitivity policy when scan the data.

Adding an option to make case sensitive storage plugin honor case
insensitive policy seems to be a good idea. The question is whether
the underneath storage (like HBase) will support such mode.






On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> wrote:
> Abhishek,
>
> I guess you're arguing that Drill's current behavior of honoring the case
> sensitive nature of the underlying data source (in this case, HBase and
> MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
> case insensitive behavior.
>
> I can see arguments both ways.
>
> But the part I think is confusing is that the behavior differs depending on
> whether or not projections and filters are pushed down to the data source.
> If the push down is done, then the behavior is case sensitive
> (corresponding to the data source).  But if pushdown doesn't happen, then
> the behavior is case insensitive.  That difference seems inconsistent and
> undesirable -- unless you argue that there are instances where you would
> want one behavior vs the other.  But it seems like that should be
> orthogonal and separate from whether pushdowns are applied.
>
> -- Zelaine
>
> On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]> wrote:
>
>> Hello all,
>>
>> As I understand, Drill by design is case-insensitive, w.r.t column names
>> within a table or file [1]. While this provides great flexibility and works
>> well with many data-sources, there are issues when working with
>> case-sensitive data-sources such as HBase / MapR-DB.
>>
>> Consider the following JSON file:
>>
>> {"_id": "ID1",
>>  *"Name"* : "ABC",
>>  "Age" : "25",
>>  "Phone" : null
>> }
>> {"_id": "ID2",
>>  *"name"* : "PQR",
>>  "Age" : "30",
>>  "Phone" : "408-123-456"
>> }
>> {"_id": "ID3",
>>  *"NAME"* : "XYZ",
>>  "Phone" : ""
>> }
>>
>> Note that the case of the name field within the JSON file is of mixed-case.
>>
>> From Drill, while querying the JSON file directly (or corresponding content
>> in Parquet or Text formats), we get results which we as Drill users have
>> come to expect:
>>
>> > select NAME from mfs.`/tmp/json/a.json`;
>> +-------+
>> | NAME  |
>> +-------+
>> | ABC   |
>> | PQR   |
>> | XYZ   |
>> +-------+
>>
>>
>> However, while querying a case-sensitive datasource (*with pushdown
>> enabled*)
>> the following results are returned. The case provided in the query text is
>> honored and would determine the results. This could come as a *slight
>> surprise to certain Drill users* exploring/migrating to new Databases
>> (using new Storage / Format plugins within Drill)
>>
>> > select *Name* from mfs.`/tmp/json/a`;
>> +-------+
>> | Name  |
>> +-------+
>> | ABC   |
>> +-------+
>>
>> > select *name* from mfs.`/tmp/json/a`;
>> +-------+
>> | name  |
>> +-------+
>> | PQR   |
>> +-------+
>>
>> > select *NAME* from mfs.`/tmp/json/a`;
>> +-------+
>> | NAME  |
>> +-------+
>> | XYZ   |
>> +-------+
>>
>>
>> > select *nAME* from mfs.`/tmp/json/a`;
>> +-------+
>> | nAME  |
>> +-------+
>> +-------+
>> No rows selected
>>
>> There is no easy way to get all matching rows (irrespective of the case of
>> the column name). In the above example, the first row matching the provided
>> case is returned.
>>
>>
>> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
>> +-------+--------+--------+
>> | Name  | name0  | NAME1  |
>> +-------+--------+--------+
>> | ABC   | ABC    | ABC    |
>> +-------+--------+--------+
>>
>> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
>> +-------+--------+--------+
>> | NAME  | Name0  | name1  |
>> +-------+--------+--------+
>> | XYZ   | XYZ    | XYZ    |
>> +-------+--------+--------+
>>
>>
>> If Pushdown features are disabled, the behavior seen above would indeed
>> match JSON files. However, this could come at a cost of not fully utilizing
>> the power of the underlying data-source, and could lead to performance
>> issues.
>>
>> *In-consistent Results can happen when:*
>>
>> (1) Dataset has mixed-cases for fields. Example seen above. While this
>> might not be very common, the concerns are still valid*, *since substantial
>> Drill users are exploring Drill for ETL cases where Data is not completely
>> sanitized.
>>
>> (2) Data is consistent w.r.t case, but the query text has non-matching
>> case. While some could term this as user error, it could still cause issues
>> when users, applications or the underlying datasources change.
>>
>> In both the above cases, Drill would silently perform the query and return
>> results which could be either *none, partial, complete/correct or entirely*
>> *wrong*.
>>
>> Some specific questions:
>>
>> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive Data-sources.
>> *For users who prefer the flexibility, how can Drill ensure that the
>> underlying data-source can return case-insensitive results.
>>
>> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY support
>> case-sensitive behavior for data-sources. Users coming from case-sensitive
>> databases might want results matching the provided case. Example using the
>> above data:
>>
>> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
>> +------+-------+--------+--------+
>>
>> | _id  | Name  | name   | NAME   |
>> +------+-------+--------+--------+
>> | ID1  | ABC   | null   | null   |
>> +------+-------+--------+--------+
>> | ID2  | null  | PQR    | null   |
>> +------+-------+--------+--------+
>> | ID3  | null  | null   | XYZ    |
>> +------+-------+--------+--------+
>>
>>
>> (3) How does Drill currently work with *MongoDB*, which i guess is a
>> case-sensitive database? Have these issues ever been discussed previously?
>>
>>
>> Thanks in advance. I'd appreciate any helpful response.
>>
>> Regards,
>> Abhishek
>>
>>
>> [1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity
>>

Re: Working with Case-Sensitive Data-sources

Reply via email to