How is this handled for MongoDB storage plugin, which I believe a case sensitive DB as well?
On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau <[email protected]> wrote: > I don't think it is that simple since there are some types of things that > we can't pushdown that will cause inconsistent results. > > For example, assuming that all values of x are positive, the following two > queries should return the same result > > select * from hbase where x = 5 > select * from hbase where abs(x) = 5 > > However, if the field x is sometimes 'x' and sometimes 'X', we're going to > different results between the first query and the second. That is why I > think we need to guarantee that even when optimization rules fails, we have > the same plan meaning. In essence, all plans should be valid. If you get to > a place where a rule changes the data, then the original plan was > effectively invalid. > > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <[email protected]> wrote: > > > Project pushdown should always happen. If you see project pushdown > > does not happen for your HBase query, then it's a bug. > > > > However, if you submit two physical plans, one with project pushdown, > > another one without project pushdown, but they return different > > results for HBase query. I'll not call this a bug. > > > > > > > > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]> > > wrote: > > > Agree with Zelaine, plan changes/optimizations shouldn't change > results. > > > This is a bug. > > > > > > Drill is focused on being case-insensitive, case-preserving. Each > storage > > > plugin implements its own case sensitivity policy when working with > > > columns/fields and should be documented. It isn't practical to make > HBase > > > case-insensitive so it should behave case sensitivity. DFS formats (as > > > opposed to HBase) are entirely under Drill's control and thus target > > > case-insensitive, case-preserving operation. > > > > > > -- > > > Jacques Nadeau > > > CTO and Co-Founder, Dremio > > > > > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]> > > wrote: > > > > > >> Abhishek > > >> > > >> Great question. Here is what I understand regarding the case sensitive > > >> policy. > > >> > > >> Drill's case sensitivity policy (case insensitive and case preserving) > > >> applies to the execution engine in Drill; it does not enforce the case > > >> sensitivity policy to all the storage plugin. A storage plugin could > > >> decide and implement it's own policy. > > >> > > >> Why would the pushdown impact the case sensitivity when query HBase? > > >> Without project pushdown, HBase storage plugin will return all the > > >> data, and it's up to Drill's execution Project operator to apply the > > >> case insensitive policy. With the project pushdown, Drill will pass > > >> the list of column names to HBase storage plugin, and HBase decides to > > >> apply it's case sensitivity policy when scan the data. > > >> > > >> Adding an option to make case sensitive storage plugin honor case > > >> insensitive policy seems to be a good idea. The question is whether > > >> the underneath storage (like HBase) will support such mode. > > >> > > >> > > >> > > >> > > >> > > >> > > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> > > wrote: > > >> > Abhishek, > > >> > > > >> > I guess you're arguing that Drill's current behavior of honoring the > > case > > >> > sensitive nature of the underlying data source (in this case, HBase > > and > > >> > MapR-DB) will be confusing for Drill users who are accustomed to > > Drill's > > >> > case insensitive behavior. > > >> > > > >> > I can see arguments both ways. > > >> > > > >> > But the part I think is confusing is that the behavior differs > > depending > > >> on > > >> > whether or not projections and filters are pushed down to the data > > >> source. > > >> > If the push down is done, then the behavior is case sensitive > > >> > (corresponding to the data source). But if pushdown doesn't happen, > > then > > >> > the behavior is case insensitive. That difference seems > inconsistent > > and > > >> > undesirable -- unless you argue that there are instances where you > > would > > >> > want one behavior vs the other. But it seems like that should be > > >> > orthogonal and separate from whether pushdowns are applied. > > >> > > > >> > -- Zelaine > > >> > > > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]> > > >> wrote: > > >> > > > >> >> Hello all, > > >> >> > > >> >> As I understand, Drill by design is case-insensitive, w.r.t column > > names > > >> >> within a table or file [1]. While this provides great flexibility > and > > >> works > > >> >> well with many data-sources, there are issues when working with > > >> >> case-sensitive data-sources such as HBase / MapR-DB. > > >> >> > > >> >> Consider the following JSON file: > > >> >> > > >> >> {"_id": "ID1", > > >> >> *"Name"* : "ABC", > > >> >> "Age" : "25", > > >> >> "Phone" : null > > >> >> } > > >> >> {"_id": "ID2", > > >> >> *"name"* : "PQR", > > >> >> "Age" : "30", > > >> >> "Phone" : "408-123-456" > > >> >> } > > >> >> {"_id": "ID3", > > >> >> *"NAME"* : "XYZ", > > >> >> "Phone" : "" > > >> >> } > > >> >> > > >> >> Note that the case of the name field within the JSON file is of > > >> mixed-case. > > >> >> > > >> >> From Drill, while querying the JSON file directly (or corresponding > > >> content > > >> >> in Parquet or Text formats), we get results which we as Drill users > > have > > >> >> come to expect: > > >> >> > > >> >> > select NAME from mfs.`/tmp/json/a.json`; > > >> >> +-------+ > > >> >> | NAME | > > >> >> +-------+ > > >> >> | ABC | > > >> >> | PQR | > > >> >> | XYZ | > > >> >> +-------+ > > >> >> > > >> >> > > >> >> However, while querying a case-sensitive datasource (*with pushdown > > >> >> enabled*) > > >> >> the following results are returned. The case provided in the query > > text > > >> is > > >> >> honored and would determine the results. This could come as a > *slight > > >> >> surprise to certain Drill users* exploring/migrating to new > Databases > > >> >> (using new Storage / Format plugins within Drill) > > >> >> > > >> >> > select *Name* from mfs.`/tmp/json/a`; > > >> >> +-------+ > > >> >> | Name | > > >> >> +-------+ > > >> >> | ABC | > > >> >> +-------+ > > >> >> > > >> >> > select *name* from mfs.`/tmp/json/a`; > > >> >> +-------+ > > >> >> | name | > > >> >> +-------+ > > >> >> | PQR | > > >> >> +-------+ > > >> >> > > >> >> > select *NAME* from mfs.`/tmp/json/a`; > > >> >> +-------+ > > >> >> | NAME | > > >> >> +-------+ > > >> >> | XYZ | > > >> >> +-------+ > > >> >> > > >> >> > > >> >> > select *nAME* from mfs.`/tmp/json/a`; > > >> >> +-------+ > > >> >> | nAME | > > >> >> +-------+ > > >> >> +-------+ > > >> >> No rows selected > > >> >> > > >> >> There is no easy way to get all matching rows (irrespective of the > > case > > >> of > > >> >> the column name). In the above example, the first row matching the > > >> provided > > >> >> case is returned. > > >> >> > > >> >> > > >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`; > > >> >> +-------+--------+--------+ > > >> >> | Name | name0 | NAME1 | > > >> >> +-------+--------+--------+ > > >> >> | ABC | ABC | ABC | > > >> >> +-------+--------+--------+ > > >> >> > > >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`; > > >> >> +-------+--------+--------+ > > >> >> | NAME | Name0 | name1 | > > >> >> +-------+--------+--------+ > > >> >> | XYZ | XYZ | XYZ | > > >> >> +-------+--------+--------+ > > >> >> > > >> >> > > >> >> If Pushdown features are disabled, the behavior seen above would > > indeed > > >> >> match JSON files. However, this could come at a cost of not fully > > >> utilizing > > >> >> the power of the underlying data-source, and could lead to > > performance > > >> >> issues. > > >> >> > > >> >> *In-consistent Results can happen when:* > > >> >> > > >> >> (1) Dataset has mixed-cases for fields. Example seen above. While > > this > > >> >> might not be very common, the concerns are still valid*, *since > > >> substantial > > >> >> Drill users are exploring Drill for ETL cases where Data is not > > >> completely > > >> >> sanitized. > > >> >> > > >> >> (2) Data is consistent w.r.t case, but the query text has > > non-matching > > >> >> case. While some could term this as user error, it could still > cause > > >> issues > > >> >> when users, applications or the underlying datasources change. > > >> >> > > >> >> In both the above cases, Drill would silently perform the query and > > >> return > > >> >> results which could be either *none, partial, complete/correct or > > >> entirely* > > >> >> *wrong*. > > >> >> > > >> >> Some specific questions: > > >> >> > > >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive > > >> Data-sources. > > >> >> *For users who prefer the flexibility, how can Drill ensure that > the > > >> >> underlying data-source can return case-insensitive results. > > >> >> > > >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY > > >> support > > >> >> case-sensitive behavior for data-sources. Users coming from > > >> case-sensitive > > >> >> databases might want results matching the provided case. Example > > using > > >> the > > >> >> above data: > > >> >> > > >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`; > > >> >> +------+-------+--------+--------+ > > >> >> > > >> >> | _id | Name | name | NAME | > > >> >> +------+-------+--------+--------+ > > >> >> | ID1 | ABC | null | null | > > >> >> +------+-------+--------+--------+ > > >> >> | ID2 | null | PQR | null | > > >> >> +------+-------+--------+--------+ > > >> >> | ID3 | null | null | XYZ | > > >> >> +------+-------+--------+--------+ > > >> >> > > >> >> > > >> >> (3) How does Drill currently work with *MongoDB*, which i guess is > a > > >> >> case-sensitive database? Have these issues ever been discussed > > >> previously? > > >> >> > > >> >> > > >> >> Thanks in advance. I'd appreciate any helpful response. > > >> >> > > >> >> Regards, > > >> >> Abhishek > > >> >> > > >> >> > > >> >> [1] > > https://drill.apache.org/docs/lexical-structure/#case-sensitivity > > >> >> > > >> > > >
