I believe it also suffers from the same issues. -- Jacques Nadeau CTO and Co-Founder, Dremio
On Mon, Mar 14, 2016 at 4:29 PM, Neeraja Rentachintala < [email protected]> wrote: > How is this handled for MongoDB storage plugin, which I believe a case > sensitive DB as well? > > On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau <[email protected]> > wrote: > > > I don't think it is that simple since there are some types of things that > > we can't pushdown that will cause inconsistent results. > > > > For example, assuming that all values of x are positive, the following > two > > queries should return the same result > > > > select * from hbase where x = 5 > > select * from hbase where abs(x) = 5 > > > > However, if the field x is sometimes 'x' and sometimes 'X', we're going > to > > different results between the first query and the second. That is why I > > think we need to guarantee that even when optimization rules fails, we > have > > the same plan meaning. In essence, all plans should be valid. If you get > to > > a place where a rule changes the data, then the original plan was > > effectively invalid. > > > > -- > > Jacques Nadeau > > CTO and Co-Founder, Dremio > > > > On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <[email protected]> > wrote: > > > > > Project pushdown should always happen. If you see project pushdown > > > does not happen for your HBase query, then it's a bug. > > > > > > However, if you submit two physical plans, one with project pushdown, > > > another one without project pushdown, but they return different > > > results for HBase query. I'll not call this a bug. > > > > > > > > > > > > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]> > > > wrote: > > > > Agree with Zelaine, plan changes/optimizations shouldn't change > > results. > > > > This is a bug. > > > > > > > > Drill is focused on being case-insensitive, case-preserving. Each > > storage > > > > plugin implements its own case sensitivity policy when working with > > > > columns/fields and should be documented. It isn't practical to make > > HBase > > > > case-insensitive so it should behave case sensitivity. DFS formats > (as > > > > opposed to HBase) are entirely under Drill's control and thus target > > > > case-insensitive, case-preserving operation. > > > > > > > > -- > > > > Jacques Nadeau > > > > CTO and Co-Founder, Dremio > > > > > > > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]> > > > wrote: > > > > > > > >> Abhishek > > > >> > > > >> Great question. Here is what I understand regarding the case > sensitive > > > >> policy. > > > >> > > > >> Drill's case sensitivity policy (case insensitive and case > preserving) > > > >> applies to the execution engine in Drill; it does not enforce the > case > > > >> sensitivity policy to all the storage plugin. A storage plugin could > > > >> decide and implement it's own policy. > > > >> > > > >> Why would the pushdown impact the case sensitivity when query HBase? > > > >> Without project pushdown, HBase storage plugin will return all the > > > >> data, and it's up to Drill's execution Project operator to apply the > > > >> case insensitive policy. With the project pushdown, Drill will pass > > > >> the list of column names to HBase storage plugin, and HBase decides > to > > > >> apply it's case sensitivity policy when scan the data. > > > >> > > > >> Adding an option to make case sensitive storage plugin honor case > > > >> insensitive policy seems to be a good idea. The question is whether > > > >> the underneath storage (like HBase) will support such mode. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> > > > wrote: > > > >> > Abhishek, > > > >> > > > > >> > I guess you're arguing that Drill's current behavior of honoring > the > > > case > > > >> > sensitive nature of the underlying data source (in this case, > HBase > > > and > > > >> > MapR-DB) will be confusing for Drill users who are accustomed to > > > Drill's > > > >> > case insensitive behavior. > > > >> > > > > >> > I can see arguments both ways. > > > >> > > > > >> > But the part I think is confusing is that the behavior differs > > > depending > > > >> on > > > >> > whether or not projections and filters are pushed down to the data > > > >> source. > > > >> > If the push down is done, then the behavior is case sensitive > > > >> > (corresponding to the data source). But if pushdown doesn't > happen, > > > then > > > >> > the behavior is case insensitive. That difference seems > > inconsistent > > > and > > > >> > undesirable -- unless you argue that there are instances where you > > > would > > > >> > want one behavior vs the other. But it seems like that should be > > > >> > orthogonal and separate from whether pushdowns are applied. > > > >> > > > > >> > -- Zelaine > > > >> > > > > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish < > [email protected]> > > > >> wrote: > > > >> > > > > >> >> Hello all, > > > >> >> > > > >> >> As I understand, Drill by design is case-insensitive, w.r.t > column > > > names > > > >> >> within a table or file [1]. While this provides great flexibility > > and > > > >> works > > > >> >> well with many data-sources, there are issues when working with > > > >> >> case-sensitive data-sources such as HBase / MapR-DB. > > > >> >> > > > >> >> Consider the following JSON file: > > > >> >> > > > >> >> {"_id": "ID1", > > > >> >> *"Name"* : "ABC", > > > >> >> "Age" : "25", > > > >> >> "Phone" : null > > > >> >> } > > > >> >> {"_id": "ID2", > > > >> >> *"name"* : "PQR", > > > >> >> "Age" : "30", > > > >> >> "Phone" : "408-123-456" > > > >> >> } > > > >> >> {"_id": "ID3", > > > >> >> *"NAME"* : "XYZ", > > > >> >> "Phone" : "" > > > >> >> } > > > >> >> > > > >> >> Note that the case of the name field within the JSON file is of > > > >> mixed-case. > > > >> >> > > > >> >> From Drill, while querying the JSON file directly (or > corresponding > > > >> content > > > >> >> in Parquet or Text formats), we get results which we as Drill > users > > > have > > > >> >> come to expect: > > > >> >> > > > >> >> > select NAME from mfs.`/tmp/json/a.json`; > > > >> >> +-------+ > > > >> >> | NAME | > > > >> >> +-------+ > > > >> >> | ABC | > > > >> >> | PQR | > > > >> >> | XYZ | > > > >> >> +-------+ > > > >> >> > > > >> >> > > > >> >> However, while querying a case-sensitive datasource (*with > pushdown > > > >> >> enabled*) > > > >> >> the following results are returned. The case provided in the > query > > > text > > > >> is > > > >> >> honored and would determine the results. This could come as a > > *slight > > > >> >> surprise to certain Drill users* exploring/migrating to new > > Databases > > > >> >> (using new Storage / Format plugins within Drill) > > > >> >> > > > >> >> > select *Name* from mfs.`/tmp/json/a`; > > > >> >> +-------+ > > > >> >> | Name | > > > >> >> +-------+ > > > >> >> | ABC | > > > >> >> +-------+ > > > >> >> > > > >> >> > select *name* from mfs.`/tmp/json/a`; > > > >> >> +-------+ > > > >> >> | name | > > > >> >> +-------+ > > > >> >> | PQR | > > > >> >> +-------+ > > > >> >> > > > >> >> > select *NAME* from mfs.`/tmp/json/a`; > > > >> >> +-------+ > > > >> >> | NAME | > > > >> >> +-------+ > > > >> >> | XYZ | > > > >> >> +-------+ > > > >> >> > > > >> >> > > > >> >> > select *nAME* from mfs.`/tmp/json/a`; > > > >> >> +-------+ > > > >> >> | nAME | > > > >> >> +-------+ > > > >> >> +-------+ > > > >> >> No rows selected > > > >> >> > > > >> >> There is no easy way to get all matching rows (irrespective of > the > > > case > > > >> of > > > >> >> the column name). In the above example, the first row matching > the > > > >> provided > > > >> >> case is returned. > > > >> >> > > > >> >> > > > >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`; > > > >> >> +-------+--------+--------+ > > > >> >> | Name | name0 | NAME1 | > > > >> >> +-------+--------+--------+ > > > >> >> | ABC | ABC | ABC | > > > >> >> +-------+--------+--------+ > > > >> >> > > > >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`; > > > >> >> +-------+--------+--------+ > > > >> >> | NAME | Name0 | name1 | > > > >> >> +-------+--------+--------+ > > > >> >> | XYZ | XYZ | XYZ | > > > >> >> +-------+--------+--------+ > > > >> >> > > > >> >> > > > >> >> If Pushdown features are disabled, the behavior seen above would > > > indeed > > > >> >> match JSON files. However, this could come at a cost of not fully > > > >> utilizing > > > >> >> the power of the underlying data-source, and could lead to > > > performance > > > >> >> issues. > > > >> >> > > > >> >> *In-consistent Results can happen when:* > > > >> >> > > > >> >> (1) Dataset has mixed-cases for fields. Example seen above. While > > > this > > > >> >> might not be very common, the concerns are still valid*, *since > > > >> substantial > > > >> >> Drill users are exploring Drill for ETL cases where Data is not > > > >> completely > > > >> >> sanitized. > > > >> >> > > > >> >> (2) Data is consistent w.r.t case, but the query text has > > > non-matching > > > >> >> case. While some could term this as user error, it could still > > cause > > > >> issues > > > >> >> when users, applications or the underlying datasources change. > > > >> >> > > > >> >> In both the above cases, Drill would silently perform the query > and > > > >> return > > > >> >> results which could be either *none, partial, complete/correct or > > > >> entirely* > > > >> >> *wrong*. > > > >> >> > > > >> >> Some specific questions: > > > >> >> > > > >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive > > > >> Data-sources. > > > >> >> *For users who prefer the flexibility, how can Drill ensure that > > the > > > >> >> underlying data-source can return case-insensitive results. > > > >> >> > > > >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill > OPTIONALLY > > > >> support > > > >> >> case-sensitive behavior for data-sources. Users coming from > > > >> case-sensitive > > > >> >> databases might want results matching the provided case. Example > > > using > > > >> the > > > >> >> above data: > > > >> >> > > > >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`; > > > >> >> +------+-------+--------+--------+ > > > >> >> > > > >> >> | _id | Name | name | NAME | > > > >> >> +------+-------+--------+--------+ > > > >> >> | ID1 | ABC | null | null | > > > >> >> +------+-------+--------+--------+ > > > >> >> | ID2 | null | PQR | null | > > > >> >> +------+-------+--------+--------+ > > > >> >> | ID3 | null | null | XYZ | > > > >> >> +------+-------+--------+--------+ > > > >> >> > > > >> >> > > > >> >> (3) How does Drill currently work with *MongoDB*, which i guess > is > > a > > > >> >> case-sensitive database? Have these issues ever been discussed > > > >> previously? > > > >> >> > > > >> >> > > > >> >> Thanks in advance. I'd appreciate any helpful response. > > > >> >> > > > >> >> Regards, > > > >> >> Abhishek > > > >> >> > > > >> >> > > > >> >> [1] > > > https://drill.apache.org/docs/lexical-structure/#case-sensitivity > > > >> >> > > > >> > > > > > >
