Re: Working with Case-Sensitive Data-sources

Jacques Nadeau Mon, 14 Mar 2016 16:54:06 -0700

I believe it also suffers from the same issues.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Mon, Mar 14, 2016 at 4:29 PM, Neeraja Rentachintala <
[email protected]> wrote:

> How is this handled for MongoDB storage plugin, which I believe a case
> sensitive DB as well?
>
> On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > I don't think it is that simple since there are some types of things that
> > we can't pushdown that will cause inconsistent results.
> >
> > For example, assuming that all values of x are positive, the following
> two
> > queries should return the same result
> >
> > select * from hbase where x = 5
> > select * from hbase where abs(x) = 5
> >
> > However, if the field x is sometimes 'x' and sometimes 'X', we're going
> to
> > different results between the first query and the second. That is why I
> > think we need to guarantee that even when optimization rules fails, we
> have
> > the same plan meaning. In essence, all plans should be valid. If you get
> to
> > a place where a rule changes the data, then the original plan was
> > effectively invalid.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <[email protected]>
> wrote:
> >
> > > Project pushdown should always happen. If you see project pushdown
> > > does not happen for your HBase query, then it's a bug.
> > >
> > > However, if you submit two physical plans, one with project pushdown,
> > > another one without project pushdown, but they return different
> > > results for HBase query. I'll not call this a bug.
> > >
> > >
> > >
> > > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > > > Agree with Zelaine, plan changes/optimizations shouldn't change
> > results.
> > > > This is a bug.
> > > >
> > > > Drill is focused on being case-insensitive, case-preserving. Each
> > storage
> > > > plugin implements its own case sensitivity policy when working with
> > > > columns/fields and should be documented. It isn't practical to make
> > HBase
> > > > case-insensitive so it should behave case sensitivity. DFS formats
> (as
> > > > opposed to HBase) are entirely under Drill's control and thus target
> > > > case-insensitive, case-preserving operation.
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]>
> > > wrote:
> > > >
> > > >> Abhishek
> > > >>
> > > >> Great question. Here is what I understand regarding the case
> sensitive
> > > >> policy.
> > > >>
> > > >> Drill's case sensitivity policy (case insensitive and case
> preserving)
> > > >> applies to the execution engine in Drill; it does not enforce the
> case
> > > >> sensitivity policy to all the storage plugin. A storage plugin could
> > > >> decide and implement it's own policy.
> > > >>
> > > >> Why would the pushdown impact the case sensitivity when query HBase?
> > > >> Without project pushdown, HBase storage plugin will return all the
> > > >> data, and it's up to Drill's execution Project operator to apply the
> > > >> case insensitive policy.  With the project pushdown, Drill will pass
> > > >> the list of column names to HBase storage plugin, and HBase decides
> to
> > > >> apply it's case sensitivity policy when scan the data.
> > > >>
> > > >> Adding an option to make case sensitive storage plugin honor case
> > > >> insensitive policy seems to be a good idea. The question is whether
> > > >> the underneath storage (like HBase) will support such mode.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]>
> > > wrote:
> > > >> > Abhishek,
> > > >> >
> > > >> > I guess you're arguing that Drill's current behavior of honoring
> the
> > > case
> > > >> > sensitive nature of the underlying data source (in this case,
> HBase
> > > and
> > > >> > MapR-DB) will be confusing for Drill users who are accustomed to
> > > Drill's
> > > >> > case insensitive behavior.
> > > >> >
> > > >> > I can see arguments both ways.
> > > >> >
> > > >> > But the part I think is confusing is that the behavior differs
> > > depending
> > > >> on
> > > >> > whether or not projections and filters are pushed down to the data
> > > >> source.
> > > >> > If the push down is done, then the behavior is case sensitive
> > > >> > (corresponding to the data source).  But if pushdown doesn't
> happen,
> > > then
> > > >> > the behavior is case insensitive.  That difference seems
> > inconsistent
> > > and
> > > >> > undesirable -- unless you argue that there are instances where you
> > > would
> > > >> > want one behavior vs the other.  But it seems like that should be
> > > >> > orthogonal and separate from whether pushdowns are applied.
> > > >> >
> > > >> > -- Zelaine
> > > >> >
> > > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <
> [email protected]>
> > > >> wrote:
> > > >> >
> > > >> >> Hello all,
> > > >> >>
> > > >> >> As I understand, Drill by design is case-insensitive, w.r.t
> column
> > > names
> > > >> >> within a table or file [1]. While this provides great flexibility
> > and
> > > >> works
> > > >> >> well with many data-sources, there are issues when working with
> > > >> >> case-sensitive data-sources such as HBase / MapR-DB.
> > > >> >>
> > > >> >> Consider the following JSON file:
> > > >> >>
> > > >> >> {"_id": "ID1",
> > > >> >>  *"Name"* : "ABC",
> > > >> >>  "Age" : "25",
> > > >> >>  "Phone" : null
> > > >> >> }
> > > >> >> {"_id": "ID2",
> > > >> >>  *"name"* : "PQR",
> > > >> >>  "Age" : "30",
> > > >> >>  "Phone" : "408-123-456"
> > > >> >> }
> > > >> >> {"_id": "ID3",
> > > >> >>  *"NAME"* : "XYZ",
> > > >> >>  "Phone" : ""
> > > >> >> }
> > > >> >>
> > > >> >> Note that the case of the name field within the JSON file is of
> > > >> mixed-case.
> > > >> >>
> > > >> >> From Drill, while querying the JSON file directly (or
> corresponding
> > > >> content
> > > >> >> in Parquet or Text formats), we get results which we as Drill
> users
> > > have
> > > >> >> come to expect:
> > > >> >>
> > > >> >> > select NAME from mfs.`/tmp/json/a.json`;
> > > >> >> +-------+
> > > >> >> | NAME  |
> > > >> >> +-------+
> > > >> >> | ABC   |
> > > >> >> | PQR   |
> > > >> >> | XYZ   |
> > > >> >> +-------+
> > > >> >>
> > > >> >>
> > > >> >> However, while querying a case-sensitive datasource (*with
> pushdown
> > > >> >> enabled*)
> > > >> >> the following results are returned. The case provided in the
> query
> > > text
> > > >> is
> > > >> >> honored and would determine the results. This could come as a
> > *slight
> > > >> >> surprise to certain Drill users* exploring/migrating to new
> > Databases
> > > >> >> (using new Storage / Format plugins within Drill)
> > > >> >>
> > > >> >> > select *Name* from mfs.`/tmp/json/a`;
> > > >> >> +-------+
> > > >> >> | Name  |
> > > >> >> +-------+
> > > >> >> | ABC   |
> > > >> >> +-------+
> > > >> >>
> > > >> >> > select *name* from mfs.`/tmp/json/a`;
> > > >> >> +-------+
> > > >> >> | name  |
> > > >> >> +-------+
> > > >> >> | PQR   |
> > > >> >> +-------+
> > > >> >>
> > > >> >> > select *NAME* from mfs.`/tmp/json/a`;
> > > >> >> +-------+
> > > >> >> | NAME  |
> > > >> >> +-------+
> > > >> >> | XYZ   |
> > > >> >> +-------+
> > > >> >>
> > > >> >>
> > > >> >> > select *nAME* from mfs.`/tmp/json/a`;
> > > >> >> +-------+
> > > >> >> | nAME  |
> > > >> >> +-------+
> > > >> >> +-------+
> > > >> >> No rows selected
> > > >> >>
> > > >> >> There is no easy way to get all matching rows (irrespective of
> the
> > > case
> > > >> of
> > > >> >> the column name). In the above example, the first row matching
> the
> > > >> provided
> > > >> >> case is returned.
> > > >> >>
> > > >> >>
> > > >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
> > > >> >> +-------+--------+--------+
> > > >> >> | Name  | name0  | NAME1  |
> > > >> >> +-------+--------+--------+
> > > >> >> | ABC   | ABC    | ABC    |
> > > >> >> +-------+--------+--------+
> > > >> >>
> > > >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
> > > >> >> +-------+--------+--------+
> > > >> >> | NAME  | Name0  | name1  |
> > > >> >> +-------+--------+--------+
> > > >> >> | XYZ   | XYZ    | XYZ    |
> > > >> >> +-------+--------+--------+
> > > >> >>
> > > >> >>
> > > >> >> If Pushdown features are disabled, the behavior seen above would
> > > indeed
> > > >> >> match JSON files. However, this could come at a cost of not fully
> > > >> utilizing
> > > >> >> the power of the underlying data-source, and could lead to
> > > performance
> > > >> >> issues.
> > > >> >>
> > > >> >> *In-consistent Results can happen when:*
> > > >> >>
> > > >> >> (1) Dataset has mixed-cases for fields. Example seen above. While
> > > this
> > > >> >> might not be very common, the concerns are still valid*, *since
> > > >> substantial
> > > >> >> Drill users are exploring Drill for ETL cases where Data is not
> > > >> completely
> > > >> >> sanitized.
> > > >> >>
> > > >> >> (2) Data is consistent w.r.t case, but the query text has
> > > non-matching
> > > >> >> case. While some could term this as user error, it could still
> > cause
> > > >> issues
> > > >> >> when users, applications or the underlying datasources change.
> > > >> >>
> > > >> >> In both the above cases, Drill would silently perform the query
> and
> > > >> return
> > > >> >> results which could be either *none, partial, complete/correct or
> > > >> entirely*
> > > >> >> *wrong*.
> > > >> >>
> > > >> >> Some specific questions:
> > > >> >>
> > > >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive
> > > >> Data-sources.
> > > >> >> *For users who prefer the flexibility, how can Drill ensure that
> > the
> > > >> >> underlying data-source can return case-insensitive results.
> > > >> >>
> > > >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill
> OPTIONALLY
> > > >> support
> > > >> >> case-sensitive behavior for data-sources. Users coming from
> > > >> case-sensitive
> > > >> >> databases might want results matching the provided case. Example
> > > using
> > > >> the
> > > >> >> above data:
> > > >> >>
> > > >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
> > > >> >> +------+-------+--------+--------+
> > > >> >>
> > > >> >> | _id  | Name  | name   | NAME   |
> > > >> >> +------+-------+--------+--------+
> > > >> >> | ID1  | ABC   | null   | null   |
> > > >> >> +------+-------+--------+--------+
> > > >> >> | ID2  | null  | PQR    | null   |
> > > >> >> +------+-------+--------+--------+
> > > >> >> | ID3  | null  | null   | XYZ    |
> > > >> >> +------+-------+--------+--------+
> > > >> >>
> > > >> >>
> > > >> >> (3) How does Drill currently work with *MongoDB*, which i guess
> is
> > a
> > > >> >> case-sensitive database? Have these issues ever been discussed
> > > >> previously?
> > > >> >>
> > > >> >>
> > > >> >> Thanks in advance. I'd appreciate any helpful response.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Abhishek
> > > >> >>
> > > >> >>
> > > >> >> [1]
> > > https://drill.apache.org/docs/lexical-structure/#case-sensitivity
> > > >> >>
> > > >>
> > >
> >
>

Re: Working with Case-Sensitive Data-sources

Reply via email to