Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Aditya
 arguments both ways.
> > >> >
> > >> > But the part I think is confusing is that the behavior differs
> > depending
> > >> on
> > >> > whether or not projections and filters are pushed down to the data
> > >> source.
> > >> > If the push down is done, then the behavior is case sensitive
> > >> > (corresponding to the data source).  But if pushdown doesn't happen,
> > then
> > >> > the behavior is case insensitive.  That difference seems
> inconsistent
> > and
> > >> > undesirable -- unless you argue that there are instances where you
> > would
> > >> > want one behavior vs the other.  But it seems like that should be
> > >> > orthogonal and separate from whether pushdowns are applied.
> > >> >
> > >> > -- Zelaine
> > >> >
> > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
> > >> wrote:
> > >> >
> > >> >> Hello all,
> > >> >>
> > >> >> As I understand, Drill by design is case-insensitive, w.r.t column
> > names
> > >> >> within a table or file [1]. While this provides great flexibility
> and
> > >> works
> > >> >> well with many data-sources, there are issues when working with
> > >> >> case-sensitive data-sources such as HBase / MapR-DB.
> > >> >>
> > >> >> Consider the following JSON file:
> > >> >>
> > >> >> {"_id": "ID1",
> > >> >>  *"Name"* : "ABC",
> > >> >>  "Age" : "25",
> > >> >>  "Phone" : null
> > >> >> }
> > >> >> {"_id": "ID2",
> > >> >>  *"name"* : "PQR",
> > >> >>  "Age" : "30",
> > >> >>  "Phone" : "408-123-456"
> > >> >> }
> > >> >> {"_id": "ID3",
> > >> >>  *"NAME"* : "XYZ",
> > >> >>  "Phone" : ""
> > >> >> }
> > >> >>
> > >> >> Note that the case of the name field within the JSON file is of
> > >> mixed-case.
> > >> >>
> > >> >> From Drill, while querying the JSON file directly (or corresponding
> > >> content
> > >> >> in Parquet or Text formats), we get results which we as Drill users
> > have
> > >> >> come to expect:
> > >> >>
> > >> >> > select NAME from mfs.`/tmp/json/a.json`;
> > >> >> +---+
> > >> >> | NAME  |
> > >> >> +---+
> > >> >> | ABC   |
> > >> >> | PQR   |
> > >> >> | XYZ   |
> > >> >> +---+
> > >> >>
> > >> >>
> > >> >> However, while querying a case-sensitive datasource (*with pushdown
> > >> >> enabled*)
> > >> >> the following results are returned. The case provided in the query
> > text
> > >> is
> > >> >> honored and would determine the results. This could come as a
> *slight
> > >> >> surprise to certain Drill users* exploring/migrating to new
> Databases
> > >> >> (using new Storage / Format plugins within Drill)
> > >> >>
> > >> >> > select *Name* from mfs.`/tmp/json/a`;
> > >> >> +---+
> > >> >> | Name  |
> > >> >> +---+
> > >> >> | ABC   |
> > >> >> +---+
> > >> >>
> > >> >> > select *name* from mfs.`/tmp/json/a`;
> > >> >> +---+
> > >> >> | name  |
> > >> >> +---+
> > >> >> | PQR   |
> > >> >> +---+
> > >> >>
> > >> >> > select *NAME* from mfs.`/tmp/json/a`;
> > >> >> +---+
> > >> >> | NAME  |
> > >> >> +---+
> > >> >> | XYZ   |
> > >> >> +---+
> > >> >>
> > >> >>
> > >> >> > select *nAME* from mfs.`/tmp/json/a`;
> > >> >> +---+
> >

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
> > MapR-DB) will be confusing for Drill users who are accustomed to
> > > Drill's
> > > >> > case insensitive behavior.
> > > >> >
> > > >> > I can see arguments both ways.
> > > >> >
> > > >> > But the part I think is confusing is that the behavior differs
> > > depending
> > > >> on
> > > >> > whether or not projections and filters are pushed down to the data
> > > >> source.
> > > >> > If the push down is done, then the behavior is case sensitive
> > > >> > (corresponding to the data source).  But if pushdown doesn't
> happen,
> > > then
> > > >> > the behavior is case insensitive.  That difference seems
> > inconsistent
> > > and
> > > >> > undesirable -- unless you argue that there are instances where you
> > > would
> > > >> > want one behavior vs the other.  But it seems like that should be
> > > >> > orthogonal and separate from whether pushdowns are applied.
> > > >> >
> > > >> > -- Zelaine
> > > >> >
> > > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <
> agir...@mapr.com>
> > > >> wrote:
> > > >> >
> > > >> >> Hello all,
> > > >> >>
> > > >> >> As I understand, Drill by design is case-insensitive, w.r.t
> column
> > > names
> > > >> >> within a table or file [1]. While this provides great flexibility
> > and
> > > >> works
> > > >> >> well with many data-sources, there are issues when working with
> > > >> >> case-sensitive data-sources such as HBase / MapR-DB.
> > > >> >>
> > > >> >> Consider the following JSON file:
> > > >> >>
> > > >> >> {"_id": "ID1",
> > > >> >>  *"Name"* : "ABC",
> > > >> >>  "Age" : "25",
> > > >> >>  "Phone" : null
> > > >> >> }
> > > >> >> {"_id": "ID2",
> > > >> >>  *"name"* : "PQR",
> > > >> >>  "Age" : "30",
> > > >> >>  "Phone" : "408-123-456"
> > > >> >> }
> > > >> >> {"_id": "ID3",
> > > >> >>  *"NAME"* : "XYZ",
> > > >> >>  "Phone" : ""
> > > >> >> }
> > > >> >>
> > > >> >> Note that the case of the name field within the JSON file is of
> > > >> mixed-case.
> > > >> >>
> > > >> >> From Drill, while querying the JSON file directly (or
> corresponding
> > > >> content
> > > >> >> in Parquet or Text formats), we get results which we as Drill
> users
> > > have
> > > >> >> come to expect:
> > > >> >>
> > > >> >> > select NAME from mfs.`/tmp/json/a.json`;
> > > >> >> +---+
> > > >> >> | NAME  |
> > > >> >> +---+
> > > >> >> | ABC   |
> > > >> >> | PQR   |
> > > >> >> | XYZ   |
> > > >> >> +---+
> > > >> >>
> > > >> >>
> > > >> >> However, while querying a case-sensitive datasource (*with
> pushdown
> > > >> >> enabled*)
> > > >> >> the following results are returned. The case provided in the
> query
> > > text
> > > >> is
> > > >> >> honored and would determine the results. This could come as a
> > *slight
> > > >> >> surprise to certain Drill users* exploring/migrating to new
> > Databases
> > > >> >> (using new Storage / Format plugins within Drill)
> > > >> >>
> > > >> >> > select *Name* from mfs.`/tmp/json/a`;
> > > >> >> +---+
> > > >> >> | Name  |
> > > >> >> +---+
> > > >> >> | ABC   |
> > > >> >> +---+
> > > >> >>
> > > >> >> > select

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Neeraja Rentachintala
How is this handled for MongoDB storage plugin, which I believe a case
sensitive DB as well?

On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau  wrote:

> I don't think it is that simple since there are some types of things that
> we can't pushdown that will cause inconsistent results.
>
> For example, assuming that all values of x are positive, the following two
> queries should return the same result
>
> select * from hbase where x = 5
> select * from hbase where abs(x) = 5
>
> However, if the field x is sometimes 'x' and sometimes 'X', we're going to
> different results between the first query and the second. That is why I
> think we need to guarantee that even when optimization rules fails, we have
> the same plan meaning. In essence, all plans should be valid. If you get to
> a place where a rule changes the data, then the original plan was
> effectively invalid.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni  wrote:
>
> > Project pushdown should always happen. If you see project pushdown
> > does not happen for your HBase query, then it's a bug.
> >
> > However, if you submit two physical plans, one with project pushdown,
> > another one without project pushdown, but they return different
> > results for HBase query. I'll not call this a bug.
> >
> >
> >
> > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau 
> > wrote:
> > > Agree with Zelaine, plan changes/optimizations shouldn't change
> results.
> > > This is a bug.
> > >
> > > Drill is focused on being case-insensitive, case-preserving. Each
> storage
> > > plugin implements its own case sensitivity policy when working with
> > > columns/fields and should be documented. It isn't practical to make
> HBase
> > > case-insensitive so it should behave case sensitivity. DFS formats (as
> > > opposed to HBase) are entirely under Drill's control and thus target
> > > case-insensitive, case-preserving operation.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni 
> > wrote:
> > >
> > >> Abhishek
> > >>
> > >> Great question. Here is what I understand regarding the case sensitive
> > >> policy.
> > >>
> > >> Drill's case sensitivity policy (case insensitive and case preserving)
> > >> applies to the execution engine in Drill; it does not enforce the case
> > >> sensitivity policy to all the storage plugin. A storage plugin could
> > >> decide and implement it's own policy.
> > >>
> > >> Why would the pushdown impact the case sensitivity when query HBase?
> > >> Without project pushdown, HBase storage plugin will return all the
> > >> data, and it's up to Drill's execution Project operator to apply the
> > >> case insensitive policy.  With the project pushdown, Drill will pass
> > >> the list of column names to HBase storage plugin, and HBase decides to
> > >> apply it's case sensitivity policy when scan the data.
> > >>
> > >> Adding an option to make case sensitive storage plugin honor case
> > >> insensitive policy seems to be a good idea. The question is whether
> > >> the underneath storage (like HBase) will support such mode.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong 
> > wrote:
> > >> > Abhishek,
> > >> >
> > >> > I guess you're arguing that Drill's current behavior of honoring the
> > case
> > >> > sensitive nature of the underlying data source (in this case, HBase
> > and
> > >> > MapR-DB) will be confusing for Drill users who are accustomed to
> > Drill's
> > >> > case insensitive behavior.
> > >> >
> > >> > I can see arguments both ways.
> > >> >
> > >> > But the part I think is confusing is that the behavior differs
> > depending
> > >> on
> > >> > whether or not projections and filters are pushed down to the data
> > >> source.
> > >> > If the push down is done, then the behavior is case sensitive
> > >> > (corresponding to the data source).  But if pushdown doesn't happen,
> > then
> > >>

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
I don't think it is that simple since there are some types of things that
we can't pushdown that will cause inconsistent results.

For example, assuming that all values of x are positive, the following two
queries should return the same result

select * from hbase where x = 5
select * from hbase where abs(x) = 5

However, if the field x is sometimes 'x' and sometimes 'X', we're going to
different results between the first query and the second. That is why I
think we need to guarantee that even when optimization rules fails, we have
the same plan meaning. In essence, all plans should be valid. If you get to
a place where a rule changes the data, then the original plan was
effectively invalid.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni  wrote:

> Project pushdown should always happen. If you see project pushdown
> does not happen for your HBase query, then it's a bug.
>
> However, if you submit two physical plans, one with project pushdown,
> another one without project pushdown, but they return different
> results for HBase query. I'll not call this a bug.
>
>
>
> On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau 
> wrote:
> > Agree with Zelaine, plan changes/optimizations shouldn't change results.
> > This is a bug.
> >
> > Drill is focused on being case-insensitive, case-preserving. Each storage
> > plugin implements its own case sensitivity policy when working with
> > columns/fields and should be documented. It isn't practical to make HBase
> > case-insensitive so it should behave case sensitivity. DFS formats (as
> > opposed to HBase) are entirely under Drill's control and thus target
> > case-insensitive, case-preserving operation.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni 
> wrote:
> >
> >> Abhishek
> >>
> >> Great question. Here is what I understand regarding the case sensitive
> >> policy.
> >>
> >> Drill's case sensitivity policy (case insensitive and case preserving)
> >> applies to the execution engine in Drill; it does not enforce the case
> >> sensitivity policy to all the storage plugin. A storage plugin could
> >> decide and implement it's own policy.
> >>
> >> Why would the pushdown impact the case sensitivity when query HBase?
> >> Without project pushdown, HBase storage plugin will return all the
> >> data, and it's up to Drill's execution Project operator to apply the
> >> case insensitive policy.  With the project pushdown, Drill will pass
> >> the list of column names to HBase storage plugin, and HBase decides to
> >> apply it's case sensitivity policy when scan the data.
> >>
> >> Adding an option to make case sensitive storage plugin honor case
> >> insensitive policy seems to be a good idea. The question is whether
> >> the underneath storage (like HBase) will support such mode.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong 
> wrote:
> >> > Abhishek,
> >> >
> >> > I guess you're arguing that Drill's current behavior of honoring the
> case
> >> > sensitive nature of the underlying data source (in this case, HBase
> and
> >> > MapR-DB) will be confusing for Drill users who are accustomed to
> Drill's
> >> > case insensitive behavior.
> >> >
> >> > I can see arguments both ways.
> >> >
> >> > But the part I think is confusing is that the behavior differs
> depending
> >> on
> >> > whether or not projections and filters are pushed down to the data
> >> source.
> >> > If the push down is done, then the behavior is case sensitive
> >> > (corresponding to the data source).  But if pushdown doesn't happen,
> then
> >> > the behavior is case insensitive.  That difference seems inconsistent
> and
> >> > undesirable -- unless you argue that there are instances where you
> would
> >> > want one behavior vs the other.  But it seems like that should be
> >> > orthogonal and separate from whether pushdowns are applied.
> >> >
> >> > -- Zelaine
> >> >
> >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
> >> wrote:
> >> >
> >> >> Hello all,
> >> >>
> >> >> As I understand, Drill by design is case-insensitive, w.r.t colum

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jinfeng Ni
Project pushdown should always happen. If you see project pushdown
does not happen for your HBase query, then it's a bug.

However, if you submit two physical plans, one with project pushdown,
another one without project pushdown, but they return different
results for HBase query. I'll not call this a bug.



On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau  wrote:
> Agree with Zelaine, plan changes/optimizations shouldn't change results.
> This is a bug.
>
> Drill is focused on being case-insensitive, case-preserving. Each storage
> plugin implements its own case sensitivity policy when working with
> columns/fields and should be documented. It isn't practical to make HBase
> case-insensitive so it should behave case sensitivity. DFS formats (as
> opposed to HBase) are entirely under Drill's control and thus target
> case-insensitive, case-preserving operation.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni  wrote:
>
>> Abhishek
>>
>> Great question. Here is what I understand regarding the case sensitive
>> policy.
>>
>> Drill's case sensitivity policy (case insensitive and case preserving)
>> applies to the execution engine in Drill; it does not enforce the case
>> sensitivity policy to all the storage plugin. A storage plugin could
>> decide and implement it's own policy.
>>
>> Why would the pushdown impact the case sensitivity when query HBase?
>> Without project pushdown, HBase storage plugin will return all the
>> data, and it's up to Drill's execution Project operator to apply the
>> case insensitive policy.  With the project pushdown, Drill will pass
>> the list of column names to HBase storage plugin, and HBase decides to
>> apply it's case sensitivity policy when scan the data.
>>
>> Adding an option to make case sensitive storage plugin honor case
>> insensitive policy seems to be a good idea. The question is whether
>> the underneath storage (like HBase) will support such mode.
>>
>>
>>
>>
>>
>>
>> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong  wrote:
>> > Abhishek,
>> >
>> > I guess you're arguing that Drill's current behavior of honoring the case
>> > sensitive nature of the underlying data source (in this case, HBase and
>> > MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
>> > case insensitive behavior.
>> >
>> > I can see arguments both ways.
>> >
>> > But the part I think is confusing is that the behavior differs depending
>> on
>> > whether or not projections and filters are pushed down to the data
>> source.
>> > If the push down is done, then the behavior is case sensitive
>> > (corresponding to the data source).  But if pushdown doesn't happen, then
>> > the behavior is case insensitive.  That difference seems inconsistent and
>> > undesirable -- unless you argue that there are instances where you would
>> > want one behavior vs the other.  But it seems like that should be
>> > orthogonal and separate from whether pushdowns are applied.
>> >
>> > -- Zelaine
>> >
>> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
>> wrote:
>> >
>> >> Hello all,
>> >>
>> >> As I understand, Drill by design is case-insensitive, w.r.t column names
>> >> within a table or file [1]. While this provides great flexibility and
>> works
>> >> well with many data-sources, there are issues when working with
>> >> case-sensitive data-sources such as HBase / MapR-DB.
>> >>
>> >> Consider the following JSON file:
>> >>
>> >> {"_id": "ID1",
>> >>  *"Name"* : "ABC",
>> >>  "Age" : "25",
>> >>  "Phone" : null
>> >> }
>> >> {"_id": "ID2",
>> >>  *"name"* : "PQR",
>> >>  "Age" : "30",
>> >>  "Phone" : "408-123-456"
>> >> }
>> >> {"_id": "ID3",
>> >>  *"NAME"* : "XYZ",
>> >>  "Phone" : ""
>> >> }
>> >>
>> >> Note that the case of the name field within the JSON file is of
>> mixed-case.
>> >>
>> >> From Drill, while querying the JSON file directly (or corresponding
>> content
>> >> in Parquet or Text

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
Agree with Zelaine, plan changes/optimizations shouldn't change results.
This is a bug.

Drill is focused on being case-insensitive, case-preserving. Each storage
plugin implements its own case sensitivity policy when working with
columns/fields and should be documented. It isn't practical to make HBase
case-insensitive so it should behave case sensitivity. DFS formats (as
opposed to HBase) are entirely under Drill's control and thus target
case-insensitive, case-preserving operation.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni  wrote:

> Abhishek
>
> Great question. Here is what I understand regarding the case sensitive
> policy.
>
> Drill's case sensitivity policy (case insensitive and case preserving)
> applies to the execution engine in Drill; it does not enforce the case
> sensitivity policy to all the storage plugin. A storage plugin could
> decide and implement it's own policy.
>
> Why would the pushdown impact the case sensitivity when query HBase?
> Without project pushdown, HBase storage plugin will return all the
> data, and it's up to Drill's execution Project operator to apply the
> case insensitive policy.  With the project pushdown, Drill will pass
> the list of column names to HBase storage plugin, and HBase decides to
> apply it's case sensitivity policy when scan the data.
>
> Adding an option to make case sensitive storage plugin honor case
> insensitive policy seems to be a good idea. The question is whether
> the underneath storage (like HBase) will support such mode.
>
>
>
>
>
>
> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong  wrote:
> > Abhishek,
> >
> > I guess you're arguing that Drill's current behavior of honoring the case
> > sensitive nature of the underlying data source (in this case, HBase and
> > MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
> > case insensitive behavior.
> >
> > I can see arguments both ways.
> >
> > But the part I think is confusing is that the behavior differs depending
> on
> > whether or not projections and filters are pushed down to the data
> source.
> > If the push down is done, then the behavior is case sensitive
> > (corresponding to the data source).  But if pushdown doesn't happen, then
> > the behavior is case insensitive.  That difference seems inconsistent and
> > undesirable -- unless you argue that there are instances where you would
> > want one behavior vs the other.  But it seems like that should be
> > orthogonal and separate from whether pushdowns are applied.
> >
> > -- Zelaine
> >
> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
> wrote:
> >
> >> Hello all,
> >>
> >> As I understand, Drill by design is case-insensitive, w.r.t column names
> >> within a table or file [1]. While this provides great flexibility and
> works
> >> well with many data-sources, there are issues when working with
> >> case-sensitive data-sources such as HBase / MapR-DB.
> >>
> >> Consider the following JSON file:
> >>
> >> {"_id": "ID1",
> >>  *"Name"* : "ABC",
> >>  "Age" : "25",
> >>  "Phone" : null
> >> }
> >> {"_id": "ID2",
> >>  *"name"* : "PQR",
> >>  "Age" : "30",
> >>  "Phone" : "408-123-456"
> >> }
> >> {"_id": "ID3",
> >>  *"NAME"* : "XYZ",
> >>  "Phone" : ""
> >> }
> >>
> >> Note that the case of the name field within the JSON file is of
> mixed-case.
> >>
> >> From Drill, while querying the JSON file directly (or corresponding
> content
> >> in Parquet or Text formats), we get results which we as Drill users have
> >> come to expect:
> >>
> >> > select NAME from mfs.`/tmp/json/a.json`;
> >> +---+
> >> | NAME  |
> >> +---+
> >> | ABC   |
> >> | PQR   |
> >> | XYZ   |
> >> +---+
> >>
> >>
> >> However, while querying a case-sensitive datasource (*with pushdown
> >> enabled*)
> >> the following results are returned. The case provided in the query text
> is
> >> honored and would determine the results. This could come as a *slight
> >> surprise to certain Drill users* exploring/migrating to new Databases
> >> (using new Storage / Format

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jinfeng Ni
Abhishek

Great question. Here is what I understand regarding the case sensitive policy.

Drill's case sensitivity policy (case insensitive and case preserving)
applies to the execution engine in Drill; it does not enforce the case
sensitivity policy to all the storage plugin. A storage plugin could
decide and implement it's own policy.

Why would the pushdown impact the case sensitivity when query HBase?
Without project pushdown, HBase storage plugin will return all the
data, and it's up to Drill's execution Project operator to apply the
case insensitive policy.  With the project pushdown, Drill will pass
the list of column names to HBase storage plugin, and HBase decides to
apply it's case sensitivity policy when scan the data.

Adding an option to make case sensitive storage plugin honor case
insensitive policy seems to be a good idea. The question is whether
the underneath storage (like HBase) will support such mode.






On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong  wrote:
> Abhishek,
>
> I guess you're arguing that Drill's current behavior of honoring the case
> sensitive nature of the underlying data source (in this case, HBase and
> MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
> case insensitive behavior.
>
> I can see arguments both ways.
>
> But the part I think is confusing is that the behavior differs depending on
> whether or not projections and filters are pushed down to the data source.
> If the push down is done, then the behavior is case sensitive
> (corresponding to the data source).  But if pushdown doesn't happen, then
> the behavior is case insensitive.  That difference seems inconsistent and
> undesirable -- unless you argue that there are instances where you would
> want one behavior vs the other.  But it seems like that should be
> orthogonal and separate from whether pushdowns are applied.
>
> -- Zelaine
>
> On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish  wrote:
>
>> Hello all,
>>
>> As I understand, Drill by design is case-insensitive, w.r.t column names
>> within a table or file [1]. While this provides great flexibility and works
>> well with many data-sources, there are issues when working with
>> case-sensitive data-sources such as HBase / MapR-DB.
>>
>> Consider the following JSON file:
>>
>> {"_id": "ID1",
>>  *"Name"* : "ABC",
>>  "Age" : "25",
>>  "Phone" : null
>> }
>> {"_id": "ID2",
>>  *"name"* : "PQR",
>>  "Age" : "30",
>>  "Phone" : "408-123-456"
>> }
>> {"_id": "ID3",
>>  *"NAME"* : "XYZ",
>>  "Phone" : ""
>> }
>>
>> Note that the case of the name field within the JSON file is of mixed-case.
>>
>> From Drill, while querying the JSON file directly (or corresponding content
>> in Parquet or Text formats), we get results which we as Drill users have
>> come to expect:
>>
>> > select NAME from mfs.`/tmp/json/a.json`;
>> +---+
>> | NAME  |
>> +---+
>> | ABC   |
>> | PQR   |
>> | XYZ   |
>> +---+
>>
>>
>> However, while querying a case-sensitive datasource (*with pushdown
>> enabled*)
>> the following results are returned. The case provided in the query text is
>> honored and would determine the results. This could come as a *slight
>> surprise to certain Drill users* exploring/migrating to new Databases
>> (using new Storage / Format plugins within Drill)
>>
>> > select *Name* from mfs.`/tmp/json/a`;
>> +---+
>> | Name  |
>> +---+
>> | ABC   |
>> +---+
>>
>> > select *name* from mfs.`/tmp/json/a`;
>> +---+
>> | name  |
>> +---+
>> | PQR   |
>> +---+
>>
>> > select *NAME* from mfs.`/tmp/json/a`;
>> +---+
>> | NAME  |
>> +---+
>> | XYZ   |
>> +---+
>>
>>
>> > select *nAME* from mfs.`/tmp/json/a`;
>> +---+
>> | nAME  |
>> +---+
>> +---+
>> No rows selected
>>
>> There is no easy way to get all matching rows (irrespective of the case of
>> the column name). In the above example, the first row matching the provided
>> case is returned.
>>
>>
>> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
>> +---+++
>> | Name  | name0  | NAME1  |
>> +---+++
>> | ABC   | ABC| ABC|
>> +---+---

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Zelaine Fong
Abhishek,

I guess you're arguing that Drill's current behavior of honoring the case
sensitive nature of the underlying data source (in this case, HBase and
MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
case insensitive behavior.

I can see arguments both ways.

But the part I think is confusing is that the behavior differs depending on
whether or not projections and filters are pushed down to the data source.
If the push down is done, then the behavior is case sensitive
(corresponding to the data source).  But if pushdown doesn't happen, then
the behavior is case insensitive.  That difference seems inconsistent and
undesirable -- unless you argue that there are instances where you would
want one behavior vs the other.  But it seems like that should be
orthogonal and separate from whether pushdowns are applied.

-- Zelaine

On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish  wrote:

> Hello all,
>
> As I understand, Drill by design is case-insensitive, w.r.t column names
> within a table or file [1]. While this provides great flexibility and works
> well with many data-sources, there are issues when working with
> case-sensitive data-sources such as HBase / MapR-DB.
>
> Consider the following JSON file:
>
> {"_id": "ID1",
>  *"Name"* : "ABC",
>  "Age" : "25",
>  "Phone" : null
> }
> {"_id": "ID2",
>  *"name"* : "PQR",
>  "Age" : "30",
>  "Phone" : "408-123-456"
> }
> {"_id": "ID3",
>  *"NAME"* : "XYZ",
>  "Phone" : ""
> }
>
> Note that the case of the name field within the JSON file is of mixed-case.
>
> From Drill, while querying the JSON file directly (or corresponding content
> in Parquet or Text formats), we get results which we as Drill users have
> come to expect:
>
> > select NAME from mfs.`/tmp/json/a.json`;
> +---+
> | NAME  |
> +---+
> | ABC   |
> | PQR   |
> | XYZ   |
> +---+
>
>
> However, while querying a case-sensitive datasource (*with pushdown
> enabled*)
> the following results are returned. The case provided in the query text is
> honored and would determine the results. This could come as a *slight
> surprise to certain Drill users* exploring/migrating to new Databases
> (using new Storage / Format plugins within Drill)
>
> > select *Name* from mfs.`/tmp/json/a`;
> +---+
> | Name  |
> +---+
> | ABC   |
> +---+
>
> > select *name* from mfs.`/tmp/json/a`;
> +---+
> | name  |
> +---+
> | PQR   |
> +---+
>
> > select *NAME* from mfs.`/tmp/json/a`;
> +---+
> | NAME  |
> +---+
> | XYZ   |
> +---+
>
>
> > select *nAME* from mfs.`/tmp/json/a`;
> +---+
> | nAME  |
> +---+
> +---+
> No rows selected
>
> There is no easy way to get all matching rows (irrespective of the case of
> the column name). In the above example, the first row matching the provided
> case is returned.
>
>
> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
> +---+++
> | Name  | name0  | NAME1  |
> +---+++
> | ABC   | ABC| ABC|
> +---+++
>
> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
> +---+++
> | NAME  | Name0  | name1  |
> +---+++
> | XYZ   | XYZ| XYZ|
> +---+++
>
>
> If Pushdown features are disabled, the behavior seen above would indeed
> match JSON files. However, this could come at a cost of not fully utilizing
> the power of the underlying data-source, and could lead to performance
> issues.
>
> *In-consistent Results can happen when:*
>
> (1) Dataset has mixed-cases for fields. Example seen above. While this
> might not be very common, the concerns are still valid*, *since substantial
> Drill users are exploring Drill for ETL cases where Data is not completely
> sanitized.
>
> (2) Data is consistent w.r.t case, but the query text has non-matching
> case. While some could term this as user error, it could still cause issues
> when users, applications or the underlying datasources change.
>
> In both the above cases, Drill would silently perform the query and return
> results which could be either *none, partial, complete/correct or entirely*
> *wrong*.
>
> Some specific questions:
>
> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive Data-sources.
> *For users who prefer the flexibility, how can Drill ensure that the
> underlying data-source can return case-insensi

Working with Case-Sensitive Data-sources

2016-03-14 Thread Abhishek Girish
Hello all,

As I understand, Drill by design is case-insensitive, w.r.t column names
within a table or file [1]. While this provides great flexibility and works
well with many data-sources, there are issues when working with
case-sensitive data-sources such as HBase / MapR-DB.

Consider the following JSON file:

{"_id": "ID1",
 *"Name"* : "ABC",
 "Age" : "25",
 "Phone" : null
}
{"_id": "ID2",
 *"name"* : "PQR",
 "Age" : "30",
 "Phone" : "408-123-456"
}
{"_id": "ID3",
 *"NAME"* : "XYZ",
 "Phone" : ""
}

Note that the case of the name field within the JSON file is of mixed-case.

>From Drill, while querying the JSON file directly (or corresponding content
in Parquet or Text formats), we get results which we as Drill users have
come to expect:

> select NAME from mfs.`/tmp/json/a.json`;
+---+
| NAME  |
+---+
| ABC   |
| PQR   |
| XYZ   |
+---+


However, while querying a case-sensitive datasource (*with pushdown enabled*)
the following results are returned. The case provided in the query text is
honored and would determine the results. This could come as a *slight
surprise to certain Drill users* exploring/migrating to new Databases
(using new Storage / Format plugins within Drill)

> select *Name* from mfs.`/tmp/json/a`;
+---+
| Name  |
+---+
| ABC   |
+---+

> select *name* from mfs.`/tmp/json/a`;
+---+
| name  |
+---+
| PQR   |
+---+

> select *NAME* from mfs.`/tmp/json/a`;
+---+
| NAME  |
+---+
| XYZ   |
+---+


> select *nAME* from mfs.`/tmp/json/a`;
+---+
| nAME  |
+---+
+---+
No rows selected

There is no easy way to get all matching rows (irrespective of the case of
the column name). In the above example, the first row matching the provided
case is returned.


> select *Name, name, NAME* from mfs.`/tmp/json/a`;
+---+++
| Name  | name0  | NAME1  |
+---+++
| ABC   | ABC| ABC|
+---+++

> select *NAME, Name, name* from mfs.`/tmp/json/a`;
+---+++
| NAME  | Name0  | name1  |
+---+++
| XYZ   | XYZ| XYZ|
+---+++


If Pushdown features are disabled, the behavior seen above would indeed
match JSON files. However, this could come at a cost of not fully utilizing
the power of the underlying data-source, and could lead to performance
issues.

*In-consistent Results can happen when:*

(1) Dataset has mixed-cases for fields. Example seen above. While this
might not be very common, the concerns are still valid*, *since substantial
Drill users are exploring Drill for ETL cases where Data is not completely
sanitized.

(2) Data is consistent w.r.t case, but the query text has non-matching
case. While some could term this as user error, it could still cause issues
when users, applications or the underlying datasources change.

In both the above cases, Drill would silently perform the query and return
results which could be either *none, partial, complete/correct or entirely*
*wrong*.

Some specific questions:

(1) *Supporting Case-In-sensitive Behavior for Case-Sensitive Data-sources.
*For users who prefer the flexibility, how can Drill ensure that the
underlying data-source can return case-insensitive results.

(2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY support
case-sensitive behavior for data-sources. Users coming from case-sensitive
databases might want results matching the provided case. Example using the
above data:

> select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
+--+---+++

| _id  | Name  | name   | NAME   |
+--+---+++
| ID1  | ABC   | null   | null   |
+--+---+++
| ID2  | null  | PQR| null   |
+--+---+++
| ID3  | null  | null   | XYZ|
+--+---+++


(3) How does Drill currently work with *MongoDB*, which i guess is a
case-sensitive database? Have these issues ever been discussed previously?


Thanks in advance. I'd appreciate any helpful response.

Regards,
Abhishek


[1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity