date:20150423

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Steven Phillips

Allow me to summarize my previous response:

+1 adding filename
+0 dir array
-1 for not including in *

On Thu, Apr 23, 2015 at 7:28 PM, Tomer Shiran  wrote:

> +1 to adding the filename (needed this last week, I had .json
> files and wanted to join with another table)
> +1 to using an array dirs[]
> +1 to not having it in * (but would "select dirs, *" work?)

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Tomer Shiran

+1 to adding the filename (needed this last week, I had .json files 
and wanted to join with another table)
+1 to using an array dirs[]
+1 to not having it in * (but would "select dirs, *" work?)



> On Apr 23, 2015, at 7:00 PM, Steven Phillips  wrote:
> 
> What you are showing for the current behavior seems wrong to me:
> 
> $ tree mytdir
> mytdir
> └── mysdir
>     └── myFile.json
> 
> $ cat mytdir/mysdir/myFile.json
> {a:1,b:2,c:3}
> {a:4,b:5,c:6}
> 
> 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | 1  | 2  | 3  |
> | 4  | 5  | 6  |
> +----+----+----+
> 2 rows selected (0.274 seconds)
> 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | 1  | 2  | 3  |
> | 4  | 5  | 6  |
> +----+----+----+
> 2 rows selected (0.152 seconds)
> 0: jdbc:drill:> select * from `/mytdir/mysdir`;
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | 1  | 2  | 3  |
> | 4  | 5  | 6  |
> +----+----+----+
> 2 rows selected (0.157 seconds)
> 0: jdbc:drill:> select * from `mytdir`;
> +--------+----+----+----+
> |  dir0  | a  | b  | c  |
> +--------+----+----+----+
> | mysdir | 1  | 2  | 3  |
> | mysdir | 4  | 5  | 6  |
> +--------+----+----+----+
> 
> I don't know why, in your example, you are getting a dir0 column when
> selecting a specific file. These directories should only be included when
> the specified table is a directory which contains subdirectories. Any query
> to a specific file or to a directory that only contains regular files
> should not return dir* columns.
> I think this is the correct behavior.
> 
> The fact that `mytdir` and `mytdir/mysdir` have different columns is not a
> problem, because they are different tables.
> 
> I do think Daniel's idea of adding the file name as well makes sense. I'm
> also open to Ted's idea of returning a dir array instead of individual
> columns.
> 
> On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde  wrote:
> 
>>> Ted wrote:
>>> 
>>> For one thing, I can make a really slow version of [find] !
>> 
>> Why does it have to be slow? Seriously, so many of the tools we use
>> daily have quasi-query facilities (find, git log, du, ps, netstat) and
>> we cobble together queries using complex options and pipelines of unix
>> commands. Relational algebra is potentially MORE efficient.
>> 
>> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
>> and wish I could write ' ... order by count(*) desc'.
>> 
>>> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde  wrote:
>>> +1 to returning directories as context. Very useful feature. Could be
>>> used to return context for other adapters (e.g. an adapter that
>>> concatenates all versions of versioned logfiles).
>>> 
>>> +1 making dir an array, per Ted's suggestion
>>> 
>>> I think dir should not appear in *; thus you'd have to write
>>> 
>>>  select dir, * from `/mytdir/mysdir/myfile.json`
>>> 
>>> This behavior is analogous to Oracle's ROWID. It is not a column as
>>> such, but a system function that you can apply to a row.
>>> 
>>> You need to allow qualifiers:
>>> 
>>>  select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
>>> x, `/mytdir/mysdir/myfile2.json` as y
>>> 
>>> and
>>> 
>>>  select dir from `/mytdir/mysdir/myfile.json` as x,
>>> `/mytdir/mysdir/myfile2.json` as y
>>> 
>>> would be illegal because dir is ambiguous.
>>> 
>>> You should make dir a reserved word (like ROWID).
>>> 
>>> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning wrote:
>>>> Great point.
>>>>
>>>> Having the file name itself is very handy.
>>>>
>>>>
>>>> For one thing, I can make a really slow version of [find] !
>>>>
>>>> (seriously, I would love this)
>>>>
>>>>
>>>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
>>>> challapallira...@gmail.com> wrote:
>>>>
>>>>> I am also of the opinion that we should not assume knowledge on the
>>>>> user front for data discovery. So we should either have 'dir' columns in
>>>>> 'select *' or support a variation that Ted suggested.
>>>>> Also the folder names complement the actual data in some cases.
>>>>>
>>>>> - Rahul
>>>>>
>>>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay wrote:
>>>>>
>>>>>> Regarding the use case in which the user stores information in
>>>>>> pathnames:
>>>>>>
>>>>>> Since Drill supports that use case partially, shouldn't

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Steven Phillips

What you are showing for the current behavior seems wrong to me:

$ tree mytdir
mytdir
└── mysdir
    └── myFile.json

$ cat mytdir/mysdir/myFile.json
{a:1,b:2,c:3}
{a:4,b:5,c:6}

0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
+----+----+----+
| a  | b  | c  |
+----+----+----+
| 1  | 2  | 3  |
| 4  | 5  | 6  |
+----+----+----+
2 rows selected (0.274 seconds)
0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
+----+----+----+
| a  | b  | c  |
+----+----+----+
| 1  | 2  | 3  |
| 4  | 5  | 6  |
+----+----+----+
2 rows selected (0.152 seconds)
0: jdbc:drill:> select * from `/mytdir/mysdir`;
+----+----+----+
| a  | b  | c  |
+----+----+----+
| 1  | 2  | 3  |
| 4  | 5  | 6  |
+----+----+----+
2 rows selected (0.157 seconds)
0: jdbc:drill:> select * from `mytdir`;
+--------+----+----+----+
|  dir0  | a  | b  | c  |
+--------+----+----+----+
| mysdir | 1  | 2  | 3  |
| mysdir | 4  | 5  | 6  |
+--------+----+----+----+

I don't know why, in your example, you are getting a dir0 column when
selecting a specific file. These directories should only be included when
the specified table is a directory which contains subdirectories. Any query
to a specific file or to a directory that only contains regular files
should not return dir* columns.
I think this is the correct behavior.
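
To make that rule concrete, here is a minimal Python sketch of it. This is
not Drill's code: the helper names dir_columns and dir_array are made up
for illustration, and the sketch assumes dirN is simply the Nth directory
segment between the queried table path and the file being read.

```python
from pathlib import PurePosixPath


def dir_columns(table_root: str, file_path: str) -> dict:
    """Return hypothetical dirN columns for one file under table_root."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    # Every segment except the last one (the file name) is a directory level.
    return {f"dir{i}": seg for i, seg in enumerate(rel.parts[:-1])}


def dir_array(table_root: str, file_path: str) -> list:
    """Same information as a single array, as in Ted's suggestion."""
    return list(dir_columns(table_root, file_path).values())


FILE = "mytdir/mysdir/myFile.json"
print(dir_columns("mytdir", FILE))         # table is a directory with a subdirectory
print(dir_columns("mytdir/mysdir", FILE))  # table is a directory of plain files
print(dir_columns(FILE, FILE))             # table is the file itself
```

Under these assumptions only the first call yields a dir0 entry; the other
two yield no dir columns at all, which matches the output shown above.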

The fact that `mytdir` and `mytdir/mysdir` have different columns is not a
problem, because they are different tables.

I do think Daniel's idea of adding the file name as well makes sense. I'm
also open to Ted's idea of returning a dir array instead of individual
columns.

On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde  wrote:

> > Ted wrote:
> >
> > For one thing, I can make a really slow version of [find] !
>
> Why does it have to be slow? Seriously, so many of the tools we use
> daily have quasi-query facilities (find, git log, du, ps, netstat) and
> we cobble together queries using complex options and pipelines of unix
> commands. Relational algebra is potentially MORE efficient.
>
> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
> and wish I could write ' ... order by count(*) desc'.
>
> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde  wrote:
> > +1 to returning directories as context. Very useful feature. Could be
> > used to return context for other adapters (e.g. an adapter that
> > concatenates all versions of versioned logfiles).
> >
> > +1 making dir an array, per Ted's suggestion
> >
> > I think dir should not appear in *; thus you'd have to write
> >
> >   select dir, * from `/mytdir/mysdir/myfile.json`
> >
> > This behavior is analogous to Oracle's ROWID. It is not a column as
> > such, but a system function that you can apply to a row.
> >
> > You need to allow qualifiers:
> >
> >   select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
> > x, `/mytdir/mysdir/myfile2.json` as y
> >
> > and
> >
> >   select dir from `/mytdir/mysdir/myfile.json` as x,
> > `/mytdir/mysdir/myfile2.json` as y
> >
> > would be illegal because dir is ambiguous.
> >
> > You should make dir a reserved word (like ROWID).
> >
> > On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning 
> wrote:
> >> Great point.
> >>
> >> Having the file name itself is very handy.
> >>
> >>
> >> For one thing, I can make a really slow version of [find] !
> >>
> >> (seriously, I would love this)
> >>
> >>
> >> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
> >> challapallira...@gmail.com> wrote:
> >>
> >>> I am also under the opinion that we should not assume knowledge on the
> user
> >>> front for data discovery. So we should either have 'dir' columns in
> 'select
> >>> *' or support a variation that Ted suggested.
> >>> Also the folder names complement the actual data in some cases.
> >>>
> >>> - Rahul
> >>>
> >>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay  >
> >>> wrote:
> >>>
> >>> > Regarding the use case in which the user stores information in
> pathnames:
> >>> >
> >>> > Since Drill supports that use case partially, shouldn't it do so more
> >>> > completely?  In particular, since Drill provides access to subtree
> >>> > pathname segments before the last one (the segments for directories),
> >>> > should Drill provide access to the last one too (the simple file
> name)?
> >>> >
> >>> >
> >>> > We support reading cases like this:
> >>> > - root/
> >>> > - root/2015/
> >>> > - root/2015/01/
> >>> > - root/2015/01/01/
> >>> > - ro

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Julian Hyde

> Ted wrote:
>
> For one thing, I can make a really slow version of [find] !

Why does it have to be slow? Seriously, so many of the tools we use
daily have quasi-query facilities (find, git log, du, ps, netstat) and
we cobble together queries using complex options and pipelines of unix
commands. Relational algebra is potentially MORE efficient.

I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
and wish I could write ' ... order by count(*) desc'.
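
As a rough illustration of that wish, here is a small Python sketch that
uses the standard sqlite3 module as a stand-in SQL engine (not Drill). The
table name `words` and the sample values are invented; the point is only
that GROUP BY plus ORDER BY count(*) DESC gives the same ranking as
'sort | uniq -c | sort -nr'.

```python
import sqlite3
from collections import Counter

# Sample input: one value per line, as it would arrive on a Unix pipeline.
lines = ["GET", "POST", "GET", "PUT", "GET", "POST"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in lines])

# SQL counterpart of: ... | sort | uniq -c | sort -nr
sql_counts = conn.execute(
    "SELECT word, count(*) FROM words "
    "GROUP BY word ORDER BY count(*) DESC, word"
).fetchall()

# The same tally computed in plain Python, for comparison.
py_counts = Counter(lines).most_common()

print(sql_counts)
print(py_counts)
```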

On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde  wrote:
> +1 to returning directories as context. Very useful feature. Could be
> used to return context for other adapters (e.g. an adapter that
> concatenates all versions of versioned logfiles).
>
> +1 making dir an array, per Ted's suggestion
>
> I think dir should not appear in *; thus you'd have to write
>
>   select dir, * from `/mytdir/mysdir/myfile.json`
>
> This behavior is analogous to Oracle's ROWID. It is not a column as
> such, but a system function that you can apply to a row.
>
> You need to allow qualifiers:
>
>   select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
> x, `/mytdir/mysdir/myfile2.json` as y
>
> and
>
>   select dir from `/mytdir/mysdir/myfile.json` as x,
> `/mytdir/mysdir/myfile2.json` as y
>
> would be illegal because dir is ambiguous.
>
> You should make dir a reserved word (like ROWID).
>
> On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning  wrote:
>> Great point.
>>
>> Having the file name itself is very handy.
>>
>>
>> For one thing, I can make a really slow version of [find] !
>>
>> (seriously, I would love this)
>>
>>
>> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
>> challapallira...@gmail.com> wrote:
>>
>>> I am also under the opinion that we should not assume knowledge on the user
>>> front for data discovery. So we should either have 'dir' columns in 'select
>>> *' or support a variation that Ted suggested.
>>> Also the folder names complement the actual data in some cases.
>>>
>>> - Rahul
>>>
>>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
>>> wrote:
>>>
>>> > Regarding the use case in which the user stores information in pathnames:
>>> >
>>> > Since Drill supports that use case partially, shouldn't it do so more
>>> > completely?  In particular, since Drill provides access to subtree
>>> > pathname segments before the last one (the segments for directories),
>>> > should Drill provide access to the last one too (the simple file name)?
>>> >
>>> >
>>> > We support reading cases like this:
>>> > - root/
>>> > - root/2015/
>>> > - root/2015/01/
>>> > - root/2015/01/01/
>>> > - root/2015/01/01/log.json
>>> > - root/2015/02/
>>> > - root/2015/02/02/
>>> > - root/2015/02/02/log.json
>>> >
>>> > In particular, querying "select ... from `root` ..." includes the
>>> > date-portion segments of the pathnames in the dir0, etc, columns.
>>> >
>>> > Note that the user might not redundantly store the dates inside the
>>> > files themselves, since the dates are known to exist in the directory
>>> > names.
>>> >
>>> >
>>> > However, we don't support this variation of that case, right?:
>>> >
>>> > - root/
>>> > - root/2015
>>> > - root/2015/01/
>>> > - root/2015/01/log_01.json
>>> > - root/2015/02/
>>> > - root/2015/02/log_02.json
>>> >
>>> > In particular, Drill includes several segments of the pathname after
>>> > the root of the subtree, but does not include the last segment--which
>>> > contains data just as the segments that _are_ included do.
>>> >
>>> > (Yes, the last segment usually contains artifacts besides the contained
>>> > data (e.g., the file extension) and the user would have to specify how
>>> > to interpret the file simple name segment as data, but the user has to
>>> > specify the interpretation for the other segments anyway.)
>>> >
>>> >
>>> > Daniel
>>> >
>>> >
>>> >
>>> > Ted Dunning wrote:
>>> >
>>> >> I would propose that dir be an array that contains all of the
>>> directories
>>> >> rather than having multiple values.
>>> >>
>>> >> The multiple names are particularly inconvenient if files are at
>>> >> different depths.
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
>>> >> wrote:
>>> >>
>>> >>  I'm specifically arguing that SELECT * doesn't return the columns.
>>> >>>
>>> >>> Here is current behavior:
>>> >>>
>>> >>> /mytdir/mysdir/myfile.json
>>> >>> {a:1,b:2,c:3}
>>> >>> {a:4,b:5,c:6}
>>> >>>
>>> >>> select * from `myfile.json`
>>> >>>
>>> >>> a, b, c
>>> >>> 1, 2, 3
>>> >>> 4, 5, 6
>>> >>>
>>> >>> select * from `/mysdir/myfile.json`
>>> >>>
>>> >>> dir0 a, b, c
>>> >>> mysdir, 1, 2, 3
>>> >>> mysdir, 4, 5, 6
>>> >>>
>>> >>> select * from `/mytdir/mysdir/myfile.json`
>>> >>>
>>> >>> dir0, dir1 a, b, c
>>> >>> mytdir, mysdir, 1, 2, 3
>>> >>> mytdir, mysdir, 4, 5, 6
>>> >>>
>>> >>>
>>> >>> 
>>> >>> My proposal:
>>> >>>
>>> >>> select * from `myfile.json`
>>> >>> select * from `/mysdir/myfile.json`
>>> >>> select * from `/mytdir/mysdir/myfile.json`
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Julian Hyde

+1 to returning directories as context. Very useful feature. Could be
used to return context for other adapters (e.g. an adapter that
concatenates all versions of versioned logfiles).

+1 making dir an array, per Ted's suggestion

I think dir should not appear in *; thus you'd have to write

  select dir, * from `/mytdir/mysdir/myfile.json`

This behavior is analogous to Oracle's ROWID. It is not a column as
such, but a system function that you can apply to a row.

You need to allow qualifiers:

  select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
x, `/mytdir/mysdir/myfile2.json` as y

and

  select dir from `/mytdir/mysdir/myfile.json` as x,
`/mytdir/mysdir/myfile2.json` as y

would be illegal because dir is ambiguous.

You should make dir a reserved word (like ROWID).
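
For anyone who wants to try the ROWID-style behavior, SQLite's rowid
behaves much like the proposal and can be exercised from Python's standard
sqlite3 module. This is only an analogy: it shows SQLite, not Drill or
Oracle, and the table `t` and its contents are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (4, 5)])

# rowid is not part of `*` ...
star = conn.execute("SELECT * FROM t ORDER BY rowid").fetchall()

# ... but it can be requested explicitly next to `*`.
explicit = conn.execute("SELECT rowid, * FROM t ORDER BY rowid").fetchall()

# With two tables in scope, qualified references work.
qualified = conn.execute(
    "SELECT x.rowid, y.rowid FROM t AS x, t AS y ORDER BY x.rowid, y.rowid"
).fetchall()

# An unqualified rowid with two tables in scope is rejected by SQLite.
try:
    conn.execute("SELECT rowid FROM t AS x, t AS y").fetchall()
    unqualified_rejected = False
except sqlite3.OperationalError:
    unqualified_rejected = True

print(star, explicit, qualified, unqualified_rejected)
```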

On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning  wrote:
> Great point.
>
> Having the file name itself is very handy.
>
>
> For one thing, I can make a really slow version of [find] !
>
> (seriously, I would love this)
>
>
> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
>> I am also under the opinion that we should not assume knowledge on the user
>> front for data discovery. So we should either have 'dir' columns in 'select
>> *' or support a variation that Ted suggested.
>> Also the folder names complement the actual data in some cases.
>>
>> - Rahul
>>
>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
>> wrote:
>>
>> > Regarding the use case in which the user stores information in pathnames:
>> >
>> > Since Drill supports that use case partially, shouldn't it do so more
>> > completely?  In particular, since Drill provides access to subtree
>> > pathname segments before the last one (the segments for directories),
>> > should Drill provide access to the last one too (the simple file name)?
>> >
>> >
>> > We support reading cases like this:
>> > - root/
>> > - root/2015/
>> > - root/2015/01/
>> > - root/2015/01/01/
>> > - root/2015/01/01/log.json
>> > - root/2015/02/
>> > - root/2015/02/02/
>> > - root/2015/02/02/log.json
>> >
>> > In particular, querying "select ... from `root` ..." includes the
>> > date-portion segments of the pathnames in the dir0, etc, columns.
>> >
>> > Note that the user might not redundantly store the dates inside the
>> > files themselves, since the dates are known to exist in the directory
>> > names.
>> >
>> >
>> > However, we don't support this variation of that case, right?:
>> >
>> > - root/
>> > - root/2015
>> > - root/2015/01/
>> > - root/2015/01/log_01.json
>> > - root/2015/02/
>> > - root/2015/02/log_02.json
>> >
>> > In particular, Drill includes several segments of the pathname after
>> > the root of the subtree, but does not include the last segment--which
>> > contains data just as the segments that _are_ included do.
>> >
>> > (Yes, the last segment usually contains artifacts besides the contained
>> > data (e.g., the file extension) and the user would have to specify how
>> > to interpret the file simple name segment as data, but the user has to
>> > specify the interpretation for the other segments anyway.)
>> >
>> >
>> > Daniel
>> >
>> >
>> >
>> > Ted Dunning wrote:
>> >
>> >> I would propose that dir be an array that contains all of the
>> directories
>> >> rather than having multiple values.
>> >>
>> >> The multiple names are particularly inconvenient if files are at
>> >> different depths.
>> >>
>> >>
>> >>
>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
>> >> wrote:
>> >>
>> >>  I'm specifically arguing that SELECT * doesn't return the columns.
>> >>>
>> >>> Here is current behavior:
>> >>>
>> >>> /mytdir/mysdir/myfile.json
>> >>> {a:1,b:2,c:3}
>> >>> {a:4,b:5,c:6}
>> >>>
>> >>> select * from `myfile.json`
>> >>>
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select * from `/mysdir/myfile.json`
>> >>>
>> >>> dir0 a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0, dir1 a, b, c
>> >>> mytdir, mysdir, 1, 2, 3
>> >>> mytdir, mysdir, 4, 5, 6
>> >>>
>> >>>
>> >>> 
>> >>> My proposal:
>> >>>
>> >>> select * from `myfile.json`
>> >>> select * from `/mysdir/myfile.json`
>> >>> select * from `/mytdir/mysdir/myfile.json`
>> >>> ::all produce::
>> >>> a, b, c
>> >>> 1, 2, 3
>> >>> 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mysdir/myfile.json`
>> >>>
>> >>> dir0 a, b, c
>> >>> mysdir, 1, 2, 3
>> >>> mysdir, 4, 5, 6
>> >>>
>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>> >>>
>> >>> dir0 a, b, c
>> >>> mytdir, 1, 2, 3
>> >>> mytdir, 4, 5, 6
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha 
>> wrote:
>> >>>
>> >>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
>> 
>>  On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
>>  wrote:
>> 
>>   Hey guys,
>> >
>> > I've been thinking that always showing dir# columns

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Jinfeng Ni

To Parth's question,

1) SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'

dir[i] would not be returned.

2) SELECT dir0, * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'
dir0 would be returned.

3) SELECT dir, * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'
the array of dir would be returned.

If the user does not explicitly ask for those special fields (dir), why do we
always include them in the result by default?  What if the user does not want
to have those fields? Is there an easy way to allow the user to express the
semantics that they do not want those fields?

To me, it makes more sense that * means the regular fields in the
file/table, and dir are special fields which are included in the result
only when user explicitly asks for them.
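
The three cases above can be sketched as a toy projection in Python. The
function name project, the select-list representation, and the sample row
are all hypothetical; this only models the proposed semantics and says
nothing about how Drill would implement them.

```python
def project(select_list, record, dirs):
    """Toy projection: `*` expands to the file's own fields only."""
    out = {}
    for item in select_list:
        if item == "*":
            # Regular fields from the file; no dir columns are added.
            out.update(record)
        elif item == "dir":
            # The whole directory path as one array.
            out["dir"] = list(dirs)
        elif item.startswith("dir") and item[3:].isdigit():
            # A single directory level, null (None) if the path is shallower.
            idx = int(item[3:])
            out[item] = dirs[idx] if idx < len(dirs) else None
        else:
            out[item] = record.get(item)
    return out


row = {"a": 1, "b": 2, "c": 3}
dirs = ["2015-04-03", "subdir"]  # assumed path segments for one file

print(project(["*"], row, dirs))          # case 1: no dir fields
print(project(["dir0", "*"], row, dirs))  # case 2: dir0 plus regular fields
print(project(["dir", "*"], row, dirs))   # case 3: dir array plus regular fields
```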




On Thu, Apr 23, 2015 at 5:01 PM, Parth Chandra 
wrote:

> A common use case (as Daniel's example pointed out) is to arrange data in
> directories by date and look for the newest date.
>
> Something like this:
>
> Directory structure -
>
>   2015-04-01/subdir/data.json
>   2015-04-02/subdir/data.json
>   2015-04-03/subdir/data.json
>   .
>   .
>
> Then query for the latest data available
>
> SELECT * FROM `*/subdir/data.json` WHERE `dir0` IN (SELECT MAX(`dir0`) FROM
> `*/subdir` )
>
> or even -
>
> SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'
>
>
> Would dir[i] be returned in this query?
>
>
>
>
>
>
> On Thu, Apr 23, 2015 at 4:48 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > I am also under the opinion that we should not assume knowledge on the
> user
> > front for data discovery. So we should either have 'dir' columns in
> 'select
> > *' or support a variation that Ted suggested.
> > Also the folder names complement the actual data in some cases.
> >
> > - Rahul
> >
> > On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
> > wrote:
> >
> > > Regarding the use case in which the user stores information in
> pathnames:
> > >
> > > Since Drill supports that use case partially, shouldn't it do so more
> > > completely?  In particular, since Drill provides access to subtree
> > > pathname segments before the last one (the segments for directories),
> > > should Drill provide access to the last one too (the simple file name)?
> > >
> > >
> > > We support reading cases like this:
> > > - root/
> > > - root/2015/
> > > - root/2015/01/
> > > - root/2015/01/01/
> > > - root/2015/01/01/log.json
> > > - root/2015/02/
> > > - root/2015/02/02/
> > > - root/2015/02/02/log.json
> > >
> > > In particular, querying "select ... from `root` ..." includes the
> > > date-portion segments of the pathnames in the dir0, etc, columns.
> > >
> > > Note that the user might not redundantly store the dates inside the
> > > files themselves, since the dates are known to exist in the directory
> > > names.
> > >
> > >
> > > However, we don't support this variation of that case, right?:
> > >
> > > - root/
> > > - root/2015
> > > - root/2015/01/
> > > - root/2015/01/log_01.json
> > > - root/2015/02/
> > > - root/2015/02/log_02.json
> > >
> > > In particular, Drill includes several segments of the pathname after
> > > the root of the subtree, but does not include the last segment--which
> > > contains data just as the segments that _are_ included do.
> > >
> > > (Yes, the last segment usually contains artifacts besides the contained
> > > data (e.g., the file extension) and the user would have to specify how
> > > to interpret the file simple name segment as data, but the user has to
> > > specify the interpretation for the other segments anyway.)
> > >
> > >
> > > Daniel
> > >
> > >
> > >
> > > Ted Dunning wrote:
> > >
> > >> I would propose that dir be an array that contains all of the
> > directories
> > >> rather than having multiple values.
> > >>
> > >> The multiple names are particularly inconvenient if files are at
> > >> different depths.
> > >>
> > >>
> > >>
> > >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
> > >> wrote:
> > >>
> > >>  I'm specifically arguing that SELECT * doesn't return the columns.
> > >>>
> > >>> Here is current behavior:
> > >>>
> > >>> /mytdir/mysdir/myfile.json
> > >>> {a:1,b:2,c:3}
> > >>> {a:4,b:5,c:6}
> > >>>
> > >>> select * from `myfile.json`
> > >>>
> > >>> a, b, c
> > >>> 1, 2, 3
> > >>> 4, 5, 6
> > >>>
> > >>> select * from `/mysdir/myfile.json`
> > >>>
> > >>> dir0 a, b, c
> > >>> mysdir, 1, 2, 3
> > >>> mysdir, 4, 5, 6
> > >>>
> > >>> select * from `/mytdir/mysdir/myfile.json`
> > >>>
> > >>> dir0, dir1 a, b, c
> > >>> mytdir, mysdir, 1, 2, 3
> > >>> mytdir, mysdir, 4, 5, 6
> > >>>
> > >>>
> > >>> 
> > >>> My proposal:
> > >>>
> > >>> select * from `myfile.json`
> > >>> select * from `/mysdir/myfile.json`
> > >>> select * from `/mytdir/mysdir/myfile.json`
> > >>> ::all produce::
> > >>> a, b, c
> > >>> 1, 2, 3
> > >>> 4, 5, 6
> > >>>
> > >>> select dir0, a, b, c from `/mysdir/myfile.json`
> > >>>
> > >>> dir0 a, b, c
> > >>> mysdir, 1, 2, 3
> > >>> mysd

[jira] [Created] (DRILL-2868) Drill returning incorrect data when we have fields missing in some of the files

2015-04-23 Thread Rahul Challapalli (JIRA)

Rahul Challapalli created DRILL-2868:


 Summary: Drill returning incorrect data when we have fields missing in some of the files
 Key: DRILL-2868
 URL: https://issues.apache.org/jira/browse/DRILL-2868
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators, Storage - JSON, Storage - Parquet
Reporter: Rahul Challapalli
Assignee: Hanifi Gunes
Priority: Critical


git.commit.id.abbrev=5cd36c5

Data File1 : a.json
{code}
{ "c1" : 1, "m1" : {"m2" : {"m3" : {"c2" : 5} } } }
{ "c1" : 2, "m1" : {"m2" : {"m3" : {"c2" : 6} } } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File2 : b.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File3 : c.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

The below query reports incorrect data :
{code}
select t.m1.m2.m3 from `delme_repro` as `t`;
+------------+
|   EXPR$0   |
+------------+
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
+------------+
9 rows selected (0.139 seconds)
{code}

However, if I run the same query on the specific file, I get the correct output:
{code}
select t.m1.m2.m3 from `delme_repro/a.json` as `t`;
+------------+
|   EXPR$0   |
+------------+
| {"c2":5}   |
| {"c2":6}   |
| {}         |
+------------+
3 rows selected (0.113 seconds)
{code}

It looks like the file size plays a part in deciding the order in which Drill 
reads the files. But there could be more to this than just the order because 
when I made sure that 'b.json' and 'c.json' only had one record, Drill 
correctly reported the data.

Let me know if you have any questions
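
For reference, here is a small Python sketch of the result one would
presumably expect when all three files are read together. It is not Drill
code: the helper names get_path and select_path are hypothetical, the file
contents are copied from above, and a missing m3 is shown as None here,
whereas Drill printed {} for that row in the single-file query.

```python
import json

A_JSON = """\
{ "c1" : 1, "m1" : {"m2" : {"m3" : {"c2" : 5} } } }
{ "c1" : 2, "m1" : {"m2" : {"m3" : {"c2" : 6} } } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
"""
# b.json and c.json have identical contents in the report.
B_JSON = """\
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
"""
C_JSON = B_JSON


def get_path(record, path):
    """Walk nested maps; return None as soon as a key is missing."""
    cur = record
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def select_path(text, path):
    """Evaluate the path against every JSON record in a newline-delimited text."""
    return [get_path(json.loads(line), path)
            for line in text.splitlines() if line.strip()]


# Rough equivalent of: select t.m1.m2.m3 from `delme_repro` as `t`
combined = select_path(A_JSON + B_JSON + C_JSON, ["m1", "m2", "m3"])
print(combined)
```

Under these assumptions the first two of the nine rows are non-null, which
is what the directory-level query above fails to show.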



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (DRILL-2867) Session level parameter drill.exec.testing.controls appears to be set if though it was not

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2867:
---

 Summary: Session level parameter drill.exec.testing.controls appears to be set if though it was not
 Key: DRILL-2867
 URL: https://issues.apache.org/jira/browse/DRILL-2867
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Victoria Markman
Assignee: Sudheesh Katkam


{code}
0: jdbc:drill:schema=dfs> select * from sys.options where type like '%SESSION%';
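+-----------------------------+--------+---------+---------+------------+----------+-----------+
| name                        | kind   | type    | num_val | string_val | bool_val | float_val |
+-----------------------------+--------+---------+---------+------------+----------+-----------+
| drill.exec.testing.controls | STRING | SESSION | null    | {}         | null     | null      |
+-----------------------------+--------+---------+---------+------------+----------+-----------+
1 row selected (0.218 seconds)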
{code}




Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Ted Dunning

Great point.

Having the file name itself is very handy.


For one thing, I can make a really slow version of [find] !

(seriously, I would love this)


On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> I am also under the opinion that we should not assume knowledge on the user
> front for data discovery. So we should either have 'dir' columns in 'select
> *' or support a variation that Ted suggested.
> Also the folder names complement the actual data in some cases.
>
> - Rahul
>
> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
> wrote:
>
> > Regarding the use case in which the user stores information in pathnames:
> >
> > Since Drill supports that use case partially, shouldn't it do so more
> > completely?  In particular, since Drill provides access to subtree
> > pathname segments before the last one (the segments for directories),
> > should Drill provide access to the last one too (the simple file name)?
> >
> >
> > We support reading cases like this:
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/01/
> > - root/2015/01/01/log.json
> > - root/2015/02/
> > - root/2015/02/02/
> > - root/2015/02/02/log.json
> >
> > In particular, querying "select ... from `root` ..." includes the
> > date-portion segments of the pathnames in the dir0, etc, columns.
> >
> > Note that the user might not redundantly store the dates inside the
> > files themselves, since the dates are known to exist in the directory
> > names.
> >
> >
> > However, we don't support this variation of that case, right?:
> >
> > - root/
> > - root/2015
> > - root/2015/01/
> > - root/2015/01/log_01.json
> > - root/2015/02/
> > - root/2015/02/log_02.json
> >
> > In particular, Drill includes several segments of the pathname after
> > the root of the subtree, but does not include the last segment--which
> > contains data just as the segments that _are_ included do.
> >
> > (Yes, the last segment usually contains artifacts besides the contained
> > data (e.g., the file extension) and the user would have to specify how
> > to interpret the file simple name segment as data, but the user has to
> > specify the interpretation for the other segments anyway.)
> >
> >
> > Daniel
> >
> >
> >
> > Ted Dunning wrote:
> >
> >> I would propose that dir be an array that contains all of the
> >> directories rather than having multiple values.
> >>
> >> The multiple names are particularly inconvenient if files are at
> >> different depths.
> >>
> >>
> >>
> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
> >> wrote:
> >>
> >>  I'm specifically arguing that SELECT * doesn't return the columns.
> >>>
> >>> Here is current behavior:
> >>>
> >>> /mytdir/mysdir/myfile.json
> >>> {a:1,b:2,c:3}
> >>> {a:4,b:5,c:6}
> >>>
> >>> select * from `myfile.json`
> >>>
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select * from `/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, dir1 a, b, c
> >>> mytdir, mysdir, 1, 2, 3
> >>> mytdir, mysdir, 4, 5, 6
> >>>
> >>>
> >>> 
> >>> My proposal:
> >>>
> >>> select * from `myfile.json`
> >>> select * from `/mysdir/myfile.json`
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> ::all produce::
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mytdir, 1, 2, 3
> >>> mytdir, 4, 5, 6
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
> >>>
> >>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
> >>>>
> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:
> >>>>
> >>>>> Hey guys,
> >>>>>
> >>>>> I've been thinking that always showing dir# columns seems to alter
> >>>>> data returned from Drill depending on how you select the directory.
> >>>>> I'd propose that we make it so that we only return dir# columns when
> >>>>> they are explicitly requested.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Parth Chandra

A common use case (as Daniel's example pointed out) is to arrange data in
directories by date and look for the newest date.

Something like this:

Directory structure -

  2015-04-01/subdir/data.json
  2015-04-02/subdir/data.json
  2015-04-03/subdir/data.json
  .
  .

Then query for the latest data available

SELECT * FROM `*/subdir/data.json` WHERE `dir0` IN (SELECT MAX(`dir0`) FROM
`*/subdir` )

or even -

SELECT * FROM `*/subdir/data.json` WHERE `dir0` = '2015-04-03'


Would dir[i] be returned in this query?
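
To make the "newest date directory" use case concrete, here is a minimal Python sketch of what the MAX(`dir0`) subquery above computes. It is only a model of the idea, not Drill code: the function names latest_dir0 and files_for_latest are hypothetical, and it assumes the first path segment is an ISO-style date (YYYY-MM-DD), so that string ordering matches date ordering.

```python
from pathlib import PurePosixPath


def latest_dir0(file_paths):
    """Return the greatest first path segment, like SELECT MAX(dir0).

    Assumes the segment is a YYYY-MM-DD date, so lexicographic order
    is the same as chronological order.
    """
    return max(PurePosixPath(p).parts[0] for p in file_paths)


def files_for_latest(file_paths):
    """Keep only the files whose first segment equals the newest date."""
    newest = latest_dir0(file_paths)
    return [p for p in file_paths if PurePosixPath(p).parts[0] == newest]


paths = [
    "2015-04-01/subdir/data.json",
    "2015-04-02/subdir/data.json",
    "2015-04-03/subdir/data.json",
]
print(files_for_latest(paths))
```

The point of the sketch is that the filter depends on dir0 even when the caller only wants the file's own columns, which is why the question of whether dir0 shows up in the result matters.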






On Thu, Apr 23, 2015 at 4:48 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> I am also of the opinion that we should not assume knowledge on the user
> front for data discovery. So we should either have 'dir' columns in 'select
> *' or support a variation that Ted suggested.
> Also the folder names complement the actual data in some cases.
>
> - Rahul
>
> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
> wrote:
>
> > Regarding the use case in which the user stores information in pathnames:
> >
> > Since Drill supports that use case partially, shouldn't it do so more
> > completely?  In particular, since Drill provides access to subtree
> > pathname segments before the last one (the segments for directories),
> > should Drill provide access to the last one too (the simple file name)?
> >
> >
> > We support reading cases like this:
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/01/
> > - root/2015/01/01/log.json
> > - root/2015/02/
> > - root/2015/02/02/
> > - root/2015/02/02/log.json
> >
> > In particular, querying "select ... from `root` ..." includes the
> > date-portion segments of the pathnames in the dir0, etc, columns.
> >
> > Note that the user might not redundantly store the dates inside the
> > files themselves, since the dates are known to exist in the directory
> > names.
> >
> >
> > However, we don't support this variation of that case, right?:
> >
> > - root/
> > - root/2015
> > - root/2015/01/
> > - root/2015/01/log_01.json
> > - root/2015/02/
> > - root/2015/02/log_02.json
> >
> > In particular, Drill includes several segments of the pathname after
> > the root of the subtree, but does not include the last segment--which
> > contains data just as the segments that _are_ included do.
> >
> > (Yes, the last segment usually contains artifacts besides the contained
> > data (e.g., the file extension) and the user would have to specify how
> > to interpret the file simple name segment as data, but the user has to
> > specify the interpretation for the other segments anyway.)
> >
> >
> > Daniel
> >
> >
> >
> > Ted Dunning wrote:
> >
> >> I would propose that dir be an array that contains all of the
> >> directories rather than having multiple values.
> >>
> >> The multiple names are particularly inconvenient if files are at
> >> different depths.
> >>
> >>
> >>
> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
> >> wrote:
> >>
> >>  I'm specifically arguing that SELECT * doesn't return the columns.
> >>>
> >>> Here is current behavior:
> >>>
> >>> /mytdir/mysdir/myfile.json
> >>> {a:1,b:2,c:3}
> >>> {a:4,b:5,c:6}
> >>>
> >>> select * from `myfile.json`
> >>>
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select * from `/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, dir1 a, b, c
> >>> mytdir, mysdir, 1, 2, 3
> >>> mytdir, mysdir, 4, 5, 6
> >>>
> >>>
> >>> 
> >>> My proposal:
> >>>
> >>> select * from `myfile.json`
> >>> select * from `/mysdir/myfile.json`
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> ::all produce::
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0 a, b, c
> >>> mytdir, 1, 2, 3
> >>> mytdir, 4, 5, 6
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
> >>>
> >>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
> >>>>
> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:
> >>>>
> >>>>> Hey guys,
> >>>>>
> >>>>> I've been thinking that always showing dir# columns seems to alter
> >>>>> data returned from Drill depending on how you select the directory.
> >>>>> I'd propose that we make it so that we only return dir# columns when
> >>>>> they are explicitly requested.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread rahul challapalli

I am also of the opinion that we should not assume knowledge on the user
front for data discovery. So we should either have 'dir' columns in 'select
*' or support a variation that Ted suggested.
Also the folder names complement the actual data in some cases.

- Rahul

On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay 
wrote:

> Regarding the use case in which the user stores information in pathnames:
>
> Since Drill supports that use case partially, shouldn't it do so more
> completely?  In particular, since Drill provides access to subtree
> pathname segments before the last one (the segments for directories),
> should Drill provide access to the last one too (the simple file name)?
>
>
> We support reading cases like this:
> - root/
> - root/2015/
> - root/2015/01/
> - root/2015/01/01/
> - root/2015/01/01/log.json
> - root/2015/02/
> - root/2015/02/02/
> - root/2015/02/02/log.json
>
> In particular, querying "select ... from `root` ..." includes the
> date-portion segments of the pathnames in the dir0, etc, columns.
>
> Note that the user might not redundantly store the dates inside the
> files themselves, since the dates are known to exist in the directory
> names.
>
>
> However, we don't support this variation of that case, right?:
>
> - root/
> - root/2015
> - root/2015/01/
> - root/2015/01/log_01.json
> - root/2015/02/
> - root/2015/02/log_02.json
>
> In particular, Drill includes several segments of the pathname after
> the root of the subtree, but does not include the last segment--which
> contains data just as the segments that _are_ included do.
>
> (Yes, the last segment usually contains artifacts besides the contained
> data (e.g., the file extension) and the user would have to specify how
> to interpret the file simple name segment as data, but the user has to
> specify the interpretation for the other segments anyway.)
>
>
> Daniel
>
>
>
> Ted Dunning wrote:
>
>> I would propose that dir be an array that contains all of the directories
>> rather than having multiple values.
>>
>> The multiple names are particularly inconvenient if files are at
>> different depths.
>>
>>
>>
>> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
>> wrote:
>>
>>  I'm specifically arguing that SELECT * doesn't return the columns.
>>>
>>> Here is current behavior:
>>>
>>> /mytdir/mysdir/myfile.json
>>> {a:1,b:2,c:3}
>>> {a:4,b:5,c:6}
>>>
>>> select * from `myfile.json`
>>>
>>> a, b, c
>>> 1, 2, 3
>>> 4, 5, 6
>>>
>>> select * from `/mysdir/myfile.json`
>>>
>>> dir0 a, b, c
>>> mysdir, 1, 2, 3
>>> mysdir, 4, 5, 6
>>>
>>> select * from `/mytdir/mysdir/myfile.json`
>>>
>>> dir0, dir1 a, b, c
>>> mytdir, mysdir, 1, 2, 3
>>> mytdir, mysdir, 4, 5, 6
>>>
>>>
>>> 
>>> My proposal:
>>>
>>> select * from `myfile.json`
>>> select * from `/mysdir/myfile.json`
>>> select * from `/mytdir/mysdir/myfile.json`
>>> ::all produce::
>>> a, b, c
>>> 1, 2, 3
>>> 4, 5, 6
>>>
>>> select dir0, a, b, c from `/mysdir/myfile.json`
>>>
>>> dir0 a, b, c
>>> mysdir, 1, 2, 3
>>> mysdir, 4, 5, 6
>>>
>>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>>>
>>> dir0 a, b, c
>>> mytdir, 1, 2, 3
>>> mytdir, 4, 5, 6
>>>
>>>
>>>
>>>
>>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
>>>
>>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
>>>>
>>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> I've been thinking that always showing dir# columns seems to alter data
>>>>> returned from Drill depending on how you select the directory.  I'd propose
>>>>> that we make it so that we only return dir# columns when they are
>>>>> explicitly requested.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>
>>>
>>
>
> --
> Daniel Barclay
> MapR Technologies
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Daniel Barclay


Regarding the use case in which the user stores information in pathnames:

Since Drill supports that use case partially, shouldn't it do so more
completely?  In particular, since Drill provides access to subtree
pathname segments before the last one (the segments for directories),
should Drill provide access to the last one too (the simple file name)?


We support reading cases like this:
- root/
- root/2015/
- root/2015/01/
- root/2015/01/01/
- root/2015/01/01/log.json
- root/2015/02/
- root/2015/02/02/
- root/2015/02/02/log.json

In particular, querying "select ... from `root` ..." includes the
date-portion segments of the pathnames in the dir0, etc, columns.

Note that the user might not redundantly store the dates inside the
files themselves, since the dates are known to exist in the directory
names.


However, we don't support this variation of that case, right?:

- root/
- root/2015
- root/2015/01/
- root/2015/01/log_01.json
- root/2015/02/
- root/2015/02/log_02.json

In particular, Drill includes several segments of the pathname after
the root of the subtree, but does not include the last segment--which
contains data just as the segments that _are_ included do.

(Yes, the last segment usually contains artifacts besides the contained
data (e.g., the file extension) and the user would have to specify how
to interpret the file simple name segment as data, but the user has to
specify the interpretation for the other segments anyway.)
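
As a rough illustration of the point above, the following Python sketch splits a path below the table root into its directory segments plus the simple file name, and then interprets the file name as data. It is not Drill's implementation; the names path_fields and day_from_filename, and the log_NN.json naming pattern, are assumptions taken from the example layout.

```python
import re
from pathlib import PurePosixPath


def path_fields(table_root, file_path):
    """Split the path below the table root into directories and file name."""
    rel = PurePosixPath(file_path).relative_to(table_root)
    return {"dirs": list(rel.parts[:-1]), "filename": rel.name}


def day_from_filename(filename):
    """Interpret a name like 'log_01.json' as the day '01'.

    The pattern is the user's own convention (hypothetical here), just as
    the meaning of each directory segment is.
    """
    match = re.fullmatch(r"log_(\d{2})\.json", filename)
    return match.group(1) if match else None


fields = path_fields("root", "root/2015/01/log_01.json")
print(fields["dirs"], fields["filename"], day_from_filename(fields["filename"]))
```

The file extension and prefix are the "artifacts" mentioned above: the user strips them with an explicit rule, in the same way a rule is needed to read the directory segments as year and month.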


Daniel


Ted Dunning wrote:

> I would propose that dir be an array that contains all of the directories
> rather than having multiple values.
>
> The multiple names are particularly inconvenient if files are at different
> depths.
>
> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau  wrote:
>
>> I'm specifically arguing that SELECT * doesn't return the columns.
>>
>> Here is current behavior:
>>
>> /mytdir/mysdir/myfile.json
>> {a:1,b:2,c:3}
>> {a:4,b:5,c:6}
>>
>> select * from `myfile.json`
>>
>> a, b, c
>> 1, 2, 3
>> 4, 5, 6
>>
>> select * from `/mysdir/myfile.json`
>>
>> dir0, a, b, c
>> mysdir, 1, 2, 3
>> mysdir, 4, 5, 6
>>
>> select * from `/mytdir/mysdir/myfile.json`
>>
>> dir0, dir1, a, b, c
>> mytdir, mysdir, 1, 2, 3
>> mytdir, mysdir, 4, 5, 6
>>
>>
>> My proposal:
>>
>> select * from `myfile.json`
>> select * from `/mysdir/myfile.json`
>> select * from `/mytdir/mysdir/myfile.json`
>> ::all produce::
>> a, b, c
>> 1, 2, 3
>> 4, 5, 6
>>
>> select dir0, a, b, c from `/mysdir/myfile.json`
>>
>> dir0, a, b, c
>> mysdir, 1, 2, 3
>> mysdir, 4, 5, 6
>>
>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>>
>> dir0, a, b, c
>> mytdir, 1, 2, 3
>> mytdir, 4, 5, 6
>>
>>
>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
>>
>>> Seems reasonable, as long as SELECT * also returns the dir# columns.
>>>
>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I've been thinking that always showing dir# columns seems to alter data
>>>> returned from Drill depending on how you select the directory.  I'd
>>>> propose that we make it so that we only return dir# columns when they
>>>> are explicitly requested.
>>>>
>>>> Thoughts?


--
Daniel Barclay
MapR Technologies

[jira] [Created] (DRILL-2866) Incorrect error message reporting schema change when streaming aggregation and hash join are disabled

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2866:
---

 Summary: Incorrect error message reporting schema change when 
streaming aggregation and hash join are disabled
 Key: DRILL-2866
 URL: https://issues.apache.org/jira/browse/DRILL-2866
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Data Types
Affects Versions: 0.9.0
Reporter: Victoria Markman
Assignee: Daniel Barclay (Drill)
 Attachments: t1.parquet

alter session set `planner.enable_streamagg` = false;
alter session set `planner.enable_hashjoin` = false;

{code}
0: jdbc:drill:schema=dfs> select  t1.a1,
. . . . . . . . . . . . > t1.b1,
. . . . . . . . . . . . > count(distinct t1.c1) as distinct_c1,
. . . . . . . . . . . . > count(distinct t2.c2) as distinct_c2,
. . . . . . . . . . . . > sum(t1.a1) as sum_a1,
. . . . . . . . . . . . > count(t1.c1) as count_a1,
. . . . . . . . . . . . > count(*) as count_star
. . . . . . . . . . . . > from
. . . . . . . . . . . . > t1,
. . . . . . . . . . . . > t2
. . . . . . . . . . . . > where
. . . . . . . . . . . . > t1.a1 = t2.a2 and t1.b1 = t2.b2
. . . . . . . . . . . . > group by
. . . . . . . . . . . . > t1.a1,
. . . . . . . . . . . . > t1.b1,
. . . . . . . . . . . . > t2.a2,
. . . . . . . . . . . . > t2.b2
. . . . . . . . . . . . > order by
. . . . . . . . . . . . > t1.a1,
. . . . . . . . . . . . > t1.b1,
. . . . . . . . . . . . > t2.a2,
. . . . . . . . . . . . > t2.b2
. . . . . . . . . . . . > ;
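+----+----+-------------+-------------+--------+----------+------------+
| a1 | b1 | distinct_c1 | distinct_c2 | sum_a1 | count_a1 | count_star |
+----+----+-------------+-------------+--------+----------+------------+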
Query failed: SYSTEM ERROR: Hash aggregate does not support schema changes

Fragment 0:0

[10ee2422-d13c-4405-a4b6-a62358f72995 on atsqa4-134.qa.lab:31010]

  (org.apache.drill.exec.exception.SchemaChangeException) Hash aggregate does 
not support schema changes
{code}

copy/paste reproduction
{code}
select  t1.a1,
t1.b1,
count(distinct t1.c1) as distinct_c1,
count(distinct t2.c2) as distinct_c2,
sum(t1.a1) as sum_a1,
count(t1.c1) as count_a1,
count(*) as count_star
from
t1,
t2
where
t1.a1 = t2.a2 and t1.b1 = t2.b2
group by
t1.a1,
t1.b1,
t2.a2,
t2.b2
order by
t1.a1,
t1.b1,
t2.a2,
t2.b2
;
{code}




Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Ted Dunning

I see no problem with not including dir in *.  I can always say something
like:

 select dir, *

Looks strange to a SQL old-timer.  But it makes sense.



On Thu, Apr 23, 2015 at 7:31 PM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Exposing directories in select * queries enables data discovery rather than
> assuming knowledge on user front.
> Making dir an array could be a good option to avoid the multi-column
> issue.
>
> -Neeraja
>
> On Thu, Apr 23, 2015 at 3:57 PM, Ted Dunning 
> wrote:
>
> > I would propose that dir be an array that contains all of the directories
> > rather than having multiple values.
> >
> > The multiple names are particularly inconvenient if files are at
> > different depths.
> >
> >
> >
> > On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
> > wrote:
> >
> > > I'm specifically arguing that SELECT * doesn't return the columns.
> > >
> > > Here is current behavior:
> > >
> > > /mytdir/mysdir/myfile.json
> > > {a:1,b:2,c:3}
> > > {a:4,b:5,c:6}
> > >
> > > select * from `myfile.json`
> > >
> > > a, b, c
> > > 1, 2, 3
> > > 4, 5, 6
> > >
> > > select * from `/mysdir/myfile.json`
> > >
> > > dir0 a, b, c
> > > mysdir, 1, 2, 3
> > > mysdir, 4, 5, 6
> > >
> > > select * from `/mytdir/mysdir/myfile.json`
> > >
> > > dir0, dir1 a, b, c
> > > mytdir, mysdir, 1, 2, 3
> > > mytdir, mysdir, 4, 5, 6
> > >
> > >
> > > 
> > > My proposal:
> > >
> > > select * from `myfile.json`
> > > select * from `/mysdir/myfile.json`
> > > select * from `/mytdir/mysdir/myfile.json`
> > > ::all produce::
> > > a, b, c
> > > 1, 2, 3
> > > 4, 5, 6
> > >
> > > select dir0, a, b, c from `/mysdir/myfile.json`
> > >
> > > dir0 a, b, c
> > > mysdir, 1, 2, 3
> > > mysdir, 4, 5, 6
> > >
> > > select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> > >
> > > dir0 a, b, c
> > > mytdir, 1, 2, 3
> > > mytdir, 4, 5, 6
> > >
> > >
> > >
> > >
> > > On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha 
> wrote:
> > >
> > > > Seems reasonable, as long as SELECT * also returns the dir# columns.
> > > >
> > > > On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
> > > > wrote:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > I've been thinking that always showing dir# columns seems to alter
> > data
> > > > > returned from Drill depending on how you select the directory.  I'd
> > > > propose
> > > > > that we make it so that we only return dir# columns when they are
> > > > > explicitly requested.
> > > > >
> > > > > Thoughts?
> > > > >
> > > >
> > >
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Neeraja Rentachintala

Exposing directories in select * queries enables data discovery rather than
assuming knowledge on user front.
Making dir an array could be a good option to avoid the multi-column
issue.

-Neeraja

On Thu, Apr 23, 2015 at 3:57 PM, Ted Dunning  wrote:

> I would propose that dir be an array that contains all of the directories
> rather than having multiple values.
>
> The multiple names are particularly inconvenient if files are at different
> depths.
>
>
>
> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau 
> wrote:
>
> > I'm specifically arguing that SELECT * doesn't return the columns.
> >
> > Here is current behavior:
> >
> > /mytdir/mysdir/myfile.json
> > {a:1,b:2,c:3}
> > {a:4,b:5,c:6}
> >
> > select * from `myfile.json`
> >
> > a, b, c
> > 1, 2, 3
> > 4, 5, 6
> >
> > select * from `/mysdir/myfile.json`
> >
> > dir0 a, b, c
> > mysdir, 1, 2, 3
> > mysdir, 4, 5, 6
> >
> > select * from `/mytdir/mysdir/myfile.json`
> >
> > dir0, dir1 a, b, c
> > mytdir, mysdir, 1, 2, 3
> > mytdir, mysdir, 4, 5, 6
> >
> >
> > 
> > My proposal:
> >
> > select * from `myfile.json`
> > select * from `/mysdir/myfile.json`
> > select * from `/mytdir/mysdir/myfile.json`
> > ::all produce::
> > a, b, c
> > 1, 2, 3
> > 4, 5, 6
> >
> > select dir0, a, b, c from `/mysdir/myfile.json`
> >
> > dir0 a, b, c
> > mysdir, 1, 2, 3
> > mysdir, 4, 5, 6
> >
> > select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >
> > dir0 a, b, c
> > mytdir, 1, 2, 3
> > mytdir, 4, 5, 6
> >
> >
> >
> >
> > On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
> >
> > > Seems reasonable, as long as SELECT * also returns the dir# columns.
> > >
> > > On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
> > > wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I've been thinking that always showing dir# columns seems to alter
> data
> > > > returned from Drill depending on how you select the directory.  I'd
> > > propose
> > > > that we make it so that we only return dir# columns when they are
> > > > explicitly requested.
> > > >
> > > > Thoughts?
> > > >
> > >
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Ted Dunning

I would propose that dir be an array that contains all of the directories
rather than having multiple values.

The multiple names are particularly inconvenient if files are at different
depths.
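
To show why depth matters, here is a small Python sketch comparing the two shapes for files at different depths under one table root. It is a model only, not Drill code; dirs_array and dir_n_columns are hypothetical names, and padding missing levels with None stands in for SQL NULL.

```python
from pathlib import PurePosixPath


def dirs_array(table_root, file_path):
    """All directory segments between the table root and the file."""
    rel = PurePosixPath(file_path).relative_to(table_root)
    return list(rel.parts[:-1])


def dir_n_columns(table_root, file_paths):
    """Numbered dir0..dirN columns, padded with None for shallower files."""
    arrays = [dirs_array(table_root, p) for p in file_paths]
    width = max((len(a) for a in arrays), default=0)
    return [
        {f"dir{i}": (a[i] if i < len(a) else None) for i in range(width)}
        for a in arrays
    ]


paths = ["root/2015/01/01/log.json", "root/2015/log.json"]
print([dirs_array("root", p) for p in paths])
print(dir_n_columns("root", paths))
```

With the array, each row carries exactly as many segments as its file has; with numbered columns, the column set is fixed by the deepest file and shallower files get nulls.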



On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau  wrote:

> I'm specifically arguing that SELECT * doesn't return the columns.
>
> Here is current behavior:
>
> /mytdir/mysdir/myfile.json
> {a:1,b:2,c:3}
> {a:4,b:5,c:6}
>
> select * from `myfile.json`
>
> a, b, c
> 1, 2, 3
> 4, 5, 6
>
> select * from `/mysdir/myfile.json`
>
> dir0 a, b, c
> mysdir, 1, 2, 3
> mysdir, 4, 5, 6
>
> select * from `/mytdir/mysdir/myfile.json`
>
> dir0, dir1 a, b, c
> mytdir, mysdir, 1, 2, 3
> mytdir, mysdir, 4, 5, 6
>
>
> 
> My proposal:
>
> select * from `myfile.json`
> select * from `/mysdir/myfile.json`
> select * from `/mytdir/mysdir/myfile.json`
> ::all produce::
> a, b, c
> 1, 2, 3
> 4, 5, 6
>
> select dir0, a, b, c from `/mysdir/myfile.json`
>
> dir0 a, b, c
> mysdir, 1, 2, 3
> mysdir, 4, 5, 6
>
> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>
> dir0 a, b, c
> mytdir, 1, 2, 3
> mytdir, 4, 5, 6
>
>
>
>
> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
>
> > Seems reasonable, as long as SELECT * also returns the dir# columns.
> >
> > On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
> > wrote:
> >
> > > Hey guys,
> > >
> > > I've been thinking that always showing dir# columns seems to alter data
> > > returned from Drill depending on how you select the directory.  I'd
> > propose
> > > that we make it so that we only return dir# columns when they are
> > > explicitly requested.
> > >
> > > Thoughts?
> > >
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Jinfeng Ni

I think the new proposal makes sense. It makes the behavior of select *
consistent, only returning the regular columns in the table, regardless how
the table/file is specified in the query.

On Thu, Apr 23, 2015 at 2:56 PM, Jacques Nadeau  wrote:

> I'm specifically arguing that SELECT * doesn't return the columns.
>
> Here is current behavior:
>
> /mytdir/mysdir/myfile.json
> {a:1,b:2,c:3}
> {a:4,b:5,c:6}
>
> select * from `myfile.json`
>
> a, b, c
> 1, 2, 3
> 4, 5, 6
>
> select * from `/mysdir/myfile.json`
>
> dir0 a, b, c
> mysdir, 1, 2, 3
> mysdir, 4, 5, 6
>
> select * from `/mytdir/mysdir/myfile.json`
>
> dir0, dir1 a, b, c
> mytdir, mysdir, 1, 2, 3
> mytdir, mysdir, 4, 5, 6
>
>
> 
> My proposal:
>
> select * from `myfile.json`
> select * from `/mysdir/myfile.json`
> select * from `/mytdir/mysdir/myfile.json`
> ::all produce::
> a, b, c
> 1, 2, 3
> 4, 5, 6
>
> select dir0, a, b, c from `/mysdir/myfile.json`
>
> dir0 a, b, c
> mysdir, 1, 2, 3
> mysdir, 4, 5, 6
>
> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
>
> dir0 a, b, c
> mytdir, 1, 2, 3
> mytdir, 4, 5, 6
>
>
>
>
> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:
>
> > Seems reasonable, as long as SELECT * also returns the dir# columns.
> >
> > On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
> > wrote:
> >
> > > Hey guys,
> > >
> > > I've been thinking that always showing dir# columns seems to alter data
> > > returned from Drill depending on how you select the directory.  I'd
> > propose
> > > that we make it so that we only return dir# columns when they are
> > > explicitly requested.
> > >
> > > Thoughts?
> > >
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Jacques Nadeau

I'm specifically arguing that SELECT * doesn't return the columns.

Here is current behavior:

/mytdir/mysdir/myfile.json
{a:1,b:2,c:3}
{a:4,b:5,c:6}

select * from `myfile.json`

a, b, c
1, 2, 3
4, 5, 6

select * from `/mysdir/myfile.json`

dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select * from `/mytdir/mysdir/myfile.json`

dir0, dir1, a, b, c
mytdir, mysdir, 1, 2, 3
mytdir, mysdir, 4, 5, 6



My proposal:

select * from `myfile.json`
select * from `/mysdir/myfile.json`
select * from `/mytdir/mysdir/myfile.json`
::all produce::
a, b, c
1, 2, 3
4, 5, 6

select dir0, a, b, c from `/mysdir/myfile.json`

dir0, a, b, c
mysdir, 1, 2, 3
mysdir, 4, 5, 6

select dir0, a, b, c from `/mytdir/mysdir/myfile.json`

dir0, a, b, c
mytdir, 1, 2, 3
mytdir, 4, 5, 6
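
The proposed rule can be sketched in a few lines of Python: `*` expands only to the fields stored in the file, and a dir column appears only when it is named in the select list. This is a model of the rule, not Drill code. The function names dir_columns and project are hypothetical, and the sketch assumes the dir columns are the path segments below a chosen table root, which is one possible reading of the examples above.

```python
from pathlib import PurePosixPath


def dir_columns(table_root, file_path):
    """Directory segments below the table root, keyed dir0, dir1, ..."""
    rel = PurePosixPath(file_path).relative_to(table_root)
    return {f"dir{i}": part for i, part in enumerate(rel.parts[:-1])}


def project(record, implicit, requested):
    """Build one output row for a select list.

    '*' expands to the record's own fields only; an implicit column is
    returned only when it is named explicitly.
    """
    row = {}
    for col in requested:
        if col == "*":
            row.update(record)
        elif col in record:
            row[col] = record[col]
        else:
            row[col] = implicit.get(col)  # None when no such directory level
    return row


implicit = dir_columns("/mytdir", "/mytdir/mysdir/myfile.json")
record = {"a": 1, "b": 2, "c": 3}
print(project(record, implicit, ["*"]))
print(project(record, implicit, ["dir0", "*"]))
```

Under this rule the result of `select *` no longer depends on how the path to the file was written, while `select dir0, *` still exposes the directory.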




On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha  wrote:

> Seems reasonable, as long as SELECT * also returns the dir# columns.
>
> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau 
> wrote:
>
> > Hey guys,
> >
> > I've been thinking that always showing dir# columns seems to alter data
> > returned from Drill depending on how you select the directory.  I'd
> propose
> > that we make it so that we only return dir# columns when they are
> > explicitly requested.
> >
> > Thoughts?
> >
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Neeraja Rentachintala

What do you mean by altering the data returned based on how you select the
directory? Can you give an example?

On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:

> Hey guys,
>
> I've been thinking that always showing dir# columns seems to alter data
> returned from Drill depending on how you select the directory.  I'd propose
> that we make it so that we only return dir# columns when they are
> explicitly requested.
>
> Thoughts?
>

Re: Should we make dir* columns only exist when requested?

2015-04-23 Thread Aman Sinha

Seems reasonable, as long as SELECT * also returns the dir# columns.

On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau  wrote:

> Hey guys,
>
> I've been thinking that always showing dir# columns seems to alter data
> returned from Drill depending on how you select the directory.  I'd propose
> that we make it so that we only return dir# columns when they are
> explicitly requested.
>
> Thoughts?
>

[jira] [Created] (DRILL-2865) Drillbit runs out of memory on multiple consecutive CTAS

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2865:
---

 Summary: Drillbit runs out of memory on multiple consecutive CTAS
 Key: DRILL-2865
 URL: https://issues.apache.org/jira/browse/DRILL-2865
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Victoria Markman


Hardware configuration:
- single node
- 64GB RAM
Drill configuration
DRILL_MAX_DIRECT_MEMORY="8G"
DRILL_MAX_HEAP="4G"
`planner.enable_multiphase_agg` = false;
`store.parquet.block-size` = 134217728;
`planner.enable_mux_exchange` = false;
`exec.min_hash_table_size` = 67108864;
`planner.enable_hashagg` = true; 
`planner.width.max_per_node` = 23;


Aggregation query on TPCDS scale factor 1: 
select 
ss_sold_date_sk , 
ss_sold_time_sk , 
ss_item_sk , 
ss_customer_sk , 
ss_cdemo_sk, 
count(*) from store_sales
group by 
ss_sold_date_sk , 
ss_sold_time_sk , 
ss_item_sk , 
ss_customer_sk , 
ss_cdemo_sk
;

1. Executing CTAS with this query and store.format = 'parquet' consistently
fails on iteration #9 with this configuration
2. Ran the query by itself: 47 iterations completed successfully
3. Ran CTAS with this query and store.format = 'csv': 30 iterations did not
reproduce the problem

Attached:
  - drillbit.log
  - scripts.tar (contains script that reproduces OOM)




Should we make dir* columns only exist when requested?

2015-04-23 Thread Jacques Nadeau

Hey guys,

I've been thinking that always showing dir# columns seems to alter data
returned from Drill depending on how you select the directory.  I'd propose
that we make it so that we only return dir# columns when they are
explicitly requested.

Thoughts?

Re: Review Request 33291: DRILL-2782: 2-Core: Decide, implement behavior for transaction-related JDBC methods.

2015-04-23 Thread Parth Chandra


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33291/#review81411
---

Ship it!


LGTM

- Parth Chandra


On April 23, 2015, 9:16 p.m., Daniel Barclay wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33291/
> ---
> 
> (Updated April 23, 2015, 9:16 p.m.)
> 
> 
> Review request for drill, Mehant Baid and Parth Chandra.
> 
> 
> Bugs: DRILL-2782
> https://issues.apache.org/jira/browse/DRILL-2782
> 
> 
> Repository: drill-git
> 
> 
> Description
> ---
> 
> - Added unit test.
> - Added implementations of transaction-related methods:
>   - setAutoCommit - reject attempt to turn auto-commit off
>   - commit - reject when in auto-commit mode (which is always)
>   - rollback - reject when in auto-commit mode (which is always)
>   - other mode and metadata methods - roughly, report "no transactions"
> - Added method declarations with doc. comments in Drill-specific interface.
> - Overrode SQLLine's default transaction isolation level to Drill's 
> TRANSACTION_NONE.
> 
> 
> Diffs
> -
> 
>   distribution/src/resources/sqlline 0852fba 
>   distribution/src/resources/sqlline.bat 755526c 
>   exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnection.java a52644d 
>   exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionImpl.java 
> 3fdbf84 
>   
> exec/jdbc/src/test/java/org/apache/drill/jdbc/ConnectionTransactionMethodsTest.java
>  PRE-CREATION 
>   exec/jdbc/src/test/java/org/apache/drill/jdbc/DatabaseMetaDataTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/33291/diff/
> 
> 
> Testing
> ---
> 
> Ran new specific tests.
> 
> Ran existing tests; no new problems.
> 
> 
> Thanks,
> 
> Daniel Barclay
> 
>

Re: Review Request 33291: DRILL-2782: 2-Core: Decide, implement behavior for transaction-related JDBC methods.

2015-04-23 Thread Daniel Barclay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33291/
---

(Updated April 23, 2015, 9:16 p.m.)


Review request for drill, Mehant Baid and Parth Chandra.


Changes
---

Adjusted SQLLine scripts to avoid error message; updated a few missed comments.


Bugs: DRILL-2782
https://issues.apache.org/jira/browse/DRILL-2782


Repository: drill-git


Description (updated)
---

- Added unit test.
- Added implementations of transaction-related methods:
  - setAutoCommit - reject attempt to turn auto-commit off
  - commit - reject when in auto-commit mode (which is always)
  - rollback - reject when in auto-commit mode (which is always)
  - other mode and metadata methods - roughly, report "no transactions"
- Added method declarations with doc. comments in Drill-specific interface.
- Overrode SQLLine's default transaction isolation level to Drill's 
TRANSACTION_NONE.
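
The behavior listed above can be illustrated with a small sketch. This is not the code under review: the class name NoTransactionSketch and the choice of exception types are assumptions made for illustration; only java.sql.Connection.TRANSACTION_NONE and the java.sql exception classes are standard JDBC.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;

// Hypothetical stand-in for a connection that never supports transactions.
final class NoTransactionSketch {

    // Auto-commit is always on, so there is no state to track.
    static boolean getAutoCommit() {
        return true;
    }

    // Turning auto-commit on is a no-op; turning it off is rejected.
    static void setAutoCommit(boolean autoCommit) throws SQLException {
        if (!autoCommit) {
            throw new SQLFeatureNotSupportedException(
                "Cannot turn auto-commit off: transactions are not supported.");
        }
    }

    // Always rejected, because the connection is always in auto-commit mode.
    static void commit() throws SQLException {
        throw new SQLException("Cannot commit: connection is in auto-commit mode.");
    }

    // Always rejected, for the same reason as commit().
    static void rollback() throws SQLException {
        throw new SQLException("Cannot roll back: connection is in auto-commit mode.");
    }

    // Metadata-style methods report "no transactions".
    static int getTransactionIsolation() {
        return Connection.TRANSACTION_NONE;
    }

    public static void main(String[] args) {
        try {
            setAutoCommit(false);
        } catch (SQLException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```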


Diffs (updated)
-

  distribution/src/resources/sqlline 0852fba
  distribution/src/resources/sqlline.bat 755526c
  exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnection.java a52644d
  exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionImpl.java 3fdbf84
  exec/jdbc/src/test/java/org/apache/drill/jdbc/ConnectionTransactionMethodsTest.java PRE-CREATION
  exec/jdbc/src/test/java/org/apache/drill/jdbc/DatabaseMetaDataTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/33291/diff/


Testing
---

Ran new specific tests.

Ran existing tests; no new problems.


Thanks,

Daniel Barclay

[jira] [Created] (DRILL-2864) Unable to cast string literal with the valid value in ISO 8601 format to interval

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2864:
---

 Summary: Unable to cast string literal with the valid value in ISO 
8601 format to interval
 Key: DRILL-2864
 URL: https://issues.apache.org/jira/browse/DRILL-2864
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Data Types
Affects Versions: 0.9.0
Reporter: Victoria Markman
Assignee: Daniel Barclay (Drill)


{code}
0: jdbc:drill:schema=dfs> select cast('P1D' as interval day) from t1;
Query failed: PARSE ERROR: From line 1, column 8 to line 1, column 34: Cast 
function cannot convert value of type CHAR(3) to type INTERVAL DAY

[744f1f35-f8c5-46ba-80f9-0efd87036903 on atsqa4-134.qa.lab:31010]
Error: exception while executing query: Failure while executing query. 
(state=,code=0)
{code}

Workaround: cast to varchar.
{code}
0: jdbc:drill:schema=dfs> select cast(cast('P1D' as varchar(3)) as interval 
day) from t1;
+------------+
|   EXPR$0   |
+------------+
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
+------------+
10 rows selected (0.191 seconds)
{code}
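
For reference, 'P1D' is a well-formed ISO 8601 period of one day. The sketch below checks that with the JDK's java.time.Period, which is an assumption chosen for illustration and not necessarily the parser Drill uses; the class name Iso8601DaySketch is hypothetical.

```java
import java.time.Period;

// Hypothetical helper: shows that "P1D" is a valid ISO 8601 period of one day.
final class Iso8601DaySketch {

    // Parses an ISO 8601 period string and returns its day component.
    static int daysOf(String iso8601) {
        return Period.parse(iso8601).getDays();
    }

    public static void main(String[] args) {
        System.out.println(daysOf("P1D")); // prints 1
    }
}
```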



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: Review Request 33442: DRILL-2811: Allow direct connection to drillbit from DrillClient

2015-04-23 Thread Hanifi Gunes


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33442/#review81398
---

Ship it!


Ship It!

- Hanifi Gunes


On April 23, 2015, 8:51 p.m., Parth Chandra wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33442/
> ---
> 
> (Updated April 23, 2015, 8:51 p.m.)
> 
> 
> Review request for drill, Daniel Barclay, Hanifi Gunes, and Mehant Baid.
> 
> 
> Repository: drill-git
> 
> 
> Description
> ---
> 
> DRILL-2811: Allow direct connection to drillbit from DrillClient
> 
> 
> Diffs
> -
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java 0d29f60
>   exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionConfig.java de08cda
>   exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionImpl.java 3fdbf84
> 
> Diff: https://reviews.apache.org/r/33442/diff/
> 
> 
> Testing
> ---
> 
> Tested using sqlline
> 
> As the connection string use : 
> 
> sqlline -u "jdbc:drill:local=localhost:31010"  -n admin -p admin
> 
> 
> Thanks,
> 
> Parth Chandra
> 
>

[jira] [Created] (DRILL-2863) Slow code generation/compilation(/scalar replacement?) for getColumns(...) query

2015-04-23 Thread Daniel Barclay (Drill) (JIRA)

Daniel Barclay (Drill) created DRILL-2863:
-

 Summary: Slow code generation/compilation(/scalar replacement?) 
for getColumns(...) query
 Key: DRILL-2863
 URL: https://issues.apache.org/jira/browse/DRILL-2863
 Project: Apache Drill
  Issue Type: Bug
Reporter: Daniel Barclay (Drill)


Calling Drill's JDBC driver's DatabaseMetaData.getColumns(...) method seems to 
take an unusually long time to execute.

Unit tests TestJdbcMetadata and 
Drill2128GetColumnsDataTypeNotTypeCodeIntBugsTest have gotten slower recently, 
seemingly in several increments:  They needed their timeouts increased, from 
around 50 s to 90 s, and then to 120 s, and that 120 s timeout is not long 
enough for reliable runs (at least on my machine).

From looking at the logs (with sufficiently verbose logging), it seems that 
the large SQL query in the implementation of getColumns() (currently in 
org.apache.drill.jdbc.MetaImpl) leads to 513 kB of generated code.

That half a megabyte of generated Java code frequently takes around 110 seconds 
to compile (on my machine). 






Re: Review Request 33442: DRILL-2811: Allow direct connection to drillbit from DrillClient

2015-04-23 Thread Parth Chandra


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33442/
---

(Updated April 23, 2015, 8:51 p.m.)


Review request for drill, Daniel Barclay, Hanifi Gunes, and Mehant Baid.


Changes
---

Changed the connection string to :
sqlline -u "jdbc:drill:drillbit=localhost:31010"  -n admin -p admin

Addressed other review comments.
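
To make the shape of the new connection string concrete, here is a minimal sketch that splits it into host and port. It is not the parsing code from the patch: the class DirectConnectSketch and its method are hypothetical, and it handles only the single "drillbit=host:port" form shown above.

```java
// Hypothetical parser for the "jdbc:drill:drillbit=host:port" form only.
final class DirectConnectSketch {

    static final String PREFIX = "jdbc:drill:drillbit=";

    // Returns {host, port}; rejects URLs that do not use the drillbit= form.
    static String[] hostAndPort(String url) {
        if (!url.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a direct drillbit URL: " + url);
        }
        String target = url.substring(PREFIX.length());
        int colon = target.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing port in: " + url);
        }
        return new String[] { target.substring(0, colon), target.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPort("jdbc:drill:drillbit=localhost:31010");
        System.out.println(parts[0] + " " + parts[1]); // prints "localhost 31010"
    }
}
```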


Repository: drill-git


Description
---

DRILL-2811: Allow direct connection to drillbit from DrillClient


Diffs (updated)
-

  exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java 0d29f60
  exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionConfig.java de08cda
  exec/jdbc/src/main/java/org/apache/drill/jdbc/DrillConnectionImpl.java 3fdbf84

Diff: https://reviews.apache.org/r/33442/diff/


Testing
---

Tested using sqlline

As the connection string use : 

sqlline -u "jdbc:drill:local=localhost:31010"  -n admin -p admin


Thanks,

Parth Chandra

[jira] [Created] (DRILL-2861) enhance drill profile file management

2015-04-23 Thread Chun Chang (JIRA)

Chun Chang created DRILL-2861:
-

 Summary: enhance drill profile file management
 Key: DRILL-2861
 URL: https://issues.apache.org/jira/browse/DRILL-2861
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Flow
Affects Versions: 0.9.0
Reporter: Chun Chang
Assignee: Chris Westin


We need to manage profile files better. Currently each query creates one 
profile file on the local filesystem of the foreman node. You can imagine how 
this can quickly get out of hand in a production environment.

We need:

1. the ability to turn profiling on and off, preferably on the fly
2. profile files to be managed the same way as log files
3. the ability to change the default file location, for example to a distributed filesystem




[jira] [Created] (DRILL-2862) Convert_to/Convert_From throw assertion when an incorrect encoding type is specified

2015-04-23 Thread Neeraja (JIRA)

Neeraja created DRILL-2862:
--

 Summary: Convert_to/Convert_From throw assertion when an incorrect 
encoding type is specified
 Key: DRILL-2862
 URL: https://issues.apache.org/jira/browse/DRILL-2862
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Data Types
Reporter: Neeraja
Assignee: Daniel Barclay (Drill)


Below is the error from SQLLine. Replacing UTF-8 with UTF8 works fine.
The error message needs to accurately represent the problem.


0: jdbc:drill:> select Convert_from(t.address.state,'UTF-8') from customers t 
limit 10;
Query failed: AssertionError: 

Error: exception while executing query: Failure while executing query. 
(state=,code=0)
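
One way to get a descriptive message in place of a bare AssertionError is to validate the encoding name up front. The sketch below is only an illustration of that idea: the class EncodingCheckSketch is hypothetical, and the KNOWN set is an illustrative subset, not Drill's actual list of supported encoding types.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical validation step; KNOWN is an illustrative subset only.
final class EncodingCheckSketch {

    static final Set<String> KNOWN = new TreeSet<>(Arrays.asList("JSON", "UTF8"));

    // Returns the name if it is known, otherwise fails with a message
    // that names the bad value and lists the accepted ones.
    static String requireKnownEncoding(String name) {
        if (!KNOWN.contains(name)) {
            throw new IllegalArgumentException(
                "Unknown encoding type '" + name + "'. Known types: " + KNOWN);
        }
        return name;
    }

    public static void main(String[] args) {
        try {
            requireKnownEncoding("UTF-8");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```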




Apache Drill Plan Syntax

2015-04-23 Thread Alexander Zarei (Google Docs)

Alexander Zarei added comments to Apache Drill Plan Syntax  
(https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit?disco=AZxFeOI)


Alexander Zarei
| From there
Afterward,

Open: https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit?disco=AZxFeOc


Alexander Zarei
| handed to a query parse
The query is submitted by a client and received by some component? What is 
that component?

Open: https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit?disco=AZxFeOY


Alexander Zarei
| common vocabulary
terminology?

Open: https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit?disco=AZxFeOM

[jira] [Created] (DRILL-2860) Unable to cast integer column from parquet file to interval day

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2860:
---

 Summary: Unable to cast integer column from parquet file to 
interval day
 Key: DRILL-2860
 URL: https://issues.apache.org/jira/browse/DRILL-2860
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Data Types
Reporter: Victoria Markman
Assignee: Daniel Barclay (Drill)


I can cast numeric literal to "interval day":
{code}
0: jdbc:drill:schema=dfs> select cast(1 as interval day) from t1;
+------------+
|   EXPR$0   |
+------------+
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
| P1D        |
+------------+
10 rows selected (0.122 seconds)
{code}

I get an error when I try to do the same with a column from a parquet file:
{code}
0: jdbc:drill:schema=dfs> select cast(a1 as interval day) from t1 where a1 = 1;
Query failed: SYSTEM ERROR: Invalid format: "1"

Fragment 0:0

[6a4adf04-f3db-4feb-8010-ebc3bfced1e3 on atsqa4-134.qa.lab:31010]

  (java.lang.IllegalArgumentException) Invalid format: "1"
org.joda.time.format.PeriodFormatter.parseMutablePeriod():326
org.joda.time.format.PeriodFormatter.parsePeriod():304
org.joda.time.Period.parse():92
org.joda.time.Period.parse():81
org.apache.drill.exec.test.generated.ProjectorGen180.doEval():77
org.apache.drill.exec.test.generated.ProjectorGen180.projectRecords():62
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():170
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():130
org.apache.drill.exec.record.AbstractRecordBatch.next():144

org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():118
org.apache.drill.exec.physical.impl.BaseRootExec.next():74
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():80
org.apache.drill.exec.physical.impl.BaseRootExec.next():64
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():198
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():192
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1469
org.apache.drill.exec.work.fragment.FragmentExecutor.run():192
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745

Error: exception while executing query: Failure while executing query. 
(state=,code=0)
{code}

If I try casting a1 to an integer first, I run into DRILL-2859.
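
The stack trace shows the value "1" being handed to a period parser, which expects ISO 8601 text such as "P1D". The sketch below reproduces that distinction with the JDK's java.time.Period; the stack trace above shows Joda-Time, so the use of java.time here is a substitution for illustration, and the class IntToIntervalSketch is hypothetical.

```java
import java.time.Period;
import java.time.format.DateTimeParseException;

// Hypothetical sketch: a bare number is not ISO 8601 period text,
// so an integer has to be converted numerically, not parsed as a string.
final class IntToIntervalSketch {

    // True only if the text is a valid ISO 8601 period.
    static boolean parsesAsPeriod(String text) {
        try {
            Period.parse(text);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Numeric path: build the interval directly from the day count.
    static Period daysFromInt(int days) {
        return Period.ofDays(days);
    }

    public static void main(String[] args) {
        System.out.println(parsesAsPeriod("1"));   // prints false
        System.out.println(parsesAsPeriod("P1D")); // prints true
        System.out.println(daysFromInt(1));        // prints P1D
    }
}
```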




[jira] [Created] (DRILL-2859) Unexpected exception in the query with an interval data type

2015-04-23 Thread Victoria Markman (JIRA)

Victoria Markman created DRILL-2859:
---

 Summary: Unexpected exception in the query with an interval data 
type
 Key: DRILL-2859
 URL: https://issues.apache.org/jira/browse/DRILL-2859
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 0.9.0
Reporter: Victoria Markman
Assignee: Jinfeng Ni


{code}
0: jdbc:drill:schema=dfs> select cast(cast(a1 as int) as interval day) from t1 
where a1 = 1;
Query failed: SYSTEM ERROR: Unexpected exception during fragment 
initialization: todo: implement syntax 
SPECIAL(Reinterpret(*(Reinterpret(CAST(CAST($0):INTEGER):DECIMAL(2, 0)), 
8640)))


[5119315b-dd73-432f-ab93-49e76e9165f6 on atsqa4-134.qa.lab:31010]

  (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception 
during fragment initialization: todo: implement syntax 
SPECIAL(Reinterpret(*(Reinterpret(CAST(CAST($0):INTEGER):DECIMAL(2, 0)), 
8640)))
org.apache.drill.exec.work.foreman.Foreman.run():212
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745
  Caused By (java.lang.AssertionError) todo: implement syntax 
SPECIAL(Reinterpret(*(Reinterpret(CAST(CAST($0):INTEGER):DECIMAL(2, 0)), 
8640)))
org.apache.drill.exec.planner.logical.DrillOptiq$RexToDrill.visitCall():182
org.apache.drill.exec.planner.logical.DrillOptiq$RexToDrill.visitCall():73
org.apache.calcite.rex.RexCall.accept():107
org.apache.drill.exec.planner.logical.DrillOptiq.toDrill():70

org.apache.drill.exec.planner.common.DrillProjectRelBase.getProjectExpressions():111
org.apache.drill.exec.planner.physical.ProjectPrel.getPhysicalOperator():57
org.apache.drill.exec.planner.physical.ScreenPrel.getPhysicalOperator():51

org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToPop():376
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():157
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():167
org.apache.drill.exec.work.foreman.Foreman.runSQL():773
org.apache.drill.exec.work.foreman.Foreman.run():203
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745

Error: exception while executing query: Failure while executing query. 
(state=,code=0)
{code}




[jira] [Created] (DRILL-2858) Refactor hash expression construction in InsertLocalExchangeVisitor and PrelUtil into one place

2015-04-23 Thread Venki Korukanti (JIRA)

Venki Korukanti created DRILL-2858:
--

 Summary: Refactor hash expression construction in 
InsertLocalExchangeVisitor and PrelUtil into one place
 Key: DRILL-2858
 URL: https://issues.apache.org/jira/browse/DRILL-2858
 Project: Apache Drill
  Issue Type: Bug
Reporter: Venki Korukanti
Assignee: Venki Korukanti


Currently there are two places where we construct the hash expression based on 
the partition fields:
1. InsertLocalExchangeVisitor (generates RexExpr type)
2. PrelUtil.getHashExpression (generates LogicalExpression type)

Having this logic in two places is error-prone: the two copies can easily 
go out of sync, causing hard-to-debug verification failures.






[jira] [Created] (DRILL-2857) Update the StreamingAggBatch current workspace record counter variable type to "long" from current type "int"

2015-04-23 Thread Venki Korukanti (JIRA)

Venki Korukanti created DRILL-2857:
--

 Summary: Update the StreamingAggBatch current workspace record 
counter variable type to "long" from current type "int"
 Key: DRILL-2857
 URL: https://issues.apache.org/jira/browse/DRILL-2857
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 0.8.0
Reporter: Venki Korukanti
Assignee: Venki Korukanti
 Fix For: 0.9.0


This is causing invalid results in cases where the incoming batch has more than 
(2^31) - 1 records, because the "int" counter overflows.

Example query (make sure the nested query returns more than (2^31) - 1 records):
{code}
SELECT count(*) FROM
  (SELECT L_ORDERKEY,
          L_PARTKEY,
          L_SUPPKEY,
          count(*),
          count(l_quantity)
   FROM dfs.`lineitem`
   GROUP BY
          L_ORDERKEY,
          L_PARTKEY,
          L_SUPPKEY
  );
{code}
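
The overflow itself is plain Java integer arithmetic and can be shown in isolation. The sketch below is not the StreamingAggBatch code; the class CounterOverflowSketch and its method names are hypothetical.

```java
// Hypothetical sketch of the counter overflow, independent of Drill.
final class CounterOverflowSketch {

    // An int counter silently wraps around past (2^31) - 1.
    static int addAsInt(int count, int increment) {
        return count + increment;
    }

    // A long counter keeps counting correctly.
    static long addAsLong(long count, int increment) {
        return count + increment;
    }

    public static void main(String[] args) {
        System.out.println(addAsInt(Integer.MAX_VALUE, 1));  // prints -2147483648
        System.out.println(addAsLong(Integer.MAX_VALUE, 1)); // prints 2147483648
    }
}
```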




[jira] [Created] (DRILL-2856) StreamingAggBatch goes into infinite loop due to state management issues

2015-04-23 Thread Venki Korukanti (JIRA)

Venki Korukanti created DRILL-2856:
--

 Summary: StreamingAggBatch goes into infinite loop due to state 
management issues
 Key: DRILL-2856
 URL: https://issues.apache.org/jira/browse/DRILL-2856
 Project: Apache Drill
  Issue Type: Bug
Reporter: Venki Korukanti
Assignee: Steven Phillips







[jira] [Created] (DRILL-2855) Fix invalid result issues with StreamingAggBatch

2015-04-23 Thread Venki Korukanti (JIRA)

Venki Korukanti created DRILL-2855:
--

 Summary: Fix invalid result issues with StreamingAggBatch
 Key: DRILL-2855
 URL: https://issues.apache.org/jira/browse/DRILL-2855
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 0.8.0
Reporter: Venki Korukanti
Assignee: Venki Korukanti
 Fix For: 0.9.0


There are two issues that are causing invalid results:
1. Under some conditions we fail to add the record to the current aggregation 
workspace, either around a batch boundary or when the output batch is full.
2. Incorrectly cleaning up the previous batch. Currently we keep a reference to 
the current batch in "previous" and keep fetching incoming batches until we get 
one that has more than zero records or there are no more incoming batches. If 
the next incoming batch has zero records, we clean up the "previous" batch.




[jira] [Created] (DRILL-2854) KVGEN function document needs to be corrected

2015-04-23 Thread Khurram Faraaz (JIRA)

Khurram Faraaz created DRILL-2854:
-

 Summary: KVGEN function document needs to be corrected
 Key: DRILL-2854
 URL: https://issues.apache.org/jira/browse/DRILL-2854
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.0
Reporter: Khurram Faraaz
Assignee: Bridget Bevens


The sample JSON data snippet at http://drill.apache.org/docs/kvgen/ needs 
to be corrected. In addition, the SQL query that produces the output listed in 
that section could be added to it.

Here is the correct representation of the data in the JSON data file. Currently 
the text on the documentation page is missing the "rec1" identifier, and the 
opening and closing braces for "rec1" are also missing.

{code}
[root@centos-01 bin]# hadoop fs -cat /tmp/simplemaps.json
{"rec1":{"a": "valA", "b": "valB"}}
{"rec1":{"c": "valC", "d": "valD"}}
{code}

Here is a query that produces the desired output, like the output shown in the 
documentation section for the KVGEN function.

{code}
0: jdbc:drill:> select kvgen(rec1) from `simplemaps.json`;
+------------+
|   EXPR$0   |
+------------+
| [{"key":"a","value":"valA"},{"key":"b","value":"valB"}] |
| [{"key":"c","value":"valC"},{"key":"d","value":"valD"}] |
+------------+
2 rows selected (0.201 seconds)
{code}
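
The transformation in the output above, from a map to a list of key/value pairs, can be sketched in a few lines. This is only a model of the result shape, not Drill's KVGEN implementation; the class KvgenSketch is hypothetical and handles string values only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the KVGEN result shape for a flat map of strings.
final class KvgenSketch {

    // Turns {"a": "valA", "b": "valB"} into
    // [{key=a, value=valA}, {key=b, value=valB}], preserving entry order.
    static List<Map<String, String>> kvgen(Map<String, String> rec) {
        List<Map<String, String>> pairs = new ArrayList<>();
        for (Map.Entry<String, String> entry : rec.entrySet()) {
            Map<String, String> pair = new LinkedHashMap<>();
            pair.put("key", entry.getKey());
            pair.put("value", entry.getValue());
            pairs.add(pair);
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> rec1 = new LinkedHashMap<>();
        rec1.put("a", "valA");
        rec1.put("b", "valB");
        System.out.println(kvgen(rec1));
        // prints [{key=a, value=valA}, {key=b, value=valB}]
    }
}
```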




[jira] [Resolved] (DRILL-2761) ParquetGroupScan copy constructor only copy reference, leading to out-sync ParquetGroupScan instance.

2015-04-23 Thread Jinfeng Ni (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinfeng Ni resolved DRILL-2761.
---
Resolution: Fixed

commit id: e462d14e63e4b396935f611cba5183c6f5d62a8f

> ParquetGroupScan copy constructor only copy reference, leading to out-sync 
> ParquetGroupScan instance.
> -
>
> Key: DRILL-2761
> URL: https://issues.apache.org/jira/browse/DRILL-2761
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Jinfeng Ni
>Assignee: Jinfeng Ni
> Attachments: 
> 0003-DRILL-2761-ParquetGroupScan-s-copy-constructor-shoul.patch
>
>
> ParquetGroupScan has one copy constructor, which is used in the project 
> pushdown rule and the partition pruning rule to clone a modified version of 
> the original ParquetGroupScan instance. However, the copy constructor only 
> copies the references to several Collections. This means that if the cloned 
> instance modifies those collections, it also modifies the contents of the 
> collections in the original ParquetGroupScan instance, leaving the original 
> instance in an invalid state. Such an invalid state can lead to incorrect 
> query results. 
> For instance, consider the query:
> {code}
> select O_ORDERKEY,O_CUSTKEY,O_CLERK,O_COMMENT,dir0 
> from `/drill/testdata/partition_pruning/dfs/orders` 
> where (dir0=1993)
> {code}
> Assume the data is partitioned by year (1993, 1994, 1995). Depending on the 
> order in which the RelOptRules fire, a ParquetGroupScan could end up with its 
> "rowGroupInfos" list and "entries" list out of sync. The optimizer then thinks 
> that the partition filter has been pushed down, so "entries" is modified and 
> the filter is removed from the plan, yet "rowGroupInfos" is still the original 
> list. This makes the query return unwanted rows.
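
The difference between copying a collection reference and copying the collection can be shown with a small stand-in class. ScanSketch below is hypothetical and much simpler than ParquetGroupScan; it only demonstrates why a copy that shares a list lets the clone corrupt the original.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a scan object that holds a mutable list.
final class ScanSketch {

    final List<String> entries;

    ScanSketch(List<String> entries) {
        this.entries = entries;
    }

    // Shares the same list: changes through the copy reach the original.
    static ScanSketch shallowCopy(ScanSketch that) {
        return new ScanSketch(that.entries);
    }

    // Copies the list itself: the original is isolated from the copy.
    static ScanSketch defensiveCopy(ScanSketch that) {
        return new ScanSketch(new ArrayList<>(that.entries));
    }

    public static void main(String[] args) {
        ScanSketch original =
            new ScanSketch(new ArrayList<>(Arrays.asList("1993", "1994", "1995")));

        defensiveCopy(original).entries.remove("1994");
        System.out.println(original.entries); // prints [1993, 1994, 1995]

        shallowCopy(original).entries.remove("1994");
        System.out.println(original.entries); // prints [1993, 1995]
    }
}
```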




[jira] [Resolved] (DRILL-1384) Rebase Drill on Calcite v1.0

2015-04-23 Thread Jinfeng Ni (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinfeng Ni resolved DRILL-1384.
---
Resolution: Fixed

commit id: e99f270322ec17580e728bf28a20b978a7fbdf8b


> Rebase Drill on Calcite v1.0
> 
>
> Key: DRILL-1384
> URL: https://issues.apache.org/jira/browse/DRILL-1384
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 0.5.0
>Reporter: Jacques Nadeau
>Assignee: Jinfeng Ni
>Priority: Blocker
> Fix For: 0.9.0
>
>
> This is a tracking item to ensure that all changes that are done to Drill's 
> Optiq branch are pushed back (as possible) into the mainline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (DRILL-2140) RPC Error querying JSON with empty nested maps

2015-04-23 Thread Sudheesh Katkam (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-2140.

Resolution: Fixed

> RPC Error querying JSON with empty nested maps
> --
>
> Key: DRILL-2140
> URL: https://issues.apache.org/jira/browse/DRILL-2140
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 0.7.0
> Environment: Centos 4 node MapR cluster
>Reporter: Andries Engelbrecht
>Assignee: Sudheesh Katkam
> Fix For: 1.0.0
>
> Attachments: drillbit.log
>
>
> When querying a large number of documents in multiple directories, with 
> multiple JSON files in each, where some documents lack the top-level map that 
> is used in a predicate, Drill produces an RPC error in the log.
> Query
> {code}
> > select t.retweeted_status.`user`.name as name, 
> > count(t.retweeted_status.favorited) as rt_count from `./nfl` t where 
> > t.retweeted_status.`user`.name is not null group by 
> > t.retweeted_status.`user`.name order by count(t.retweeted_status.favorited) 
> > desc limit 10;
> Query failed: Query failed: Failure while running fragment., index: 0, 
> length: 1 (expected: range(0, 0)) [ b96e3bfa-74c9-4b78-886b-9a2c3fc4ea9b on 
> se-node13.se.lab:31010 ]
> [ b96e3bfa-74c9-4b78-886b-9a2c3fc4ea9b on se-node13.se.lab:31010 ]
> {code}
> Drillbit log attached



