Re: NumberFormatException with cast to double?

2016-03-10 Thread Abdel Hakim Deneche
Looks like the COALESCE function is the source of the problem. Passing a
double (0.0) instead of an int (0) as the second expression solved the
problem for me:

CAST(COALESCE(t_total, 0.0) AS double)
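
The likely mechanism, judging from the varCharToInt() frame in Matt's stack
trace: with an integer literal, the rewritten expression appears to coerce the
varchar column to INT before the outer CAST ever runs, and "140.755" fails that
parse. A minimal sketch of the two forms (paths illustrative, untested):

~~~
-- Fails: the varchar column is implicitly parsed as an integer first
SELECT CAST(COALESCE(t_total, 0) AS double) AS t_total
FROM dfs.`/caspr/csv/smalltest.csv`;

-- Works: both branches already resolve to DOUBLE
SELECT CAST(COALESCE(t_total, 0.0) AS double) AS t_total
FROM dfs.`/caspr/csv/smalltest.csv`;
~~~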


On Fri, Mar 11, 2016 at 12:45 AM, Matt  wrote:

> ~~~
> 00-01  Project(date_tm=[CAST($23):TIMESTAMP(0)],
> id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
> "ISO-8859-1$en_US$primary"], id_2=[CAST($15):VARCHAR(1) CHARACTER SET
> "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"],
> id_3=[CAST($33):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
> "ISO-8859-1$en_US$primary"], b_total=[CAST(CASE(IS NOT NULL($4), $4,
> 0)):BIGINT], t_total=[CAST(CASE(IS NOT NULL($31), $31, 0)):DOUBLE],
> h_total=[CAST(CASE(IS NOT NULL($40), $40, 0)):BIGINT],
> b_small=[CAST(CASE(IS NOT NULL($36), $36, 0)):BIGINT],
> t_small=[CAST(CASE(IS NOT NULL($14), $14, 0)):DOUBLE],
> h_small=[CAST(CASE(IS NOT NULL($38), $38, 0)):BIGINT],
> b_18000=[CAST(CASE(IS NOT NULL($32), $32, 0)):BIGINT],
> t_18000=[CAST(CASE(IS NOT NULL($24), $24, 0)):DOUBLE],
> h_18000=[CAST(CASE(IS NOT NULL($27), $27, 0)):BIGINT],
> b_12000=[CAST(CASE(IS NOT NULL($30), $30, 0)):BIGINT],
> t_12000=[CAST(CASE(IS NOT NULL($28), $28, 0)):DOUBLE],
> h_12000=[CAST(CASE(IS NOT NULL($20), $20, 0)):BIGINT], b_6000=[CAST(CASE(IS
> NOT NULL($41), $41, 0)):BIGINT], t_6000=[CAST(CASE(IS NOT NULL($37), $37,
> 0)):DOUBLE], h_6000=[CAST(CASE(IS NOT NULL($29), $29, 0)):BIGINT],
> b_3000=[CAST(CASE(IS NOT NULL($17), $17, 0)):BIGINT], t_3000=[CAST(CASE(IS
> NOT NULL($7), $7, 0)):DOUBLE], h_3000=[CAST(CASE(IS NOT NULL($1), $1,
> 0)):BIGINT], b_2000=[CAST(CASE(IS NOT NULL($26), $26, 0)):BIGINT],
> t_2000=[CAST(CASE(IS NOT NULL($34), $34, 0)):DOUBLE], h_2000=[CAST(CASE(IS
> NOT NULL($10), $10, 0)):BIGINT], b_1500=[CAST(CASE(IS NOT NULL($42), $42,
> 0)):BIGINT], t_1500=[CAST(CASE(IS NOT NULL($13), $13, 0)):DOUBLE],
> h_1500=[CAST(CASE(IS NOT NULL($3), $3, 0)):BIGINT], b_1250=[CAST(CASE(IS
> NOT NULL($21), $21, 0)):BIGINT], t_1250=[CAST(CASE(IS NOT NULL($25), $25,
> 0)):DOUBLE], h_1250=[CAST(CASE(IS NOT NULL($16), $16, 0)):BIGINT],
> b_1000=[CAST(CASE(IS NOT NULL($12), $12, 0)):BIGINT], t_1000=[CAST(CASE(IS
> NOT NULL($19), $19, 0)):DOUBLE], h_1000=[CAST(CASE(IS NOT NULL($6), $6,
> 0)):BIGINT], b_750=[CAST(CASE(IS NOT NULL($9), $9, 0)):BIGINT],
> t_750=[CAST(CASE(IS NOT NULL($0), $0, 0)):DOUBLE], h_750=[CAST(CASE(IS NOT
> NULL($5), $5, 0)):BIGINT], b_500=[CAST(CASE(IS NOT NULL($2), $2,
> 0)):BIGINT], t_500=[CAST(CASE(IS NOT NULL($8), $8, 0)):DOUBLE],
> h_500=[CAST(CASE(IS NOT NULL($39), $39, 0)):BIGINT], b_0=[CAST(CASE(IS NOT
> NULL($18), $18, 0)):BIGINT], t_0=[CAST(CASE(IS NOT NULL($35), $35,
> 0)):DOUBLE], EXPR$42=[CAST(CASE(IS NOT NULL($22), $22, 0)):BIGINT])
> 00-02Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://es05:54310/caspr/csv/smalltest.csv, numFiles=1,
> columns=[`date_tm`, `id_1`, `id_2`, `id_3`, `b_total`, `t_total`,
> `h_total`, `b_small`, `t_small`, `h_small`, `b_18000`, `t_18000`,
> `h_18000`, `b_12000`, `t_12000`, `h_12000`, `b_6000`, `t_6000`, `h_6000`,
> `b_3000`, `t_3000`, `h_3000`, `b_2000`, `t_2000`, `h_2000`, `b_1500`,
> `t_1500`, `h_1500`, `b_1250`, `t_1250`, `h_1250`, `b_1000`, `t_1000`,
> `h_1000`, `b_750`, `t_750`, `h_750`, `b_500`, `t_500`, `h_500`, `b_0`,
> `t_0`, `h_0`], files=[hdfs://es05:54310/caspr/csv/smalltest.csv]]])
> ~~~
>
> ~~~
> {
>   "head" : {
> "version" : 1,
> "generator" : {
>   "type" : "ExplainHandler",
>   "info" : ""
> },
> "type" : "APACHE_DRILL_PHYSICAL",
> "options" : [ {
>   "kind" : "STRING",
>   "type" : "SESSION",
>   "name" : "store.format",
>   "string_val" : "parquet"
> }, {
>   "kind" : "BOOLEAN",
>   "type" : "SESSION",
>   "name" : "exec.errors.verbose",
>   "bool_val" : true
> } ],
> "queue" : 0,
> "resultMode" : "EXEC"
>   },
>   "graph" : [ {
> "pop" : "fs-scan",
> "@id" : 2,
> "userName" : "hduser",
> "files" : [ "hdfs://es05:54310/caspr/csv/smalltest.csv" ],
> "storage" : {
>   "type" : "file",
>   "enabled" : true,
>   "connection" : "hdfs://es05:54310",
>   "workspaces" : {
> "root" : {
>   "location" : "/",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "tmp" : {
>   "location" : "/tmp/",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "caspr" : {
>   "location" : "/caspr",
>   "writable" : true,
>   "defaultInputFormat" : "csv"
> }
>   "ref" : "`t_3000`",
>   "expr" : "cast( ( ( if (isnotnull(`t_3000`)  ) then (`t_3000` )
> else (0 )  end  )  ) as FLOAT8 )"
> }, {
>   "ref" : "`h_3000`",
>   "expr" : "cast( ( ( if (isnotnull(`h_3000`)  ) then (`h_3000` )
> else (0 )  end  )  ) as BIGINT )"
> }, {
>   "ref" : "`b_2000`",
>   "expr" : "cast( ( ( if (isnotnull(`b_2000`)  ) then (`b_2000` )
> else (0 )  end  )

Re: Question on nested JSON behavior

2016-03-10 Thread Aditya
This depends on how you see the "batter" field.

If you see it as an array, the query should be run as "select id,
t.batters.batter[0].id"

But you can also view "batter" as a repeated map, which would mean "id" is
a child field of this map, repeated a number of times, and projecting
"batters.batter.id" should return this field for each repeated instance.

This is what MongoDB is doing.
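
As a concrete sketch against the sample file from this thread (path
illustrative), the two readings lead to different queries:

~~~
-- Array view: address one element explicitly
select id, t.batters.batter[0].id from dfs.`/tmp/sample.json` t;

-- Repeated-map view (the MongoDB-style reading): project the id of every
-- repeated instance, e.g. [1001, 1002, 1003, 1004] for the first row --
-- the behavior Drill does not currently implement correctly (see below)
select id, t.batters.batter.id from dfs.`/tmp/sample.json` t;
~~~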

On Thu, Mar 10, 2016 at 7:54 PM, Jiang Wu  wrote:

> For those who are curious, here is what Mongo returns for the same
> conceptual query:
>
> > db.t1.find({},{"batters":1, "_id":0});
> { "batters" : { "batter" : [ { "id" : "1001", "type" : "Regular" }, { "id"
> : "1002", "type" : "Chocolate" }, { "id" : "1003", "type" : "Blueberry" },
> { "id" : "1004", "type" : "Devil's Food" } ] } }
> { "batters" : { "batter" : [ { "id" : "2001", "type" : "Regular" } ] } }
> { "batters" : { "batter" : [ { "id" : "3001", "type" : "Regular" }, { "id"
> : "3002", "type" : "Chocolate" } ] } }
>
> > db.t1.find({},{"batters.batter":1, "_id":0});
> { "batters" : { "batter" : [ { "id" : "1001", "type" : "Regular" }, { "id"
> : "1002", "type" : "Chocolate" }, { "id" : "1003", "type" : "Blueberry" },
> { "id" : "1004", "type" : "Devil's Food" } ] } }
> { "batters" : { "batter" : [ { "id" : "2001", "type" : "Regular" } ] } }
> { "batters" : { "batter" : [ { "id" : "3001", "type" : "Regular" }, { "id"
> : "3002", "type" : "Chocolate" } ] } }
>
> > db.t1.find({},{"batters.batter.id":1, "_id":0});
> { "batters" : { "batter" : [ { "id" : "1001" }, { "id" : "1002" }, { "id"
> : "1003" }, { "id" : "1004" } ] } }
> { "batters" : { "batter" : [ { "id" : "2001" } ] } }
> { "batters" : { "batter" : [ { "id" : "3001" }, { "id" : "3002" } ] } }
>
> Each returned document is always a JSON document from the root.  Hence, no
> ambiguity in the last case.  It is returning the innermost array with each
> array element having a single map key.
>
> To produce the same conceptual results, all values in the innermost array
> will need to be returned for each row.
>
> -- Jiang
>
> -Original Message-
> From: Neeraja Rentachintala [mailto:nrentachint...@maprtech.com]
> Sent: Thursday, March 10, 2016 4:43 PM
> To: user@drill.apache.org
> Subject: Re: Question on nested JSON behavior
>
> Actually I agree with Jiang. The result does seem unintuitive. If it is a
> file with just a list, it does still make sense to return the ids in that
> list as an array unless the user has configured Drill to automatically
> flatten the first level.
> Does anyone know how the other systems behave for this use case? (for
> ex: Mongo)
>
>
>
> On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
> wrote:
>
> > Hi Jiang,
> >
> > Think of it this way: If you had a file that was just the list:
> >
> > {"id":"1001","type":"Regular"}
> > {"id":"1002","type":"Chocolate"}
> > {"id":"1003","type":"Blueberry"}
> > {"id":"1004","type":"Devil's Food"}
> >
> > What would you like it to return when you query:
> >
> > select id from dfs.`/path/to/sample_file.json`;
> >
> > ?
> >
> > When you enter the query that you're asking about, you're indicating
> > exactly that structure of data. Does this explanation make sense?
> >
> > Best,
> > Nathan
> >
> > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> wrote:
> > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > structure:
> > >
> > >   {  ...
> > > "batters":
> > >   {
> > > "batter":
> > >   [
> > > { "id": "1001", "type": "Regular" },
> > > { "id": "1002", "type": "Chocolate" },
> > > { "id": "1003", "type": "Blueberry" },
> > > { "id": "1004", "type": "Devil's Food" }
> > >   ]
> > >   },
> > > ...
> > >   }
> > >
> > > Now running a few sample queries against the above data:
> > >
> > >
> > > A)  select "batters" returns expected results, which are the values
> > of "batters" from each row.
> > >
> > > 0: jdbc:drill:zk=local> select batters from
> > > dfs.`c:\tmp\sample.json`;
> > > +-+
> > > | batters |
> > > +-+
> > > |
> > {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> > te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}]} |
> > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > |
> > {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> > te"}]}
> > |
> > > +-+
> > > 3 rows selected (0.243 seconds)
> > >
> > >
> > > B)  select "batters.batter" also returns the expected results,
> which
> > are the array values for "batters.batter" from each row.
> > >
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter from
> > dfs.`c:\tmp\sample.json` t;
> > > ++
> > > | EXPR$0 |
> > > ++
> > > |
> > [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id"
> > :"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}] |
> > > | [{"id":"1001","type":"Regular"}] |
> > > | [{"id":"1001","type":"Reg

RE: Question on nested JSON behavior

2016-03-10 Thread Jiang Wu
For those who are curious, here is what Mongo returns for the same
conceptual query:

> db.t1.find({},{"batters":1, "_id":0});
{ "batters" : { "batter" : [ { "id" : "1001", "type" : "Regular" }, { "id" : 
"1002", "type" : "Chocolate" }, { "id" : "1003", "type" : "Blueberry" }, { "id" 
: "1004", "type" : "Devil's Food" } ] } }
{ "batters" : { "batter" : [ { "id" : "2001", "type" : "Regular" } ] } }
{ "batters" : { "batter" : [ { "id" : "3001", "type" : "Regular" }, { "id" : 
"3002", "type" : "Chocolate" } ] } }

> db.t1.find({},{"batters.batter":1, "_id":0});
{ "batters" : { "batter" : [ { "id" : "1001", "type" : "Regular" }, { "id" : 
"1002", "type" : "Chocolate" }, { "id" : "1003", "type" : "Blueberry" }, { "id" 
: "1004", "type" : "Devil's Food" } ] } }
{ "batters" : { "batter" : [ { "id" : "2001", "type" : "Regular" } ] } }
{ "batters" : { "batter" : [ { "id" : "3001", "type" : "Regular" }, { "id" : 
"3002", "type" : "Chocolate" } ] } }

> db.t1.find({},{"batters.batter.id":1, "_id":0});
{ "batters" : { "batter" : [ { "id" : "1001" }, { "id" : "1002" }, { "id" : 
"1003" }, { "id" : "1004" } ] } }
{ "batters" : { "batter" : [ { "id" : "2001" } ] } }
{ "batters" : { "batter" : [ { "id" : "3001" }, { "id" : "3002" } ] } }

Each returned document is always a JSON document from the root.  Hence, no
ambiguity in the last case.  It is returning the innermost array with each
array element having a single map key.

To produce the same conceptual results, all values in the innermost array
will need to be returned for each row.

-- Jiang

-Original Message-
From: Neeraja Rentachintala [mailto:nrentachint...@maprtech.com] 
Sent: Thursday, March 10, 2016 4:43 PM
To: user@drill.apache.org
Subject: Re: Question on nested JSON behavior

Actually I agree with Jiang. The result does seem unintuitive. If it is a file 
with just a list, it does still make sense to return the ids in that list as an 
array unless the user has configured Drill to automatically flatten the first 
level.
Does anyone know how the other systems behave for this use case? (for
ex: Mongo)



On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
wrote:

> Hi Jiang,
>
> Think of it this way: If you had a file that was just the list:
>
> {"id":"1001","type":"Regular"}
> {"id":"1002","type":"Chocolate"}
> {"id":"1003","type":"Blueberry"}
> {"id":"1004","type":"Devil's Food"}
>
> What would you like it to return when you query:
>
> select id from dfs.`/path/to/sample_file.json`;
>
> ?
>
> When you enter the query that you're asking about, you're indicating 
> exactly that structure of data. Does this explanation make sense?
>
> Best,
> Nathan
>
> On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu  wrote:
> > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> structure:
> >
> >   {  ...
> > "batters":
> >   {
> > "batter":
> >   [
> > { "id": "1001", "type": "Regular" },
> > { "id": "1002", "type": "Chocolate" },
> > { "id": "1003", "type": "Blueberry" },
> > { "id": "1004", "type": "Devil's Food" }
> >   ]
> >   },
> > ...
> >   }
> >
> > Now running a few sample queries against the above data:
> >
> >
> > A)  select "batters" returns expected results, which are the values
> of "batters" from each row.
> >
> > 0: jdbc:drill:zk=local> select batters from 
> > dfs.`c:\tmp\sample.json`;
> > +-+
> > | batters |
> > +-+
> > |
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> Food"}]} |
> > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > |
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"}]}
> |
> > +-+
> > 3 rows selected (0.243 seconds)
> >
> >
> > B)  select "batters.batter" also returns the expected results, which
> are the array values for "batters.batter" from each row.
> >
> >
> > 0: jdbc:drill:zk=local> select t.batters.batter from
> dfs.`c:\tmp\sample.json` t;
> > ++
> > | EXPR$0 |
> > ++
> > |
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id"
> :"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> Food"}] |
> > | [{"id":"1001","type":"Regular"}] | 
> > | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}] 
> > | |
> > ++
> > 3 rows selected (0.198 seconds)
> >
> >
> > C)  select "batters.batter.id" returns something unexpected:
> >
> > 0: jdbc:drill:zk=local> select t.batters.batter.id from
> dfs.`c:\tmp\sample.json` t;
> > +-+
> > | EXPR$0  |
> > +-+
> > | 1001|
> > | 1002|
> > | 1003|
> > +-+
> >
> > The above result doesn't make sense.  The result looks like the 3 
> > values
> from row 1. Should the result be the following instead?
> >
> > +-+
> > | EXPR$0  |
> > +-+
> > | [1001, 1002, 1003, 1004]|
> > | [1001]|
> > | [1001, 1002]|
> > +-+
> >
>

Re: Question on nested JSON behavior

2016-03-10 Thread Steven Phillips
In the meantime, if you want to be able to query this data, there are
basically two ways to do this in Drill. One way is to give an index into
the array:

select id, t.batters.batter[0].id

This will only look at the first entry in each array.

The other way is to flatten the data:

select t.batter.id from (select flatten(batters.batter) as batter from
`sample.json`) t;
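
For reference, a variant that keeps the parent id alongside each flattened
element (a sketch against the sample data in this thread, untested):

~~~
select t1.id, t1.batter.id
from (select t.id, flatten(t.batters.batter) as batter
      from `sample.json` t) t1;
~~~

With the three sample rows above this should yield one output row per array
element: 0001 paired with 1001-1004, 0002 with 2001, and 0003 with 3001-3002.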

On Thu, Mar 10, 2016 at 6:25 PM, Steven Phillips  wrote:

> Yeah, it's definitely a bug. Could you please file a jira?
>
> On Thu, Mar 10, 2016 at 6:19 PM, Jiang Wu  wrote:
>
>> Here are the complete examples to illustrate the problem:
>>
>> 0: jdbc:drill:zk=local> select id, t.batters.batter from
>> dfs.`c:\tmp\sample.json` t;
>> +++
>> | id | EXPR$1 |
>> +++
>> | 0001 |
>> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
>> Food"}] |
>> | 0002 | [{"id":"2001","type":"Regular"}] |
>> | 0003 |
>> [{"id":"3001","type":"Regular"},{"id":"3002","type":"Chocolate"}] |
>> +++
>> 3 rows selected (0.174 seconds)
>>
>>
>> 0: jdbc:drill:zk=local> select id, t.batters.batter.id from
>> dfs.`c:\tmp\sample.json` t;
>> +---+-+
>> |  id   | EXPR$1  |
>> +---+-+
>> | 0001  | 1001| <-- OK
>> | 0002  | 1002| <-- Wrong, 1002 is not associated with 0002
>> | 0003  | 1003| <-- Wrong, 1003 is not associated with 0003
>> +---+-+
>> 3 rows selected (0.182 seconds)
>>
>> -- Jiang
>>
>>
>>
>> -Original Message-
>> From: Jiang Wu [mailto:jiang...@numerxdata.com]
>> Sent: Thursday, March 10, 2016 6:16 PM
>> To: user@drill.apache.org
>> Subject: RE: Question on nested JSON behavior
>>
>> One problem with the current behavior is that the results are not
>> correlated to the JSON row object.  It looks like the code simply picks the
>> first N values from the union of all values across all rows.  The N is the
>> number of rows in the result.
>>
>> For example, if I give this query:
>>
>> 0: jdbc:drill:zk=local> select id, t.batters.batter.id from
>> dfs.`c:\tmp\sample.json` t;
>> +---+-+
>> |  id   | EXPR$1  |
>> +---+-+
>> | 0001  | 1001|
>> | 0002  | 1002|
>> | 0003  | 1003|
>> +---+-+
>> 3 rows selected (0.182 seconds)
>>
>> The above cannot be correct because "1001", "1002", and "1003" never
>> appear in the row with id "0001".
>>
>> Returning an error will be better than returning the wrong answer.
>>
>> -- Jiang
>>
>> -Original Message-
>> From: Steven Phillips [mailto:ste...@dremio.com]
>> Sent: Thursday, March 10, 2016 5:41 PM
>> To: user 
>> Subject: Re: Question on nested JSON behavior
>>
>> I am surprised that you are getting that result. I would have expected
>> the query to fail. Since batter is an array, you should specify the index
>> of the array if you want to access lower-level elements.
>>
>> A way to access all of the sub-fields of a repeated map is something
>> we've discussed, but never implemented. Until it is implemented, I think
>> the correct behavior is for this query to fail.
>>
>> On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala <
>> nrentachint...@maprtech.com> wrote:
>>
>> > Actually I agree with Jiang. The result does seem unintuitive. If it
>> > is a file with just a list, it does still make sense to return the ids
>> > in that list as an array unless the user has configured Drill to
>> > automatically flatten the first level.
>> > Does anyone know how the other systems behave for this use case?
>> > (for
>> > ex: Mongo)
>> >
>> >
>> >
>> > On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith
>> > 
>> > wrote:
>> >
>> > > Hi Jiang,
>> > >
>> > > Think of it this way: If you had a file that was just the list:
>> > >
>> > > {"id":"1001","type":"Regular"}
>> > > {"id":"1002","type":"Chocolate"}
>> > > {"id":"1003","type":"Blueberry"}
>> > > {"id":"1004","type":"Devil's Food"}
>> > >
>> > > What would you like it to return when you query:
>> > >
>> > > select id from dfs.`/path/to/sample_file.json`;
>> > >
>> > > ?
>> > >
>> > > When you enter the query that you're asking about, you're indicating
>> > > exactly that structure of data. Does this explanation make sense?
>> > >
>> > > Best,
>> > > Nathan
>> > >
>> > > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
>> > wrote:
>> > > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
>> > > structure:
>> > > >
>> > > >   {  ...
>> > > > "batters":
>> > > >   {
>> > > > "batter":
>> > > >   [
>> > > > { "id": "1001", "type": "Regular" },
>> > > > { "id": "1002", "type": "Chocolate" },
>> > > > { "id": "1003", "type": "Blueberry" },
>> > > > { "id": "1004", "type": "Devil's Food" }
>> > > >   ]
>> > > >   },
>> > > > ...
>> > > >   }
>> > > >
>> > > > Now running a few sample queries against the above data:
>> > > >
>> > > >
>> > > > A)  select "batters" 

Re: Question on nested JSON behavior

2016-03-10 Thread Steven Phillips
Yeah, it's definitely a bug. Could you please file a jira?

On Thu, Mar 10, 2016 at 6:19 PM, Jiang Wu  wrote:

> Here are the complete examples to illustrate the problem:
>
> 0: jdbc:drill:zk=local> select id, t.batters.batter from
> dfs.`c:\tmp\sample.json` t;
> +++
> | id | EXPR$1 |
> +++
> | 0001 |
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> Food"}] |
> | 0002 | [{"id":"2001","type":"Regular"}] |
> | 0003 | [{"id":"3001","type":"Regular"},{"id":"3002","type":"Chocolate"}]
> |
> +++
> 3 rows selected (0.174 seconds)
>
>
> 0: jdbc:drill:zk=local> select id, t.batters.batter.id from
> dfs.`c:\tmp\sample.json` t;
> +---+-+
> |  id   | EXPR$1  |
> +---+-+
> | 0001  | 1001| <-- OK
> | 0002  | 1002| <-- Wrong, 1002 is not associated with 0002
> | 0003  | 1003| <-- Wrong, 1003 is not associated with 0003
> +---+-+
> 3 rows selected (0.182 seconds)
>
> -- Jiang
>
>
>
> -Original Message-
> From: Jiang Wu [mailto:jiang...@numerxdata.com]
> Sent: Thursday, March 10, 2016 6:16 PM
> To: user@drill.apache.org
> Subject: RE: Question on nested JSON behavior
>
> One problem with the current behavior is that the results are not
> correlated to the JSON row object.  It looks like the code simply picks the
> first N values from the union of all values across all rows.  The N is the
> number of rows in the result.
>
> For example, if I give this query:
>
> 0: jdbc:drill:zk=local> select id, t.batters.batter.id from
> dfs.`c:\tmp\sample.json` t;
> +---+-+
> |  id   | EXPR$1  |
> +---+-+
> | 0001  | 1001|
> | 0002  | 1002|
> | 0003  | 1003|
> +---+-+
> 3 rows selected (0.182 seconds)
>
> The above cannot be correct because "1001", "1002", and "1003" never
> appear in the row with id "0001".
>
> Returning an error will be better than returning the wrong answer.
>
> -- Jiang
>
> -Original Message-
> From: Steven Phillips [mailto:ste...@dremio.com]
> Sent: Thursday, March 10, 2016 5:41 PM
> To: user 
> Subject: Re: Question on nested JSON behavior
>
> I am surprised that you are getting that result. I would have expected the
> query to fail. Since batter is an array, you should specify the index of
> the array if you want to access lower-level elements.
>
> A way to access all of the sub-fields of a repeated map is something we've
> discussed, but never implemented. Until it is implemented, I think the
> correct behavior is for this query to fail.
>
> On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala <
> nrentachint...@maprtech.com> wrote:
>
> > Actually I agree with Jiang. The result does seem unintuitive. If it
> > is a file with just a list, it does still make sense to return the ids
> > in that list as an array unless the user has configured Drill to
> > automatically flatten the first level.
> > Does anyone know how the other systems behave for this use case?
> > (for
> > ex: Mongo)
> >
> >
> >
> > On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith
> > 
> > wrote:
> >
> > > Hi Jiang,
> > >
> > > Think of it this way: If you had a file that was just the list:
> > >
> > > {"id":"1001","type":"Regular"}
> > > {"id":"1002","type":"Chocolate"}
> > > {"id":"1003","type":"Blueberry"}
> > > {"id":"1004","type":"Devil's Food"}
> > >
> > > What would you like it to return when you query:
> > >
> > > select id from dfs.`/path/to/sample_file.json`;
> > >
> > > ?
> > >
> > > When you enter the query that you're asking about, you're indicating
> > > exactly that structure of data. Does this explanation make sense?
> > >
> > > Best,
> > > Nathan
> > >
> > > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> > wrote:
> > > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > > structure:
> > > >
> > > >   {  ...
> > > > "batters":
> > > >   {
> > > > "batter":
> > > >   [
> > > > { "id": "1001", "type": "Regular" },
> > > > { "id": "1002", "type": "Chocolate" },
> > > > { "id": "1003", "type": "Blueberry" },
> > > > { "id": "1004", "type": "Devil's Food" }
> > > >   ]
> > > >   },
> > > > ...
> > > >   }
> > > >
> > > > Now running a few sample queries against the above data:
> > > >
> > > >
> > > > A)  select "batters" returns expected results, which are the
> values
> > > of "batters" from each row.
> > > >
> > > > 0: jdbc:drill:zk=local> select batters from
> > > > dfs.`c:\tmp\sample.json`;
> > > > +-+
> > > > | batters |
> > > > +-+
> > > > |
> > >
> > {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> > te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > > Food"}]} |
> > > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > > |
> > >
> > {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> > te"}]}
> > 

RE: Question on nested JSON behavior

2016-03-10 Thread Jiang Wu
Here are the complete examples to illustrate the problem:

0: jdbc:drill:zk=local> select id, t.batters.batter from 
dfs.`c:\tmp\sample.json` t;
+++
| id | EXPR$1 |
+++
| 0001 | 
[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
 Food"}] |
| 0002 | [{"id":"2001","type":"Regular"}] |
| 0003 | [{"id":"3001","type":"Regular"},{"id":"3002","type":"Chocolate"}] |
+++
3 rows selected (0.174 seconds)


0: jdbc:drill:zk=local> select id, t.batters.batter.id from 
dfs.`c:\tmp\sample.json` t;
+---+-+
|  id   | EXPR$1  |
+---+-+
| 0001  | 1001| <-- OK
| 0002  | 1002| <-- Wrong, 1002 is not associated with 0002
| 0003  | 1003| <-- Wrong, 1003 is not associated with 0003
+---+-+
3 rows selected (0.182 seconds)

-- Jiang



-Original Message-
From: Jiang Wu [mailto:jiang...@numerxdata.com] 
Sent: Thursday, March 10, 2016 6:16 PM
To: user@drill.apache.org
Subject: RE: Question on nested JSON behavior

One problem with the current behavior is that the results are not correlated to 
the JSON row object.  It looks like the code simply picks the first N values 
from the union of all values across all rows.  The N is the number of rows in 
the result.

For example, if I give this query:

0: jdbc:drill:zk=local> select id, t.batters.batter.id from 
dfs.`c:\tmp\sample.json` t;
+---+-+
|  id   | EXPR$1  |
+---+-+
| 0001  | 1001|
| 0002  | 1002|
| 0003  | 1003|
+---+-+
3 rows selected (0.182 seconds)

The above cannot be correct because "1001", "1002", and "1003" never appear in
the row with id "0001".

Returning an error will be better than returning the wrong answer.

-- Jiang

-Original Message-
From: Steven Phillips [mailto:ste...@dremio.com]
Sent: Thursday, March 10, 2016 5:41 PM
To: user 
Subject: Re: Question on nested JSON behavior

I am surprised that you are getting that result. I would have expected the 
query to fail. Since batter is an array, you should specify the index of the 
array if you want to access lower-level elements.

A way to access all of the sub-fields of a repeated map is something we've 
discussed, but never implemented. Until it is implemented, I think the correct 
behavior is for this query to fail.

On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala < 
nrentachint...@maprtech.com> wrote:

> Actually I agree with Jiang. The result does seem unintuitive. If it 
> is a file with just a list, it does still make sense to return the ids 
> in that list as an array unless the user has configured Drill to 
> automatically flatten the first level.
> Does anyone know how the other systems behave for this use case?
> (for
> ex: Mongo)
>
>
>
> On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
> 
> wrote:
>
> > Hi Jiang,
> >
> > Think of it this way: If you had a file that was just the list:
> >
> > {"id":"1001","type":"Regular"}
> > {"id":"1002","type":"Chocolate"}
> > {"id":"1003","type":"Blueberry"}
> > {"id":"1004","type":"Devil's Food"}
> >
> > What would you like it to return when you query:
> >
> > select id from dfs.`/path/to/sample_file.json`;
> >
> > ?
> >
> > When you enter the query that you're asking about, you're indicating 
> > exactly that structure of data. Does this explanation make sense?
> >
> > Best,
> > Nathan
> >
> > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> wrote:
> > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > structure:
> > >
> > >   {  ...
> > > "batters":
> > >   {
> > > "batter":
> > >   [
> > > { "id": "1001", "type": "Regular" },
> > > { "id": "1002", "type": "Chocolate" },
> > > { "id": "1003", "type": "Blueberry" },
> > > { "id": "1004", "type": "Devil's Food" }
> > >   ]
> > >   },
> > > ...
> > >   }
> > >
> > > Now running a few sample queries against the above data:
> > >
> > >
> > > A)  select "batters" returns expected results, which are the values
> > of "batters" from each row.
> > >
> > > 0: jdbc:drill:zk=local> select batters from 
> > > dfs.`c:\tmp\sample.json`;
> > > +-+
> > > | batters |
> > > +-+
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}]} |
> > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"}]}
> > |
> > > +-+
> > > 3 rows selected (0.243 seconds)
> > >
> > >
> > > B)  select "batters.batter" also returns the expected results,
> which
> > are the array values for "batters.batter" from each row.
> > >
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter from
> > dfs.`c:\tmp\sample.json` t;
> > > ++
> > > | EXPR$0 |
> > > ++
> > > |
> >
> [

RE: Question on nested JSON behavior

2016-03-10 Thread Jiang Wu
One problem with the current behavior is that the results are not correlated to 
the JSON row object.  It looks like the code simply picks the first N values 
from the union of all values across all rows.  The N is the number of rows in 
the result.
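
Concretely, with the sample data: the union of all batter ids across the three
rows, in document order, is 1001, 1002, 1003, 1004, 2001, 3001, 3002; taking
the first N = 3 of those yields exactly the observed 1001, 1002, 1003.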

For example, if I give this query:

0: jdbc:drill:zk=local> select id, t.batters.batter.id from 
dfs.`c:\tmp\sample.json` t;
+---+-+
|  id   | EXPR$1  |
+---+-+
| 0001  | 1001|
| 0002  | 1002|
| 0003  | 1003|
+---+-+
3 rows selected (0.182 seconds)

The above cannot be correct because "1001", "1002", and "1003" never appear in
the row with id "0001".

Returning an error will be better than returning the wrong answer.

-- Jiang

-Original Message-
From: Steven Phillips [mailto:ste...@dremio.com] 
Sent: Thursday, March 10, 2016 5:41 PM
To: user 
Subject: Re: Question on nested JSON behavior

I am surprised that you are getting that result. I would have expected the 
query to fail. Since batter is an array, you should specify the index of the 
array if you want to access lower-level elements.

A way to access all of the sub-fields of a repeated map is something we've 
discussed, but never implemented. Until it is implemented, I think the correct 
behavior is for this query to fail.

On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala < 
nrentachint...@maprtech.com> wrote:

> Actually I agree with Jiang. The result does seem unintuitive. If it 
> is a file with just a list, it does still make sense to return the ids 
> in that list as an array unless the user has configured Drill to 
> automatically flatten the first level.
> Does anyone know how the other systems behave for this use case?
> (for
> ex: Mongo)
>
>
>
> On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
> 
> wrote:
>
> > Hi Jiang,
> >
> > Think of it this way: If you had a file that was just the list:
> >
> > {"id":"1001","type":"Regular"}
> > {"id":"1002","type":"Chocolate"}
> > {"id":"1003","type":"Blueberry"}
> > {"id":"1004","type":"Devil's Food"}
> >
> > What would you like it to return when you query:
> >
> > select id from dfs.`/path/to/sample_file.json`;
> >
> > ?
> >
> > When you enter the query that you're asking about, you're indicating 
> > exactly that structure of data. Does this explanation make sense?
> >
> > Best,
> > Nathan
> >
> > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> wrote:
> > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > structure:
> > >
> > >   {  ...
> > > "batters":
> > >   {
> > > "batter":
> > >   [
> > > { "id": "1001", "type": "Regular" },
> > > { "id": "1002", "type": "Chocolate" },
> > > { "id": "1003", "type": "Blueberry" },
> > > { "id": "1004", "type": "Devil's Food" }
> > >   ]
> > >   },
> > > ...
> > >   }
> > >
> > > Now running a few sample queries against the above data:
> > >
> > >
> > > A)  select "batters" returns expected results, which are the values
> > of "batters" from each row.
> > >
> > > 0: jdbc:drill:zk=local> select batters from 
> > > dfs.`c:\tmp\sample.json`;
> > > +-+
> > > | batters |
> > > +-+
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}]} |
> > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"}]}
> > |
> > > +-+
> > > 3 rows selected (0.243 seconds)
> > >
> > >
> > > B)  select "batters.batter" also returns the expected results,
> which
> > are the array values for "batters.batter" from each row.
> > >
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter from
> > dfs.`c:\tmp\sample.json` t;
> > > ++
> > > | EXPR$0 |
> > > ++
> > > |
> >
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id"
> :"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}] |
> > > | [{"id":"1001","type":"Regular"}] | 
> > > | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}
> > > | ] |
> > > ++
> > > 3 rows selected (0.198 seconds)
> > >
> > >
> > > C)  select "batters.batter.id" returns something unexpected:
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter.id from
> > dfs.`c:\tmp\sample.json` t;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 1001|
> > > | 1002|
> > > | 1003|
> > > +-+
> > >
> > > The above result doesn't make sense.  The result looks like the 3
> values
> > from row 1. Should the result be the following instead?
> > >
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | [1001, 1002, 1003, 1004]|
> > > | [1001]|
> > > | [1001, 1002]|
> > > +-+
> > >
> > > Any hints on what is happening here?  Thanks.
> > >
> > > -- Jiang
> > >
> >
>


RE: Question on nested JSON behavior

2016-03-10 Thread Jiang Wu
I will play around with MongoDB and report back the findings.  Thanks.

-- Jiang


-Original Message-
From: Steven Phillips [mailto:ste...@dremio.com] 
Sent: Thursday, March 10, 2016 5:41 PM
To: user 
Subject: Re: Question on nested JSON behavior

I am surprised that you are getting that result. I would have expected the 
query to fail. Since batter is an array, you should specify the index of the 
array if you want to access lower-level elements.

A way to access all of the sub-fields of a repeated map is something we've 
discussed, but never implemented. Until it is implemented, I think the correct 
behavior is for this query to fail.

On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala < 
nrentachint...@maprtech.com> wrote:

> Actually I agree with Jiang. The result does seem unintuitive. If it 
> is a file with just a list, it does still make sense to return the ids 
> in that list as an array unless the user has configured Drill to 
> automatically flatten the first level.
> Does anyone know how the other systems behave for this use case?
> (for
> ex: Mongo)
>
>
>
> On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
> 
> wrote:
>
> > Hi Jiang,
> >
> > Think of it this way: If you had a file that was just the list:
> >
> > {"id":"1001","type":"Regular"}
> > {"id":"1002","type":"Chocolate"}
> > {"id":"1003","type":"Blueberry"}
> > {"id":"1004","type":"Devil's Food"}
> >
> > What would you like it to return when you query:
> >
> > select id from dfs.`/path/to/sample_file.json`;
> >
> > ?
> >
> > When you enter the query that you're asking about, you're indicating 
> > exactly that structure of data. Does this explanation make sense?
> >
> > Best,
> > Nathan
> >
> > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> wrote:
> > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > structure:
> > >
> > >   {  ...
> > > "batters":
> > >   {
> > > "batter":
> > >   [
> > > { "id": "1001", "type": "Regular" },
> > > { "id": "1002", "type": "Chocolate" },
> > > { "id": "1003", "type": "Blueberry" },
> > > { "id": "1004", "type": "Devil's Food" }
> > >   ]
> > >   },
> > > ...
> > >   }
> > >
> > > Now running a few sample queries against the above data:
> > >
> > >
> > > A)  select "batters" returns expected results, which are the values
> > of "batters" from each row.
> > >
> > > 0: jdbc:drill:zk=local> select batters from 
> > > dfs.`c:\tmp\sample.json`;
> > > +-+
> > > | batters |
> > > +-+
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}]} |
> > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocola
> te"}]}
> > |
> > > +-+
> > > 3 rows selected (0.243 seconds)
> > >
> > >
> > > B)  select "batters.batter" also returns the expected results,
> which
> > are the array values for "batters.batter" from each row.
> > >
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter from
> > dfs.`c:\tmp\sample.json` t;
> > > ++
> > > | EXPR$0 |
> > > ++
> > > |
> >
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id"
> :"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}] |
> > > | [{"id":"1001","type":"Regular"}] | 
> > > | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}
> > > | ] |
> > > ++
> > > 3 rows selected (0.198 seconds)
> > >
> > >
> > > C)  select "batters.batter.id" returns something unexpected:
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter.id from
> > dfs.`c:\tmp\sample.json` t;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 1001|
> > > | 1002|
> > > | 1003|
> > > +-+
> > >
> > > The above result doesn't make sense.  The result looks like the 3
> values
> > from row 1. Should the result be the following instead?
> > >
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | [1001, 1002, 1003, 1004]|
> > > | [1001]|
> > > | [1001, 1002]|
> > > +-+
> > >
> > > Any hints on what is happening here?  Thanks.
> > >
> > > -- Jiang
> > >
> >
>


Re: Question on nested JSON behavior

2016-03-10 Thread Steven Phillips
I am surprised that you are getting that result. I would have expected the
query to fail. Since batter is an array, you should specify the index of
the array if you want to access lower-level elements.

A way to access all of the sub-fields of a repeated map is something we've
discussed, but never implemented. Until it is implemented, I think the
correct behavior is for this query to fail.

On Thu, Mar 10, 2016 at 4:42 PM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Actually I agree with Jiang. The result does seem unintuitive. If it is a
> file with just a list, it does still make sense to return the ids in that
> list as an array unless the user has configured Drill to automatically
> flatten the first level.
> Does anyone know how the other systems behave for this use case? (for
> ex: Mongo)
>
>
>
> On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
> wrote:
>
> > Hi Jiang,
> >
> > Think of it this way: If you had a file that was just the list:
> >
> > {"id":"1001","type":"Regular"}
> > {"id":"1002","type":"Chocolate"}
> > {"id":"1003","type":"Blueberry"}
> > {"id":"1004","type":"Devil's Food"}
> >
> > What would you like it to return when you query:
> >
> > select id from dfs.`/path/to/sample_file.json`;
> >
> > ?
> >
> > When you enter the query that you're asking about, you're indicating
> > exactly that structure of data. Does this explanation make sense?
> >
> > Best,
> > Nathan
> >
> > On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu 
> wrote:
> > > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> > structure:
> > >
> > >   {  ...
> > > "batters":
> > >   {
> > > "batter":
> > >   [
> > > { "id": "1001", "type": "Regular" },
> > > { "id": "1002", "type": "Chocolate" },
> > > { "id": "1003", "type": "Blueberry" },
> > > { "id": "1004", "type": "Devil's Food" }
> > >   ]
> > >   },
> > > ...
> > >   }
> > >
> > > Now running a few sample queries against the above data:
> > >
> > >
> > > A)  select "batters" returns expected results, which are the values
> > of "batters" from each row.
> > >
> > > 0: jdbc:drill:zk=local> select batters from dfs.`c:\tmp\sample.json`;
> > > +-+
> > > | batters |
> > > +-+
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}]} |
> > > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > > |
> >
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}
> > |
> > > +-+
> > > 3 rows selected (0.243 seconds)
> > >
> > >
> > > B)  select "batters.batter" also returns the expected results,
> which
> > are the array values for "batters.batter" from each row.
> > >
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter from
> > dfs.`c:\tmp\sample.json` t;
> > > ++
> > > | EXPR$0 |
> > > ++
> > > |
> >
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> > Food"}] |
> > > | [{"id":"1001","type":"Regular"}] |
> > > | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}] |
> > > ++
> > > 3 rows selected (0.198 seconds)
> > >
> > >
> > > C)  select "batters.batter.id" returns something unexpected:
> > >
> > > 0: jdbc:drill:zk=local> select t.batters.batter.id from
> > dfs.`c:\tmp\sample.json` t;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 1001|
> > > | 1002|
> > > | 1003|
> > > +-+
> > >
> > > The above result doesn't make sense.  The result looks like the 3
> values
> > from row 1. Should the result be the following instead?
> > >
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | [1001, 1002, 1003, 1004]|
> > > | [1001]|
> > > | [1001, 1002]|
> > > +-+
> > >
> > > Any hints on what is happening here?  Thanks.
> > >
> > > -- Jiang
> > >
> >
>


Re: Question on nested JSON behavior

2016-03-10 Thread Neeraja Rentachintala
Actually I agree with Jiang. The result does seem unintuitive. If it is a
file with just a list, it does still make sense to return the ids in that
list as an array unless the user has configured Drill to automatically
flatten the first level.
Does anyone know how the other systems behave for this use case? (for
ex: Mongo)



On Thu, Mar 10, 2016 at 4:21 PM, Nathan Griffith 
wrote:

> Hi Jiang,
>
> Think of it this way: If you had a file that was just the list:
>
> {"id":"1001","type":"Regular"}
> {"id":"1002","type":"Chocolate"}
> {"id":"1003","type":"Blueberry"}
> {"id":"1004","type":"Devil's Food"}
>
> What would you like it to return when you query:
>
> select id from dfs.`/path/to/sample_file.json`;
>
> ?
>
> When you enter the query that you're asking about, you're indicating
> exactly that structure of data. Does this explanation make sense?
>
> Best,
> Nathan
>
> On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu  wrote:
> > Drill version: 1.4.0.  Assuming 3 JSON objects with the following
> structure:
> >
> >   {  ...
> > "batters":
> >   {
> > "batter":
> >   [
> > { "id": "1001", "type": "Regular" },
> > { "id": "1002", "type": "Chocolate" },
> > { "id": "1003", "type": "Blueberry" },
> > { "id": "1004", "type": "Devil's Food" }
> >   ]
> >   },
> > ...
> >   }
> >
> > Now running a few sample queries against the above data:
> >
> >
> > A)  select "batters" returns expected results, which are the values
> of "batters" from each row.
> >
> > 0: jdbc:drill:zk=local> select batters from dfs.`c:\tmp\sample.json`;
> > +-+
> > | batters |
> > +-+
> > |
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> Food"}]} |
> > | {"batter":[{"id":"1001","type":"Regular"}]} |
> > |
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}
> |
> > +-+
> > 3 rows selected (0.243 seconds)
> >
> >
> > B)  select "batters.batter" also returns the expected results, which
> are the array values for "batters.batter" from each row.
> >
> >
> > 0: jdbc:drill:zk=local> select t.batters.batter from
> dfs.`c:\tmp\sample.json` t;
> > ++
> > | EXPR$0 |
> > ++
> > |
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
> Food"}] |
> > | [{"id":"1001","type":"Regular"}] |
> > | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}] |
> > ++
> > 3 rows selected (0.198 seconds)
> >
> >
> > C)  select "batters.batter.id" returns something unexpected:
> >
> > 0: jdbc:drill:zk=local> select t.batters.batter.id from
> dfs.`c:\tmp\sample.json` t;
> > +-+
> > | EXPR$0  |
> > +-+
> > | 1001|
> > | 1002|
> > | 1003|
> > +-+
> >
> > The above result doesn't make sense.  The result looks like the 3 values
> from row 1. Should the result be the following instead?
> >
> > +-+
> > | EXPR$0  |
> > +-+
> > | [1001, 1002, 1003, 1004]|
> > | [1001]|
> > | [1001, 1002]|
> > +-+
> >
> > Any hints on what is happening here?  Thanks.
> >
> > -- Jiang
> >
>


Re: Question on nested JSON behavior

2016-03-10 Thread Nathan Griffith
Hi Jiang,

Think of it this way: If you had a file that was just the list:

{"id":"1001","type":"Regular"}
{"id":"1002","type":"Chocolate"}
{"id":"1003","type":"Blueberry"}
{"id":"1004","type":"Devil's Food"}

What would you like it to return when you query:

select id from dfs.`/path/to/sample_file.json`;

?

When you enter the query that you're asking about, you're indicating
exactly that structure of data. Does this explanation make sense?

Best,
Nathan

On Thu, Mar 10, 2016 at 4:07 PM, Jiang Wu  wrote:
> Drill version: 1.4.0.  Assuming 3 JSON objects with the following structure:
>
>   {  ...
> "batters":
>   {
> "batter":
>   [
> { "id": "1001", "type": "Regular" },
> { "id": "1002", "type": "Chocolate" },
> { "id": "1003", "type": "Blueberry" },
> { "id": "1004", "type": "Devil's Food" }
>   ]
>   },
> ...
>   }
>
> Now running a few sample queries against the above data:
>
>
> A)  select "batters" returns expected results, which are the values of 
> "batters" from each row.
>
> 0: jdbc:drill:zk=local> select batters from dfs.`c:\tmp\sample.json`;
> +-+
> | batters |
> +-+
> | 
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
>  Food"}]} |
> | {"batter":[{"id":"1001","type":"Regular"}]} |
> | 
> {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]} |
> +-+
> 3 rows selected (0.243 seconds)
>
>
> B)  select "batters.batter" also returns the expected results, which are 
> the array values for "batters.batter" from each row.
>
>
> 0: jdbc:drill:zk=local> select t.batters.batter from dfs.`c:\tmp\sample.json` 
> t;
> ++
> | EXPR$0 |
> ++
> | 
> [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
>  Food"}] |
> | [{"id":"1001","type":"Regular"}] |
> | [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}] |
> ++
> 3 rows selected (0.198 seconds)
>
>
> C)  select "batters.batter.id" returns something unexpected:
>
> 0: jdbc:drill:zk=local> select t.batters.batter.id from 
> dfs.`c:\tmp\sample.json` t;
> +-+
> | EXPR$0  |
> +-+
> | 1001|
> | 1002|
> | 1003|
> +-+
>
> The above result doesn't make sense.  The result looks like the 3 values from 
> row 1. Should the result be the following instead?
>
> +-+
> | EXPR$0  |
> +-+
> | [1001, 1002, 1003, 1004]|
> | [1001]|
> | [1001, 1002]|
> +-+
>
> Any hints on what is happening here?  Thanks.
>
> -- Jiang
>


Question on nested JSON behavior

2016-03-10 Thread Jiang Wu
Drill version: 1.4.0.  Assuming 3 JSON objects with the following structure:

  {  ...
"batters":
  {
"batter":
  [
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
  ]
  },
...
  }

Now running a few sample queries against the above data:


A)  select "batters" returns expected results, which are the values of 
"batters" from each row.

0: jdbc:drill:zk=local> select batters from dfs.`c:\tmp\sample.json`;
+-+
| batters |
+-+
| 
{"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
 Food"}]} |
| {"batter":[{"id":"1001","type":"Regular"}]} |
| {"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]} |
+-+
3 rows selected (0.243 seconds)


B)  select "batters.batter" also returns the expected results, which are 
the array values for "batters.batter" from each row.


0: jdbc:drill:zk=local> select t.batters.batter from dfs.`c:\tmp\sample.json` t;
++
| EXPR$0 |
++
| 
[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"},{"id":"1003","type":"Blueberry"},{"id":"1004","type":"Devil's
 Food"}] |
| [{"id":"1001","type":"Regular"}] |
| [{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}] |
++
3 rows selected (0.198 seconds)


C)  select "batters.batter.id" returns something unexpected:

0: jdbc:drill:zk=local> select t.batters.batter.id from 
dfs.`c:\tmp\sample.json` t;
+-+
| EXPR$0  |
+-+
| 1001|
| 1002|
| 1003|
+-+

The above result doesn't make sense.  The result looks like the 3 values from 
row 1. Should the result be the following instead?

+-+
| EXPR$0  |
+-+
| [1001, 1002, 1003, 1004]|
| [1001]|
| [1001, 1002]|
+-+

Any hints on what is happening here?  Thanks.

-- Jiang



Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt

~~~
00-01  Project(date_tm=[CAST($23):TIMESTAMP(0)], 
id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE 
"ISO-8859-1$en_US$primary"], id_2=[CAST($15):VARCHAR(1) CHARACTER SET 
"ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"], 
id_3=[CAST($33):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE 
"ISO-8859-1$en_US$primary"], b_total=[CAST(CASE(IS NOT NULL($4), $4, 
0)):BIGINT], t_total=[CAST(CASE(IS NOT NULL($31), $31, 0)):DOUBLE], 
h_total=[CAST(CASE(IS NOT NULL($40), $40, 0)):BIGINT], 
b_small=[CAST(CASE(IS NOT NULL($36), $36, 0)):BIGINT], 
t_small=[CAST(CASE(IS NOT NULL($14), $14, 0)):DOUBLE], 
h_small=[CAST(CASE(IS NOT NULL($38), $38, 0)):BIGINT], 
b_18000=[CAST(CASE(IS NOT NULL($32), $32, 0)):BIGINT], 
t_18000=[CAST(CASE(IS NOT NULL($24), $24, 0)):DOUBLE], 
h_18000=[CAST(CASE(IS NOT NULL($27), $27, 0)):BIGINT], 
b_12000=[CAST(CASE(IS NOT NULL($30), $30, 0)):BIGINT], 
t_12000=[CAST(CASE(IS NOT NULL($28), $28, 0)):DOUBLE], 
h_12000=[CAST(CASE(IS NOT NULL($20), $20, 0)):BIGINT], 
b_6000=[CAST(CASE(IS NOT NULL($41), $41, 0)):BIGINT], 
t_6000=[CAST(CASE(IS NOT NULL($37), $37, 0)):DOUBLE], 
h_6000=[CAST(CASE(IS NOT NULL($29), $29, 0)):BIGINT], 
b_3000=[CAST(CASE(IS NOT NULL($17), $17, 0)):BIGINT], 
t_3000=[CAST(CASE(IS NOT NULL($7), $7, 0)):DOUBLE], h_3000=[CAST(CASE(IS 
NOT NULL($1), $1, 0)):BIGINT], b_2000=[CAST(CASE(IS NOT NULL($26), $26, 
0)):BIGINT], t_2000=[CAST(CASE(IS NOT NULL($34), $34, 0)):DOUBLE], 
h_2000=[CAST(CASE(IS NOT NULL($10), $10, 0)):BIGINT], 
b_1500=[CAST(CASE(IS NOT NULL($42), $42, 0)):BIGINT], 
t_1500=[CAST(CASE(IS NOT NULL($13), $13, 0)):DOUBLE], 
h_1500=[CAST(CASE(IS NOT NULL($3), $3, 0)):BIGINT], b_1250=[CAST(CASE(IS 
NOT NULL($21), $21, 0)):BIGINT], t_1250=[CAST(CASE(IS NOT NULL($25), 
$25, 0)):DOUBLE], h_1250=[CAST(CASE(IS NOT NULL($16), $16, 0)):BIGINT], 
b_1000=[CAST(CASE(IS NOT NULL($12), $12, 0)):BIGINT], 
t_1000=[CAST(CASE(IS NOT NULL($19), $19, 0)):DOUBLE], 
h_1000=[CAST(CASE(IS NOT NULL($6), $6, 0)):BIGINT], b_750=[CAST(CASE(IS 
NOT NULL($9), $9, 0)):BIGINT], t_750=[CAST(CASE(IS NOT NULL($0), $0, 
0)):DOUBLE], h_750=[CAST(CASE(IS NOT NULL($5), $5, 0)):BIGINT], 
b_500=[CAST(CASE(IS NOT NULL($2), $2, 0)):BIGINT], t_500=[CAST(CASE(IS 
NOT NULL($8), $8, 0)):DOUBLE], h_500=[CAST(CASE(IS NOT NULL($39), $39, 
0)):BIGINT], b_0=[CAST(CASE(IS NOT NULL($18), $18, 0)):BIGINT], 
t_0=[CAST(CASE(IS NOT NULL($35), $35, 0)):DOUBLE], EXPR$42=[CAST(CASE(IS 
NOT NULL($22), $22, 0)):BIGINT])
00-02Scan(groupscan=[EasyGroupScan 
[selectionRoot=hdfs://es05:54310/caspr/csv/smalltest.csv, numFiles=1, 
columns=[`date_tm`, `id_1`, `id_2`, `id_3`, `b_total`, `t_total`, 
`h_total`, `b_small`, `t_small`, `h_small`, `b_18000`, `t_18000`, 
`h_18000`, `b_12000`, `t_12000`, `h_12000`, `b_6000`, `t_6000`, 
`h_6000`, `b_3000`, `t_3000`, `h_3000`, `b_2000`, `t_2000`, `h_2000`, 
`b_1500`, `t_1500`, `h_1500`, `b_1250`, `t_1250`, `h_1250`, `b_1000`, 
`t_1000`, `h_1000`, `b_750`, `t_750`, `h_750`, `b_500`, `t_500`, 
`h_500`, `b_0`, `t_0`, `h_0`], 
files=[hdfs://es05:54310/caspr/csv/smalltest.csv]]])

~~~

~~~
{
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ {
  "kind" : "STRING",
  "type" : "SESSION",
  "name" : "store.format",
  "string_val" : "parquet"
}, {
  "kind" : "BOOLEAN",
  "type" : "SESSION",
  "name" : "exec.errors.verbose",
  "bool_val" : true
} ],
"queue" : 0,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "fs-scan",
"@id" : 2,
"userName" : "hduser",
"files" : [ "hdfs://es05:54310/caspr/csv/smalltest.csv" ],
"storage" : {
  "type" : "file",
  "enabled" : true,
  "connection" : "hdfs://es05:54310",
  "workspaces" : {
"root" : {
  "location" : "/",
  "writable" : true,
  "defaultInputFormat" : null
},
"tmp" : {
  "location" : "/tmp/",
  "writable" : true,
  "defaultInputFormat" : null
},
"caspr" : {
  "location" : "/caspr",
  "writable" : true,
  "defaultInputFormat" : "csv"
}
  "ref" : "`t_3000`",
  "expr" : "cast( ( ( if (isnotnull(`t_3000`)  ) then (`t_3000` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_3000`",
  "expr" : "cast( ( ( if (isnotnull(`h_3000`)  ) then (`h_3000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_2000`",
  "expr" : "cast( ( ( if (isnotnull(`b_2000`)  ) then (`b_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`t_2000`",
  "expr" : "cast( ( ( if (isnotnull(`t_2000`)  ) then (`t_2000` )  
else (0 )  end  )  ) as FLOAT8 )"

}, {
  "ref" : "`h_2000`",
  "expr" : "cast( ( ( if (isnotnull(`h_2000`)  ) then (`h_2000` )  
else (0 )  end  )  ) as BIGINT )"

}, {
  "ref" : "`b_1500`",
  "expr" : "cast( ( ( if (isnotnull(`b_1500`)  ) then (`b_1500` 

Re: NumberFormatException with cast to double?

2016-03-10 Thread Hsuan Yi Chu
From the log, I see this one:

StringFunctionHelpers.varCharToInt():

It seems like there might be an implicit casting (to integer) be inserted
into your expression.

Can you run EXPLAIN PLAN FOR "your query" (no double quotes needed)

and share what you see?
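
For example (query shape taken from earlier in the thread; path illustrative):

~~~
EXPLAIN PLAN FOR
SELECT CAST(COALESCE(t_total, 0) AS double) AS t_total
FROM dfs.`/caspr/csv/smalltest.csv`;
~~~

In the plan Matt posted, the Project row shows COALESCE rewritten as
CAST(CASE(IS NOT NULL(...), ..., 0)), which is where the integer default
enters the expression.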

On Thu, Mar 10, 2016 at 3:11 PM, Matt  wrote:

> TRIM() did not help, and I visually verified there is no whitespace around
> the number. The entire test data file:
>
> ~~~
>
> date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
> 2015-10-17
> 00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,
> ~~~
>
> Including a trim resulted in the same:
>
> ~~~
> Error: SYSTEM ERROR: NumberFormatException: 140.755
>
> Fragment 0:0
>
> [Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]
>
>   (java.lang.NumberFormatException) 140.755
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95
>
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120
> org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467
>
> org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> java.lang.Thread.run():745 (state=,code=0)
> ~~~
>
> On 10 Mar 2016, at 17:49, Jason Altekruse wrote:
>
>> Could you try throwing a trim() in before the cast? Might just be
>> whitespace.
>>
>> There shouldn't be a need to add the COALESCE statement, as we only ever
>> read defined values (which may be empty string) out of a CSV file. You
>> should instead write out a full case statement that checks for empty
>> string
>> and provides your default value of 0 in that case.
>>
>> - Jason
>>
>> Jason Altekruse
>> Software Engineer at Dremio
>> Apache Drill Committer
>>
>> On Thu, Mar 10, 2016 at 2:32 PM, Matt  wrote:
>>
>>> Have some CSV data that Drill 1.5 selects as-is without any problems,
>>> until I attempt to CAST columns in a CTAS or plain SELECT:
>>>
>>> Error: SYSTEM ERROR: NumberFormatException: 140.755
>>>
>>> The data is unquoted CSV, and the column in question does have the value
>>> "140.755" (unquoted). As the column can be empty, I am using the
>>> following transform:
>>>
>>>   CAST(COALESCE(t_total, 0) AS double) AS t_total
>>>
>>> And on the first data row:
>>>
>>> Fragment 1:0
>>>
>>> [Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]
>>>
>>>   (java.lang.NumberFormatException) 140.755
>>>
>>> What could be causing this error?
>>>
>>>
>>>


Re: NumberFormatException with cast to double?

2016-03-10 Thread Matt
TRIM() did not help, and I visually verified there is no whitespace 
around the number. The entire test data file:


~~~
date_tm,id_1,id_2,id_3,b_total,t_total,h_total,b_small,t_small,h_small,b_18000,t_18000,h_18000,b_12000,t_12000,h_12000,b_6000,t_6000,h_6000,b_3000,t_3000,h_3000,b_2000,t_2000,h_2000,b_1500,t_1500,h_1500,b_1250,t_1250,h_1250,b_1000,t_1000,h_1000,b_750,t_750,h_750,b_500,t_500,h_500,b_0,t_0,h_0
2015-10-17 
00:00:00,f5e9v8u2,err,mi1,268918254,140.755,23519,268918254,140.755,23519,

~~~

Including a trim resulted in the same:

~~~
Error: SYSTEM ERROR: NumberFormatException: 140.755

Fragment 0:0

[Error Id: 01d368ee-6c06-476c-a553-5eb5baea7c3f on es07:31010]

  (java.lang.NumberFormatException) 140.755
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI():95

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToInt():120

org.apache.drill.exec.test.generated.ProjectorGen122.doEval():467

org.apache.drill.exec.test.generated.ProjectorGen122.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():175

org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132

org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81

org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():415
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745 (state=,code=0)
~~~

On 10 Mar 2016, at 17:49, Jason Altekruse wrote:


Could you try throwing a trim() in before the cast? Might just be
whitespace.

There shouldn't be a need to add the COALESCE statement, as we only ever
read defined values (which may be empty string) out of a CSV file. You
should instead write out a full case statement that checks for empty string
and provides your default value of 0 in that case.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt  wrote:


Have some CSV data that Drill 1.5 selects as-is without any problems,
until I attempt to CAST columns in a CTAS or plain SELECT:

Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and the column in question does have the value
"140.755" (unquoted). As the column can be empty, I am using the following
transform:

  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?




Re: NumberFormatException with cast to double?

2016-03-10 Thread Jason Altekruse
Could you try throwing a trim() in before the cast? Might just be
whitespace.

There shouldn't be a need to add the COALESCE statement, as we only ever
read defined values (which may be empty string) out of a CSV file. You
should instead write out a full case statement that checks for empty string
and provides your default value of 0 in that case.
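
For instance, a minimal sketch (keeping both branches VARCHAR so no implicit
integer cast sneaks in; `t_total` is just the column from your query):

~~~
CAST(CASE WHEN t_total = '' THEN '0' ELSE t_total END AS double) AS t_total
~~~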

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Mar 10, 2016 at 2:32 PM, Matt  wrote:

> Have some CSV data that Drill 1.5 selects as-is without any problems,
> until I attempt to CAST columns in a CTAS or plain SELECT:
>
> Error: SYSTEM ERROR: NumberFormatException: 140.755
>
> The data is unquoted CSV, and the column in question does have the value
> "140.755" (unquoted). As the column can be empty, I am using the following
> transform:
>
>   CAST(COALESCE(t_total, 0) AS double) AS t_total
>
> And on the first data row:
>
> Fragment 1:0
>
> [Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]
>
>   (java.lang.NumberFormatException) 140.755
>
> What could be causing this error?
>
>


NumberFormatException with cast to double?

2016-03-10 Thread Matt
Have some CSV data that Drill 1.5 selects as-is without any problems, 
until I attempt to CAST columns in a CTAS or plain SELECT:


Error: SYSTEM ERROR: NumberFormatException: 140.755

The data is unquoted CSV, and the column in question does have the value 
"140.755" (unquoted). As the column can be empty, I am using the 
following transform:


  CAST(COALESCE(t_total, 0) AS double) AS t_total

And on the first data row:

Fragment 1:0

[Error Id: 4b4d83cf-a87d-451e-9cfa-4280e5adf64f on es08:31010]

  (java.lang.NumberFormatException) 140.755

What could be causing this error?



Re: concurrency questions

2016-03-10 Thread John Omernik
Yadid - Some answers (may not be complete):


From what I understand, Drill enables concurrency by queuing requests. If
we are performing many reads, will writes to the same file be queued until
completion of the reads? This potentially could create a bottleneck.
- There are concurrency options, and I would suggest starting here to learn
more about the options available:
https://drill.apache.org/docs/configuring-a-multitenant-cluster-introduction/.
The reads and writes will likely be less of a contention point in Drill, and
more so in HDFS. (I could be wrong here, but a CTAS (insert) is just a query
that will be queued.)
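
For example, the query queue is controlled with system options along these
lines (option names from the docs above; the values here are illustrative):

~~~
ALTER SYSTEM SET `exec.queue.enable` = true;
ALTER SYSTEM SET `exec.queue.small` = 6;  -- max concurrent "small" queries
ALTER SYSTEM SET `exec.queue.large` = 2;  -- max concurrent "large" queries
~~~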

How does Drill manage Parquet file partitioning when using CTAS? Can we
control horizontal / vertical partitioning in some way by configuring the
drillbit?
- There are multiple options. One thing I use is directory-based
partitioning, similar to Apache Hive.  Basically, if you have a
directory, you can use that as the directory you query, and then use
subdirectories exposed by the variable "dir0"
So:

table
--- 2015-01-01
--- 2015-01-02
--- 2015-01-03

all the dates there are just directories under "table", so I could do select
* from table where dir0 >= '2015-01-02' and it would exclude the first
directory.

This is nice: you can create views of the table to make "dir0" into a
field name like "part_date" (if you are using the 2015-01-01 format for dates,
fun tip: use to_date(dir0, 'yyyy-MM-dd') as part_date in your view to get
better performance)
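
A quick sketch of that kind of view (workspace, view name, and field list
are illustrative):

~~~
CREATE VIEW dfs.tmp.table_v AS
SELECT to_date(dir0, 'yyyy-MM-dd') AS part_date, field1, field2
FROM dfs.`/data/table`;
~~~
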
This also plays nicely with CTAS: while you look at the table as a table
with subdirectories, you can do a CTAS into an individual partition: CREATE
TABLE `table/2015-01-04` AS SELECT field1, field2 FROM src_data.  That way
it loads the individual partition.

Now, one challenge may be data that is being written while being queried.
One hack I've used is to CREATE TABLE `table/.2015-01-04` (note the dot
before the date).  This puts it into a hidden directory, so while writing,
the partition will not be read by other queries. Once it completes
successfully, I then just do a mv .2015-01-04 2015-01-04, which is instant.
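
Roughly like this (paths and workspace are illustrative; the rename happens
outside Drill):

~~~
CREATE TABLE dfs.`table/.2015-01-04` AS SELECT field1, field2 FROM src_data;
-- once the CTAS completes:
--   hadoop fs -mv /data/table/.2015-01-04 /data/table/2015-01-04
~~~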

There was some talk about a JIRA for a similar feature. Not sure if I did
that yet :)

Any alternative suggestions to the approach above? In terms of read
performance, would this result in better performance (for columnar-type
data) than using something like HBase?
- Without sounding too dismissive of the question, it really depends on your
data.  HBase is a nice option for certain data when Parquet seems to not
fit the bill. That said, I think with good partitioning, and with other
improvements coming, Parquet can do quite well.

John



On Thu, Mar 10, 2016 at 12:26 PM, Yadid Ayzenberg 
wrote:

> Hi All,
>
> We are considering using Drill to access data for large scale analytics on
> top of Parquet files stored on HDFS.
> We would like to add data to this data-set in real-time, as it arrives
> into our system. One proposed solution was to use Drill to perform both the
> inserts and the selects on our data set.
>
> Some questions that arose:
>
> From what I understand, Drill enables concurrency by queuing requests. If
> we are performing many reads, will writes to the same file be queued until
> completion of the reads? This potentially could create a bottleneck.
> How does Drill manage Parquet file partitioning when using CTAS? Can we
> control horizontal / vertical partitioning in some way by configuring the
> drillbit?
> Any alternative suggestions to the approach above? In terms of read
> performance, would this result in better performance (for columnar-type
> data) than using something like HBase?
> Thanks,
>
> Yadid
>
>
>


Re: Drill with String Aggregation

2016-03-10 Thread Bosung Seo
Thank you for your suggestions.

UDF sounds pretty cool.

Thanks,
Bo

On Wed, Mar 9, 2016 at 5:30 AM, Abdel Hakim Deneche 
wrote:

> A User Defined Function doesn't exclude the possibility of including the
> function in Drill's code base, if the user chooses to contribute the
> function back as a patch or a pull request.
>
> We generally suggest the creation of a UDF because it can be much faster
> than opening a JIRA and waiting for a community member to implement the
> function. You are right, though, that not all users will have the
> capabilities or the time to do it.
>
> On Wed, Mar 9, 2016 at 2:11 PM, John Omernik  wrote:
>
> > Abdel -
> >
> > Is there a process or guideline for determining what functions should be
> > in Drill itself vs. developed by users? I.e., how is it determined
> > whether a function should exist in the main Drill project or be user
> > developed? Does the Apache foundation have a process for "add-on"
> > projects? I think from a community perspective, the challenge with the
> > "you can always develop..." response (and please, I am not calling you
> > out directly, I've seen this response quite a bit) is that:
> >
> > A. It can be discouraging for non-developer users of Drill (I'd argue
> > that less than 25% of Drill's target user base are Java developers) to
> > hear this response, leading some to halt their PoC or move on to other
> > projects that may support what they are looking for.
> > B. It will cause a lot of duplicate work (Company A develops a string
> > aggregate function, Company B develops a string aggregate function, User
> > C develops string aggregate functions, etc.) to be done outside of the
> > community, wasting resources as companies need to develop these functions
> > over and over again, and also keeping some users of Drill in the dark
> > about potential functions they could use. (I, for one, expand my
> > knowledge of data by reviewing functions and what they do.)
> >
> > So, to my question about add-on projects: could a sub-project under the
> > Drill project be run where functions can be talked about at a community
> > level? Releasing code that can be added to Drill: high quality functions
> > that operate well and "put back" into the community? From an Apache
> > standpoint would this be a separate project? I see so many wonderful ways
> > that this could be a huge user base driver, excitement driver, and
> > functionality driver for the Drill project. Wish lists, challenges, and
> > new functionality could be developed openly, and maybe (depending on the
> > answer to my first question) some of these could even make it into the
> > Drill project itself. Almost like a developer preview of self-contained
> > functions.
> >
> > What are the community thoughts on something like this?
> >
> > John
> >
> > On Tue, Mar 8, 2016 at 11:43 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > You can always develop a User Defined Aggregate Function:
> > >
> > > http://drill.apache.org/docs/develop-custom-functions/
> > >
> > > Thanks
> > >
> > > On Wed, Mar 9, 2016 at 12:29 AM, Bosung Seo 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I found that Drill doesn't support the string_agg function yet.
> > > > Is there another way to get the same result as the string_agg function?
> > > >
> > > > If I have a table,
> > > > user   | id
> > > > user1 | 1
> > > > user1 | 2
> > > > user1 | 3
> > > > user2 | 1
> > > >
> > > > I want to make like this.
> > > > user   | ids
> > > > user1 | 1,2,3
> > > > user2 | 1
> > > >
> > > > Any help would be appreciated.
> > > >
> > > > Thanks,
> > > > Bo
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Abdelhakim Deneche
> > >
> > > Software Engineer
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
>



-- 

Bosung Seo | Software Engineer

Cell: 949.742.2826
Email: bos...@brightcloud.com
Web: www.webroot.com/brightcloud

Follow us on Twitter
Subscribe to the Webroot Threat Blog


concurrency questions

2016-03-10 Thread Yadid Ayzenberg
Hi All,

We are considering using Drill to access data for large scale analytics on top 
of Parquet files stored on HDFS.
We would like to add data to this data-set in real-time, as it arrives into our 
system. One proposed solution was to use Drill to perform both the inserts and 
the selects on our data set.

Some questions that arose:

From what I understand, Drill enables concurrency by queuing requests. If we 
are performing many reads, will writes to the same file be queued until 
completion of the reads? This potentially could create a bottleneck.
How does Drill manage Parquet file partitioning when using CTAS? Can we 
control horizontal / vertical partitioning in some way by configuring the 
drillbit?
Any alternative suggestions to the approach above? In terms of read 
performance, would this result in better performance (for columnar-type data) 
than using something like HBase?
Thanks,

Yadid




Re: S3 query takes a long time to start

2016-03-10 Thread Oscar Morante
I've been checking the logs, and I think that the problem is that it's 
walking through the "directories" in S3 recursively, doing lots of small 
HTTP requests.


My files are organized like this, which amplifies the issue:

   /category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different 
partition/shard for each put [1].  But it looks like this structure is 
clashing with the way Drill/hadoop.fs.s3a gets the list of files.


I think that it should be possible to get the complete list of files 
under a given "directory" (e.g. `/category`) by doing just one HTTP query, 
but I don't know how hard it would be to incorporate that behavior.


Any ideas?  How are you organizing your S3 files to get good 
performance?


Thanks!

[1]: 
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html


On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:
I'm querying 20G of gzipped JSONs split in ~5600 small files with 
sizes ranging from 1 MB to 30 MB.  Drill is running in AWS on 4 m4.xlarge 
nodes and it's taking around 50 minutes before the query starts 
executing.


Any idea what could be causing this delay?  What's the best way to 
debug this?


Thanks,


--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Cast function for Oracle

2016-03-10 Thread Chris Atkinson
I'm trying to join two VARCHAR2 columns of differing length.  The generated
SQL casts the shorter to match the longer.

create table myschema.a_table(
  description varchar2(10)
);

create table myschema.b_table(
  description20 varchar2(20)
);

When the join is attempted

0: jdbc:drill:zk=local> select *
. . . . . . . . . . . >   from utd_utpdba.UTPDBA.A_TABLE
. . . . . . . . . . . >  inner join utd_utpdba.UTPDBA.B_TABLE
. . . . . . . . . . . > on A_TABLE.DESCRIPTION = B_TABLE.DESCRIPTION20;

The following CAST pattern is not acceptable to Oracle (11.2)

Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup
the SQL query.

sql SELECT *
FROM (SELECT "DESCRIPTION", CAST("DESCRIPTION" AS VARCHAR(20) CHARACTER SET
"ISO-8859-1") "$f2"
FROM "UTPDBA"."A_TABLE") "t"
INNER JOIN "UTPDBA"."B_TABLE" ON "t"."$f2" = "B_TABLE"."DESCRIPTION20"
plugin utd_utpdba
Fragment 0:0

[Error Id: 2f9a1975-fa57-4b07-8642-4dca3d03ae39 on x.x.x.x:31010]

  (java.sql.SQLSyntaxErrorException) ORA-00907: missing right parenthesis

oracle.jdbc.driver.T4CTTIoer.processError():450

java.lang.Thread.run():745 (state=,code=0)


Specifically, CHARACTER SET "ISO-8859-1" is not allowed in the cast
statement:

CAST({ expr | MULTISET (subquery) } AS type_name)

https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions016.htm
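
That is, just contrasting the two shapes from above, Oracle accepts the
plain form but rejects the one Drill generates:

~~~
-- accepted by Oracle:
CAST("DESCRIPTION" AS VARCHAR2(20))
-- rejected with ORA-00907 (the form Drill pushes down):
CAST("DESCRIPTION" AS VARCHAR(20) CHARACTER SET "ISO-8859-1")
~~~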


Cheers!



S3 query takes a long time to start

2016-03-10 Thread Oscar Morante
I'm querying 20G of gzipped JSONs split in ~5600 small files with sizes 
ranging from 1 MB to 30 MB.  Drill is running in AWS on 4 m4.xlarge nodes 
and it's taking around 50 minutes before the query starts executing.


Any idea what could be causing this delay?  What's the best way to debug 
this?


Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Time intervals overflowing?

2016-03-10 Thread Oscar Morante
I was trying to generate a sequence of timestamps from a sequence of 
numbers, and ran into this:


   select
 596 * interval '1' hour,
 597 * interval '1' hour
   from (values 1);

   EXPR$0     | EXPR$1
   -----------|-----------------
   PT2145600S | PT-2145767.296S

Notice that the second expression turns into a negative interval.  Is 
that the expected behavior?
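
For what it's worth, the broken value lines up exactly with a signed 32-bit
millisecond counter wrapping around (my guess at the mechanism):

~~~
596 h = 2,145,600,000 ms  <  2^31 - 1 = 2,147,483,647 ms   (fits)
597 h = 2,149,200,000 ms  >  2^31 - 1                      (overflows)
2,149,200,000 - 2^32 = -2,145,767,296 ms = PT-2145767.296S
~~~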


Cheers,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.

