Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Jinfeng Ni
Parth, You are right. If we put t.others.additional in select list, in addition to t.others, then the output is wrong. The JSON file I used has 2 rows: {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}} {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes",

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Parth Chandra
Given the sample rows that Stefan provided, the query - select `some`, t.others, t.others.additional from `test.json` t; does produce incorrect results - *| *yes * | *{"additional":"last entries only"} * | *last entries only * |* instead of *| *yes * | *{"other":"true","all":"false"

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi, I can provide you with a JSON file and statements to reproduce it if you wish. Thank you for looking into this. Regards, -Stefan On Jul 23, 2015 9:03 PM, "Jinfeng Ni" wrote: > Hi Stefán, > > Thanks a lot for bringing up this issue, which is really helpful to improve > Drill. > > I tried to

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Jinfeng Ni
Hi Stefán, Thanks a lot for bringing up this issue, which is really helpful to improve Drill. I tried to reproduce the incorrect results, and I could reproduce the missing data issue of CTAS parquet, but I could not reproduce the missing data issue if I query the JSON file directly. Here is ho

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Thank you. On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning wrote: > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter > wrote: > > > Someone must review the underlying optimization errors to prevent this > from > > happening to others. > > > > Jinfeng and Parth are examining this issue to try to co

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Ted Dunning
On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter wrote: > Someone must review the underlying optimization errors to prevent this from > happening to others. > Jinfeng and Parth are examining this issue to try to come to a deeper understanding. Not surprisingly, they are a little quiet as they do

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Abdel Hakim Deneche
I don't think Drill is supposed to "ignore" data. My understanding is that the reader will read the new fields which will cause a schema change, and depending on the query (if all operators involved can handle the schema change or not) the query should either succeed or fail. My understanding is th

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi, The only right answer to this question must be to a) "adapt to additional information" and b) "try the hardest to accommodate changes". The current behavior must be seen as completely worthless (sorry for the strong language). Regards, -Stefan On Thu, Jul 23, 2015 at 4:16 PM, Matt wrote:

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Matt
On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote: When you try to read schema-less data, Drill will first investigate the 1000 rows to figure out a schema for your data, then it will use this schema for the remainder of the query. To clarify, if the JSON schema changes on the 1001st 1MMth

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi Abdel, Thank you for taking the time to respond. I know my frustration is leaking through but that does not mean I don't appreciate everything you and the Drill team are doing, I do. I also understand the premise of the optimization but I find it too restrictive and it certainly does not fit our d

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Abdel Hakim Deneche
Hi Stefan, Sorry to hear about your misadventure in Drill land. I will try to give you some more information, but I also have limited knowledge for this specific case and other developers will probably jump in to correct me. When you try to read schema-less data, Drill will first investigate the

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi, The workaround for this was to edit the first line in the JSON file and fake a value for the "additional" field. That way the optimizer could not decide to ignore it. Someone must review the underlying optimization errors to prevent this from happening to others. JSON data, which is unstruct
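
The workaround described above (faking a value for "additional" in the first line of the file) could be automated along these lines. This is only a sketch, not something posted in the thread: it assumes the input is newline-delimited JSON, the helper names `pad_first_record` and `_fill_missing` are hypothetical, and the choice of an empty placeholder of the same type (e.g. `""` for strings) is an assumption about what is sufficient for the field to be picked up.

```python
import json


def _fill_missing(target, others):
    """Add to `target` every key seen in `others`, recursing into sub-objects."""
    for rec in others:
        for key, value in rec.items():
            if isinstance(value, dict):
                sub = target.setdefault(key, {})
                if isinstance(sub, dict):
                    _fill_missing(sub, [value])
            else:
                # Assumption: an empty value of the same type ("" for str,
                # 0 for int, ...) is an acceptable fake placeholder.
                target.setdefault(key, type(value)())


def pad_first_record(lines):
    """Parse newline-delimited JSON and pad the first record with all later keys."""
    records = [json.loads(line) for line in lines if line.strip()]
    if records:
        _fill_missing(records[0], records[1:])
    return records


if __name__ == "__main__":
    sample = [
        '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}',
        '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes",'
        '"additional":"last entries only"}}',
    ]
    for rec in pad_first_record(sample):
        print(json.dumps(rec))
```

Existing values in the first record are never overwritten; only missing keys are added. Note that this changes the data in the first record, so it is a stopgap rather than a fix for the underlying behavior.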

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-22 Thread Stefán Baxter
In addition to this, selecting: select some, t.others, t.others.additional from dfs.tmp.`/test.json` as t; - returns this: "yes", {"additional":"last entries only"}, "last entries only", finding the previously missing value but then ignoring all the other values of the sub structure. - Stefan On

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-22 Thread Stefán Baxter
- never returns this: "yes", {"other":"true","all":" false","sometimes":"yes"} should have been: - never returns this: "yes", {"other":"true","all":" false","sometimes":"yes", "additional":"last entries only"} Regards, -Stefan On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter wrote: > Hi, > >

Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-22 Thread Stefán Baxter
Hi, I keep coming across *quirks* in Drill that are quite time-consuming to deal with and are now causing mounting concerns. This last one though is far more serious than the previous ones because it deals with loss of data. I'm working with a small(ish) dataset of around 1m records (which I'm m