Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Parth Chandra Thu, 23 Jul 2015 14:37:09 -0700

Given the sample rows that Stefan provided, the query

    select `some`, t.others, t.others.additional from `test.json` t;

does produce incorrect results:


    | yes | {"additional":"last entries only"} | last entries only |

instead of

    | yes | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"} | last entries only |
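
To make the expected result concrete, here is a minimal sketch in Python (not Drill code) of what projecting `some`, `others` and `others.additional` should give for the two row shapes in this thread. The helper name `project` is made up for illustration, and mapping a missing nested field to `None` is an assumption standing in for a SQL NULL.

```python
import json

# Two sample rows in the shapes described in this thread.
rows = [
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}',
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes",'
    '"additional":"last entries only"}}',
]


def project(line):
    """Return (some, others, others.additional) for one JSON record.

    Hypothetical helper: a missing nested field becomes None, standing in
    for NULL. The full `others` map is returned untouched.
    """
    record = json.loads(line)
    others = record.get("others") or {}
    return record.get("some"), others, others.get("additional")


for line in rows:
    some, others, additional = project(line)
    # Compact separators keep the map in the same form as the source file.
    print(some, json.dumps(others, separators=(",", ":")), additional)
```

The point of the sketch is only that selecting `t.others.additional` next to `t.others` should not change what `t.others` contains: the second row keeps all four keys.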

Jinfeng, your item #4 is also an issue.


I'll log JIRAs for these.

Stefan, thank you for helping us out with catching these bugs. Your efforts
are really appreciated.

Parth





On Thu, Jul 23, 2015 at 2:19 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> hi,
>
> I can provide you with a JSON file and statements to reproduce it if you wish.
>
> thank you for looking into this.
>
> regards,
>   -Stefan
> On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <jinfengn...@gmail.com> wrote:
>
> > Hi Stefán,
> >
> > Thanks a lot for bringing up this issue, which is really helpful to
> > improve Drill.
> >
> > I tried to reproduce the incorrect results. I could reproduce the
> > missing data issue with CTAS to Parquet, but I could not reproduce the
> > missing data issue when querying the JSON file directly.
> >
> > Here is how I tried:
> >
> > 1. with dfs.tmp.`test.json`
> >   800k of
> >   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> >   100k of
> >   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
> >
> > 2.  SELECT * from dfs.tmp.`test.json`;
> > I put the output of the query into a file. Here is part of the result,
> > shown in vim editor
> >
> > 824000 +------+------------------------------------------------------------------------------------+
> > 824001 | some |                                       others                                       |
> > 824002 +------+------------------------------------------------------------------------------------+
> > 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}                                   |
> > 824004 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
> > 824005 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
> >
> > The leftmost number is the line number from the vim editor.  The first
> > 824003 lines have rows without the "additional" field, while beyond that
> > each row contains the "additional" field.  The line number 824003 (not
> > 800000) comes from the fact that Drill's SqlLine adds the column names as
> > a header every hundred rows or so (?).
> >
> > 3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
> >
> > Same result as above.
> >
> > 4.  USE dfs.tmp;
> >      CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
> >      SELECT * from dfs.tmp.testparquet;
> >
> > This one returns rows with data missing from the generated Parquet file.
> >
> >
> >  82400 +------+---------------------------------------------------+
> >  82401 | some |                      others                       |
> >  82402 +------+---------------------------------------------------+
> >  82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >  82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >  82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >
> >
> > So, it looks like there is a bug in the Parquet writer operator: it does
> > not output the additional field into the Parquet files, while the query
> > against the JSON seems to return the correct result.
> >
> > I just want to confirm whether you see similar behavior on your side.
> >
> > Thanks again!
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Thank you.
> > >
> > >
> > >
> > > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <ste...@activitystream.com>
> > > > wrote:
> > > >
> > > > > Someone must review the underlying optimization errors to prevent
> > > > > this from happening to others.
> > > > >
> > > >
> > > > Jinfeng and Parth are examining this issue to try to come to a deeper
> > > > understanding.  Not surprisingly, they are a little quiet as they do
> > > > this.
> > > >
> > > >
> > > > > JSON data, which is unstructured/schema-free in its nature, cannot
> > > > > be treated as consistent, predictable or monolithic.
> > > > >
> > > >
> > > > Indeed.  And Drill's vision is based on *exactly* this thought. Right
> > > > now, Drill is still new and does not fulfill all aspects of the vision,
> > > > but we are making progress rapidly.
> > > >
> > > > Your contributions and comments have been very helpful, btw.
> > > >
> > >
> >
>
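
The data set described in step 1 of the quoted message (800k rows without `others.additional`, followed by 100k rows with it) can be regenerated with a short script. This is a minimal sketch in Python, not something posted in the thread: the function names `make_rows`, `write_file` and `count_additional` are made up for illustration, and it assumes the file is plain newline-delimited JSON.

```python
import json

# Row shape shared by all records in the test file from the thread.
BASE = {"some": "yes",
        "others": {"other": "true", "all": "false", "sometimes": "yes"}}


def make_rows(plain=800_000, extended=100_000):
    """Yield `plain` rows without others.additional, then `extended` rows with it."""
    line = json.dumps(BASE, separators=(",", ":"))
    for _ in range(plain):
        yield line
    # Same record, with the extra key appended to the nested map.
    with_extra = {"some": "yes",
                  "others": dict(BASE["others"], additional="last entries only")}
    line = json.dumps(with_extra, separators=(",", ":"))
    for _ in range(extended):
        yield line


def write_file(path, plain=800_000, extended=100_000):
    """Write the rows to `path`, one JSON record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in make_rows(plain, extended):
            handle.write(line + "\n")


def count_additional(lines):
    """Count records whose `others` map carries the `additional` key."""
    return sum(1 for line in lines
               if "additional" in json.loads(line).get("others", {}))
```

Calling `write_file("test.json")` and copying the result to wherever the `dfs.tmp` workspace points (an assumption about the local setup) should give input for the queries in steps 2 to 4. `count_additional` run over the source file gives the number of rows a correct query or CTAS result should show with the extra field, which is 100000 for the default sizes.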
