Hi Stefan,

Sorry to hear about your misadventure in Drill land. I will try to give you
some more information, but I also have limited knowledge of this specific
case, so other developers will probably jump in to correct me.

When you try to read schema-less data, Drill will first inspect the first
1000 rows to figure out a schema for your data, then it will use this
schema for the remainder of the query. This explains why your workaround
works: Drill was able to "figure out" that extra field, and will fill it
with null whenever it is missing.
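Roughly speaking, a prefix-based schema scan behaves like the sketch below (illustrative Python only, not Drill's actual reader code; the 1000-row sample size is just the figure mentioned above):

```python
def infer_schema(rows, sample_size=1000):
    """Collect column names by scanning only the first sample_size rows,
    mimicking (in spirit) a reader that fixes its schema from a prefix."""
    schema = set()
    for row in rows[:sample_size]:
        schema.update(row.keys())
    return schema

def project(rows, schema):
    """Project every row onto the inferred schema: fields outside the
    schema are silently dropped, missing fields become None."""
    return [{col: row.get(col) for col in sorted(schema)} for row in rows]

# A field that first appears after the sampled prefix is lost entirely.
rows = [{"some": "yes"}] * 1500 + [{"some": "yes", "additional": "late"}] * 5
schema = infer_schema(rows)
result = project(rows, schema)
assert "additional" not in schema
assert result[-1] == {"some": "yes"}
```

This also shows why faking the field on the first line works: once it is in the sampled prefix, it is part of the schema for every row.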

To my knowledge, when the data contains a "schema change" (like in your
case) you generally get an error message stating that "operator X doesn't
support schema changes", so I am not sure why you are getting incorrect
results in this case.

You should definitely file a JIRA for this and mark it as critical. We try
to fix cases where a query returns incorrect results as soon as possible.

Thank you again for all the effort you are putting into Drill.


On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> The workaround for this was to edit the first line in the json file and
> fake a value for the "additional" field.
> That way the optimizer could not decide to ignore it.
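The workaround described above can be scripted along these lines (a hypothetical Python sketch for newline-delimited JSON; the "n/a" placeholder values are made up, any non-null value would do):

```python
import json

# Synthetic first record carrying every field, so a prefix-based schema
# scan sees "additional" immediately. Placeholder values are hypothetical.
template = {"some": "n/a",
            "others": {"other": "n/a", "all": "n/a",
                       "sometimes": "n/a", "additional": "n/a"}}

def patch_first_line(ndjson_text):
    """Prepend the template record to newline-delimited JSON text."""
    return json.dumps(template) + "\n" + ndjson_text

data = '{"some":"yes","others":{"other":"true"}}\n'
patched = patch_first_line(data)
first = json.loads(patched.splitlines()[0])
assert "additional" in first["others"]
```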
>
> Someone must review the underlying optimization errors to prevent this from
> happening to others.
>
> JSON data, which is unstructured/schema-free in its nature, cannot be
> treated as consistent, predictable or monolithic.
>
> I hope this workaround/tip is useful for someone and that someone here
> cares enough to create a blocker issue in jira.
> (yeah, 14 hours of staring at this and related issues has left me a bit
> rude and I know I could focus on appreciating all the effort but that will
> have to wait just a bit longer)
>
> - Stefan
>
>
> On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > in addition to this.
> >
> > selecting: select some, t.others, t.others.additional from
> > dfs.tmp.`/test.json` as t;
> > - returns this: "yes", {"additional":"last entries only"}, "last entries
> > only"
> >
> > finding the previously missing value but then ignoring all the other
> > values of the sub structure.
> >
> > - Stefan
> >
> > On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <
> > ste...@activitystream.com> wrote:
> >
> >> - never returns this: "yes",
> >> {"other":"true","all":"false","sometimes":"yes"}
> >>
> >> should have been:
> >>
> >> - never returns this: "yes",
> >> {"other":"true","all":"false","sometimes":"yes","additional":"last
> >> entries only"}
> >>
> >> Regards,
> >>  -Stefan
> >>
> >> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <
> >> ste...@activitystream.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I keep coming across *quirks* in Drill that are quite time consuming to
> >>> deal with and are now causing mounting concerns.
> >>>
> >>> This last one though is far more serious than the previous ones because
> >>> it deals with loss of data.
> >>>
> >>> I'm working with a small(ish) dataset of around 1m records (which I'm
> >>> more than happy to hand over to replicate this)
> >>>
> >>> The problem goes like this:
> >>>
> >>>    1. with dfs.tmp.`/test.json`
> >>>    - containing a structure like this (simplified);
> >>>    - 800k x
> >>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> >>>    - 100k x
> >>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
> >>>
> >>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
> >>>    - returns only this for all the records: "yes",
> >>>    {"other":"true","all":"false","sometimes":"yes"}
> >>>    - never returns this:
> >>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
> >>>
> >>> The query never returns this:
> >>> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last
> >>> entries only"}, so the last entries in the file are incorrectly
> >>> represented.
> >>>
> >>> To make matters a lot worse, the property is completely ignored in:
> >>> create X as * from dfs.tmp.`/test.json`, and the resulting parquet file
> >>> does not include it at all.
> >>>
> >>> It looks, to me, like the dynamic schema discovery has stopped looking
> >>> for schema changes and is quite set in its ways; so set, in fact, that
> >>> it's ignoring data.
> >>>
> >>> I'm guessing that this is potentially affecting more people than me.
> >>>
> >>> I believe I have reproduced this under 1.1 and 1.2-SNAPSHOT.
> >>>
> >>> Regards,
> >>>  -Stefan
> >>>
> >>
> >>
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


