Hi Abdel,

Thank you for taking the time to respond. I know my frustration is leaking
through, but that does not mean I don't appreciate everything you and the
Drill team are doing. I do.

I also understand the premise of the optimization, but I find it too
restrictive, and it certainly does not fit our data, where our customers are
responsible for part of the data (mixed schema).

Parquet seems to know how to deal with new properties, and some systems,
like Spark, know how to deal with changes in (schema) property types over
time.

The main concern is this: Drill is a great attempt to support variable data
sources and evolving/mixed/variable schemas, but it seems to be doing so
with a lot of restrictions that are counterproductive to that goal.

I have had tests fail because fields are incorrectly assumed to be numbers
when enough of them, up front, are null values, and then the system blows up
when a string finally shows up. I have had string values containing numbers
blow up because previous values were "real numbers", without any attempt
being made to convert the value to the "strict representation".
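
To illustrate the first case, here is a sketched reproduction (the file name
and field are made up, this is not our actual data):

  /tmp/nulls.json:
    {"v":null}     <- repeated for the first few thousand rows
    {"v":"abc"}    <- a string finally shows up

  select t.v from dfs.tmp.`/nulls.json` as t;

The leading nulls get typed as a numeric column, and the query then blows up
when the string value arrives, instead of widening the column or converting
the value.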

This, in addition to the other problems I have come across (in JIRA), has
led me to believe that Drill is great, just as long as all your data has
been sanitized. That is a shame, because the "time to value" and
"dynamic schema discovery" promises fall a bit short under those
circumstances. I hope this will mature nicely over time and become more
forgiving and understanding of "data in the wild".

For now I need to be able to change that 1000-row limit or eliminate it
altogether; please let me know if that can be done.
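
If the limit itself cannot be changed, is all-text mode the intended escape
hatch? I am thinking of something along these lines (assuming these session
options do what the documentation suggests; they would side-step the type
guessing at the cost of having to cast everything back manually):

  alter session set `store.json.all_text_mode` = true;
  alter session set `store.json.read_numbers_as_double` = true;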

Ultimately, I fail to understand how a change within a single JSON file can
ever be considered an "unsupported schema change".

Very best,
 -Stefan

On Thu, Jul 23, 2015 at 2:53 PM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Hi Stefan,
>
> Sorry to hear about your misadventures in Drill land. I will try to give you
> some more information, but I also have limited knowledge of this specific
> case, and other developers will probably jump in to correct me.
>
> When you try to read schema-less data, Drill will first investigate the
> first 1000 rows to figure out a schema for your data, then it will use this
> schema for the remainder of the query. This explains why your workaround
> works: Drill was able to "figure out" that extra field, and will fill it
> with null whenever it's missing.
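>
> As a sketch of that behavior (a made-up file, just to illustrate):
>
>   /tmp/late.json:
>     {"a":1}          <- rows 1..1000 contain only "a"
>     {"a":1,"b":"x"}  <- "b" first appears after the sampled rows
>
>   select t.a, t.b from dfs.tmp.`/late.json` as t;
>
> If "b" shows up inside the first 1000 rows, it becomes part of the schema
> and is null-filled where missing; if it only shows up later, you run into
> the behavior you are describing.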
>
> To my knowledge, when the data contains a "schema change" (like in your
> case), you generally get an error message stating that "operator X doesn't
> support schema changes", so I am not sure why you are getting incorrect
> results in this case.
>
> You should definitely file a JIRA for this and mark it as critical. We try
> to fix cases where a query returns incorrect results as soon as possible.
>
> Thank you again for all the effort you are putting into Drill.
>
>
> On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > The workaround for this was to edit the first line in the JSON file and
> > fake a value for the "additional" field.
> > That way the optimizer could not decide to ignore it.
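> >
> > Concretely (sketched from the test data below), that means changing the
> > first line to something like:
> >
> >   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":""}}
> >
> > so that the "additional" field is already present when the schema is
> > inferred.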
> >
> > Someone must review the underlying optimization errors to prevent this
> > from happening to others.
> >
> > JSON data, which is unstructured/schema-free in its nature, cannot be
> > treated as consistent, predictable, or monolithic.
> >
> > I hope this workaround/tip is useful for someone and that someone here
> > cares enough to create a blocker issue in JIRA.
> > (Yeah, 14 hours of staring at this and related issues has left me a bit
> > rude, and I know I could focus on appreciating all the effort, but that
> > will have to wait just a bit longer.)
> >
> > - Stefan
> >
> >
> > On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > In addition to this:
> > >
> > > selecting: select some, t.others, t.others.additional from
> > > dfs.tmp.`/test.json` as t;
> > > - returns this: "yes", {"additional":"last entries only"}, "last
> > > entries only"
> > >
> > > finding the previously missing value but then ignoring all the other
> > > values of the substructure.
> > >
> > > - Stefan
> > >
> > > On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <ste...@activitystream.com>
> > > wrote:
> > >
> > >> - never returns this: "yes", {"other":"true","all":"false","sometimes":"yes"}
> > >>
> > >> should have been:
> > >>
> > >> - never returns this: "yes", {"other":"true","all":"false","sometimes":"yes",
> > >> "additional":"last entries only"}
> > >>
> > >> Regards,
> > >>  -Stefan
> > >>
> > >> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <ste...@activitystream.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I keep coming across *quirks* in Drill that are quite time-consuming
> > >>> to deal with and are now causing mounting concerns.
> > >>>
> > >>> This last one, though, is far more serious than the previous ones,
> > >>> because it deals with loss of data.
> > >>>
> > >>> I'm working with a small(ish) dataset of around 1M records (which I'm
> > >>> more than happy to hand over to replicate this).
> > >>>
> > >>> The problem goes like this:
> > >>>
> > >>>    1. with dfs.tmp.`/test.json`
> > >>>    - containing a structure like this (simplified);
> > >>>    - 800k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> > >>>    - 100k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
> > >>>
> > >>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
> > >>>    - returns only this for all the records: "yes",
> > >>>    {"other":"true","all":"false","sometimes":"yes"}
> > >>>    - never returns this:
> > >>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
> > >>>
> > >>> The query never returns this:
> > >>> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last
> > >>> entries only"}, so the last entries in the file are incorrectly
> > >>> represented.
> > >>>
> > >>> To make matters a lot worse, the property is completely ignored in:
> > >>> create table X as select * from dfs.tmp.`/test.json`, and the resulting
> > >>> Parquet file does not include it at all.
> > >>>
> > >>> It looks, to me, like the dynamic schema discovery has stopped looking
> > >>> for schema changes and is quite set in its ways; so set, in fact, that
> > >>> it's ignoring data.
> > >>>
> > >>> I'm guessing that this is potentially affecting more people than me.
> > >>>
> > >>> I believe I have reproduced this under 1.1 and 1.2-SNAPSHOT.
> > >>>
> > >>> Regards,
> > >>>  -Stefan
> > >>>
> > >>
> > >>
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
