Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Stefán Baxter Thu, 23 Jul 2015 03:55:29 -0700

Hi,

The workaround for this was to edit the first line in the json file and
fake a value for the "additional" field.
That way the optimizer could not decide to ignore it.
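
A minimal sketch of that edit in Python, assuming the file holds one JSON record per line. The function name `seed_first_record` and the placeholder value "fake" are made up for illustration; whether an empty string or null would work equally well is not something I have verified.

```python
import json

def seed_first_record(lines, parent="others", field="additional", placeholder="fake"):
    """Return a copy of `lines` (one JSON record per line) in which the first
    record carries `field` inside `parent`.

    Hypothetical helper: the names and the placeholder are illustrative only.
    """
    if not lines:
        return []
    first = json.loads(lines[0])
    # setdefault keeps a real value if the first record already has one.
    first.setdefault(parent, {}).setdefault(field, placeholder)
    return [json.dumps(first, separators=(",", ":"))] + list(lines[1:])

records = [
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}',
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}',
]
patched = seed_first_record(records)
print(patched[0])
```

Only the first line changes; every other record is passed through untouched.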


Someone must review the underlying optimization errors to prevent this from
happening to others.

JSON data, which is unstructured/schema-free by its nature, cannot be
treated as consistent, predictable or monolithic.
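
To check whether a file is exposed to this, something like the following could list the fields that only show up after the first record. It is a rough sketch, assuming one JSON record per line and a single nested object of interest; `late_fields` is a hypothetical name, not anything from Drill.

```python
import json

def late_fields(lines, parent="others"):
    """Map each field under `parent` that is absent from the first record
    to the 1-based line number where it first appears."""
    first_keys = None
    late = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        keys = json.loads(line).get(parent, {}).keys()
        if first_keys is None:
            first_keys = set(keys)  # fields present in the first record
            continue
        for key in keys:
            if key not in first_keys and key not in late:
                late[key] = lineno
    return late

sample = [
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}',
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}',
    '{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}',
]
print(late_fields(sample))
```

Any field reported here is a candidate for the first-line workaround above.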

I hope this workaround/tip is useful for someone and that someone here
cares enough to create a blocker issue in jira.
(yeah, 14 hours of staring at this and related issues has left me a bit
rude and I know I could focus on appreciating all the effort but that will
have to wait just a bit longer)

- Stefan


On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> in addition to this.
>
> selecting: select some, t.others, t.others.additional from dfs.tmp.`/test.json` as t;
> - returns this: "yes", {"additional":"last entries only"}, "last entries only"
>
> finding the previously missing value but then ignoring all the other
> values of the sub structure.
>
> - Stefan
>
> On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <ste...@activitystream.com
> > wrote:
>
>> - never returns this: "yes", {"other":"true","all":"false","sometimes":"yes"}
>>
>> should have been:
>>
>> - never returns this: "yes", {"other":"true","all":"false","sometimes":"yes", "additional":"last entries only"}
>>
>> Regards,
>>  -Stefan
>>
>> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <
>> ste...@activitystream.com> wrote:
>>
>>> Hi,
>>>
>>> I keep coming across *quirks* in Drill that are quite time consuming to
>>> deal with and are now causing mounting concerns.
>>>
>>> This last one, though, is far more serious than the previous ones because
>>> it deals with loss of data.
>>>
>>> I'm working with a small(ish) dataset of around 1m records (which I'm
>>> more than happy to hand over to replicate this)
>>>
>>> The problem goes like this:
>>>
>>>    1. with dfs.tmp.`/test.json`
>>>    - containing a structure like this (simplified);
>>>    - 800k x
>>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>>>    - 100k x
>>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
>>>
>>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
>>>    - returns only this for all the records: "yes",
>>>    {"other":"true","all":"false","sometimes":"yes"}
>>>    - never returns this:
>>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
>>>
>>> The query never returns this:
>>> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}
>>> so the last entries in the file are incorrectly represented.
>>>
>>> To make matters a lot worse, the property is completely ignored in:
>>> create table X as select * from dfs.tmp.`/test.json`, and the resulting
>>> Parquet file does not include it at all.
>>>
>>> It looks, to me, as if the dynamic schema discovery has stopped looking
>>> for schema changes and is quite set in its ways; so set, in fact, that it's
>>> ignoring data.
>>>
>>> I'm guessing that this is potentially affecting more people than me.
>>>
>>> I believe I have reproduced this under 1.1 and 1.2-SNAPSHOT.
>>>
>>> Regards,
>>>  -Stefan
>>>
>>
>>
>
