Stefan,

I took a look at the issue and I think I have a fix for the corruption you
are seeing. There have been a number of substantial commits to master,
including a refactoring of several modules, so I applied this change on
top of the 1.3 branch for you to build and try out. I would like to add
some additional test cases, at which point I will open an official PR
against master, and we will likely be able to pull it back onto the 1.3
branch for inclusion in the release.

Please try this out to see if there are remaining issues reading your data.

https://github.com/jaltekruse/incubator-drill/tree/4056-avro-corruption-bug

Thanks,
Jason



On Fri, Nov 13, 2015 at 2:58 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> So,
>
> Could someone point me to the appropriate place in the Drill code to start
> investigating this (We would love to contribute but getting up to speed is
> a bit much).
>
> I realize that there are many good things happening and that v. 1.3 is
> around the corner but it seems that I incorrectly assumed that data
> corruption issues would get a higher priority or that I would, at the very
> least, get someone to confirm such a bug.
>
> We are now impeded by this after having moved all our logging from JSON to
> Avro to avoid the schema related problems we have been running into with
> the JSON reader (null being interpreted as a double and then failing when
> a string eventually comes along).
>
> - Stefan
>
>
> On Wed, Nov 11, 2015 at 10:14 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > Can someone please verify that this is in fact a bug so I can rule out
> > our own mistakes?
> >
> > We have recently moved all our logging to Avro to compensate for schema
> > differences in JSON that were causing various problems, and our latest
> > release is now impeded by this.
> > Alternatively, can someone please point me in the right direction if I
> > were to try to fix this myself?
> >
> > Regards,
> >   -Stefán
> >
> > On Tue, Nov 10, 2015 at 2:41 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> >> Thank you Kamesh.
> >>
> >> I have created https://issues.apache.org/jira/browse/DRILL-4056 with
> >> the description.
> >> I will send you a confidential test file to your private email.
> >>
> >> Regards,
> >>  -Stefan
> >>
> >> On Tue, Nov 10, 2015 at 2:30 PM, Kamesh <kamesh.had...@gmail.com>
> >> wrote:
> >>
> >>> Hi Stefán,
> >>>  Could you please raise a Jira with a sample schema and sample input
> >>> to reproduce it? I will look into this.
> >>>
> >>> On Tue, Nov 10, 2015 at 7:55 PM, Stefán Baxter <
> >>> ste...@activitystream.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > I have an Avro file that supports the following data/schema:
> >>> >
> >>> > {"field":"some", "classification":{"variant":"Gæst"}}
> >>> >
> >>> > When I select 10 rows from this file I get:
> >>> >
> >>> > +---------------------+
> >>> > |       EXPR$0        |
> >>> > +---------------------+
> >>> > | Gæst                |
> >>> > | Voksen              |
> >>> > | Voksen              |
> >>> > | Invitation KIF KBH  |
> >>> > | Invitation KIF KBH  |
> >>> > | Ordinarie pris KBH  |
> >>> > | Ordinarie pris KBH  |
> >>> > | Biljetter 200 krBH  |
> >>> > | Biljetter 200 krBH  |
> >>> > | Biljetter 200 krBH  |
> >>> > +---------------------+
> >>> >
> >>> > The bug is that the field values are incorrectly de-serialized: the
> >>> > trailing characters of the previous row's value are retained if the
> >>> > subsequent row's value is shorter.
> >>> >
> >>> > The sql query:
> >>> >
> >>> > "select s.classification.variant variant from dfs.<some> as s limit 10;"
> >>> >
> >>> >
> >>> > That way "Ordinarie pris" becomes "Ordinarie pris KBH" because the
> >>> > previous row had the value "Invitation KIF KBH".
> >>> >
> >>> > Regards,
> >>> >   -Stefán
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Kamesh.
> >>>
> >>
> >>
> >
>
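
The symptom in the original report above (a shorter value keeping the tail
of the previous row's longer value) is what one would expect when a reused
byte buffer is decoded with a stale length. The following is only a minimal
Java sketch of that general failure mode under that assumption; it is not
Drill's actual reader code, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the Drill code base.
class StaleBufferDemo {
    // Buffer reused across rows (assumed here, to avoid per-row allocation).
    private final byte[] buffer = new byte[64];
    // Longest value length seen so far; using it for decoding is the bug.
    private int staleLength = 0;

    // Buggy variant: copies the new value but decodes with the stale length,
    // so leftover bytes from a longer previous value leak into the result.
    String readBuggy(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
        staleLength = Math.max(staleLength, bytes.length);
        return new String(buffer, 0, staleLength, StandardCharsets.UTF_8);
    }

    // Fixed variant: decodes exactly the bytes written for the current row.
    String readFixed(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
        return new String(buffer, 0, bytes.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StaleBufferDemo buggy = new StaleBufferDemo();
        System.out.println(buggy.readBuggy("Invitation KIF KBH"));
        // Prints "Ordinarie pris KBH", matching the reported output.
        System.out.println(buggy.readBuggy("Ordinarie pris"));

        StaleBufferDemo fixed = new StaleBufferDemo();
        System.out.println(fixed.readFixed("Invitation KIF KBH"));
        // Prints "Ordinarie pris".
        System.out.println(fixed.readFixed("Ordinarie pris"));
    }
}
```

Both values come from the query output quoted above. Whether the real cause
in the Avro reader is a stale length, a buffer that is not cleared, or
something else is not established in this thread.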
