Hi Vinoth,

We do not have any standard documentation for the said approach, as it was
thought through on our own. Just logging a conversation from the #general
channel for the record -

"Hello people, I'm doing a POC to use HUDI in our data pipeline, but I got
an error and I didnt find any solution for this... I wrote some parquet
files with HUDI using INSERT_OPERATION_OPT_VAL, MOR_STORAGE_TYPE_OPT_VAL
and sync with hive and worked perfectly. But after that, I try to wrote
another file in the same table (with some schema changes, just delete and
add some columns) and got this error Caused by:
org.apache.parquet.io.InvalidRecordException:
Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone know
what to do?"

On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]> wrote:

> In my experience, you need to follow some rules when evolving the schema
> and keep the data backwards compatible. Otherwise the only other option is
> to rewrite the entire dataset :), which is very expensive.
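>
> For example, one cheap guard is to check the incoming schema both ways
> against the previous one before writing (just a sketch using Avro's own
> compatibility checker; the schemas below are made up):
>
> import org.apache.avro.{Schema, SchemaCompatibility}
> import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType
>
> val oldSchema = new Schema.Parser().parse(
>   """{"type":"record","name":"r","fields":[
>     |  {"name":"id","type":"long"},
>     |  {"name":"name","type":"string"}]}""".stripMargin)
>
> // Backwards-compatible change: a new nullable field with a default.
> val newSchema = new Schema.Parser().parse(
>   """{"type":"record","name":"r","fields":[
>     |  {"name":"id","type":"long"},
>     |  {"name":"name","type":"string"},
>     |  {"name":"email","type":["null","string"],"default":null}]}""".stripMargin)
>
> def canRead(reader: Schema, writer: Schema): Boolean =
>   SchemaCompatibility.checkReaderWriterCompatibility(reader, writer)
>     .getType == SchemaCompatibilityType.COMPATIBLE
>
> // The new schema must read old files, and old readers should still cope
> // with new data; deleting 'name' (no default) would fail the second check.
> val safeToEvolve =
>   canRead(newSchema, oldSchema) && canRead(oldSchema, newSchema)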
>
> If you have some pointers to learn more about any approach you are
> suggesting, happy to read up.
>
> On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi Vinoth,
> >
> > As you explained above and as per what is mentioned in this FAQ (
> > https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory
> > ),
> > Hudi is able to maintain schema evolution only if the schema is
> > *backwards compatible*. What about the case when it is backwards
> > incompatible? This might be the case when for some reason you are unable
> > to enforce things like not deleting fields or not changing the order.
> > Ideally we should be foolproof and be able to support schema evolution
> > in every case possible. In such a case, creating an uber schema can be
> > useful. WDYT?
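> >
> > Roughly, by an uber schema I mean a superset of the old and new fields,
> > so neither side's files become unreadable. A sketch with the Avro Java
> > API (names are made up; fields present on only one side should be
> > nullable/defaulted):
> >
> > import org.apache.avro.Schema
> > import scala.collection.JavaConverters._
> >
> > // Union of the fields of two record schemas, preferring the newer
> > // definition when a field exists in both (assumes Avro 1.8+).
> > def uberSchema(oldSchema: Schema, newSchema: Schema): Schema = {
> >   val merged = new java.util.LinkedHashMap[String, Schema.Field]()
> >   (oldSchema.getFields.asScala ++ newSchema.getFields.asScala).foreach { f =>
> >     // Field objects cannot be reused across schemas, so copy them.
> >     merged.put(f.name(),
> >       new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
> >   }
> >   val result = Schema.createRecord(newSchema.getName, newSchema.getDoc,
> >     newSchema.getNamespace, false)
> >   result.setFields(new java.util.ArrayList(merged.values()))
> >   result
> > }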
> >
> > On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]>
> > wrote:
> >
> > > Hi Syed,
> > >
> > > Typically, I have seen the Confluent/Avro schema registry used as the
> > > source of truth, and the Hive schema is just a translation. That's how
> > > the hudi-hive sync also works..
> > > Have you considered making fields optional in the Avro schema, so that
> > > even if the source data does not have a few of them, they will just be
> > > null..
> > > In general, the two places where I have dealt with this both made it
> > > work using the schema evolution rules Avro supports.. and by enforcing
> > > things like not deleting fields, not changing the order, etc.
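> > >
> > > Concretely, something like this for the optional-field part (just a
> > > sketch; the record and field names are made up):
> > >
> > > import org.apache.avro.Schema
> > > import org.apache.avro.generic.GenericRecordBuilder
> > >
> > > val schema = new Schema.Parser().parse(
> > >   """{"type":"record","name":"user","fields":[
> > >     |  {"name":"id","type":"long"},
> > >     |  {"name":"email","type":["null","string"],"default":null}]}""".stripMargin)
> > >
> > > // 'email' is optional: a record built without it gets the default
> > > // (null) instead of failing, so sources that stop sending it keep working.
> > > val record = new GenericRecordBuilder(schema).set("id", 1L).build()
> > > // record.get("email") == null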
> > >
> > > Hope that at least helps a bit
> > >
> > > thanks
> > > vinoth
> > >
> > > On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <[email protected]>
> > > wrote:
> > >
> > > > Hi Team,
> > > >
> > > > We pull data from Kafka that is generated by Debezium. The schema is
> > > > maintained in the schema registry by the Confluent framework as the
> > > > data is populated.
> > > >
> > > > *Problem Statement Here:*
> > > >
> > > > All additions/deletions of columns are maintained in the schema
> > > > registry. While running the Hudi pipeline, we have a custom schema
> > > > provider that pulls the latest schema from the schema registry as
> > > > well as from the Hive metastore, and we create an uber schema (so
> > > > that columns missing from the schema registry will be pulled from the
> > > > Hive metastore). But is there any better approach to solve this
> > > > problem?
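> > > >
> > > > For the record, the registry half of it looks roughly like this (a
> > > > sketch; the registry URL and subject name are placeholders):
> > > >
> > > > import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
> > > > import org.apache.avro.Schema
> > > >
> > > > // Latest value schema that Debezium/Connect registered for the topic.
> > > > val registry =
> > > >   new CachedSchemaRegistryClient("http://schema-registry:8081", 100)
> > > > val latest = registry.getLatestSchemaMetadata("dbserver.db.table-value")
> > > > val registrySchema = new Schema.Parser().parse(latest.getSchema)
> > > >
> > > > // The Hive metastore schema is read separately and the two are
> > > > // unioned into the uber schema (same field-merge idea as above), so
> > > > // columns dropped upstream still show up as nullable in the target.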
> > > >
> > > >
> > > >
> > > >
> > > >             Thanks and Regards,
> > > >         S SYED ABDUL KATHER
> > > >
> > >
> >
>
