@Vinoth Chandar <[email protected]> How does re-ordering cause a problem here, as you mentioned? Parquet files access fields by name rather than by index by default, so re-ordering should not matter. Please help me understand.
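
To make my question concrete, here is a minimal sketch I put together (plain Avro 1.8.x API, not Hudi code; the class and field names are just made up for illustration). It checks a reader schema whose fields are re-ordered against the original writer schema, and since Avro resolves record fields by name, the pair reports as COMPATIBLE:

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class ReorderCompatCheck {
  public static void main(String[] args) {
    // Writer schema: fields in the order the data was originally written with.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
        + "{\"name\":\"a\",\"type\":\"int\"},"
        + "{\"name\":\"b\",\"type\":\"string\"}]}");
    // Reader schema: same fields, re-ordered.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
        + "{\"name\":\"b\",\"type\":\"string\"},"
        + "{\"name\":\"a\",\"type\":\"int\"}]}");
    // Prints COMPATIBLE, because Avro matches record fields by name, not by position.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reader, writer).getType());
  }
}

So my understanding is that pure re-ordering should resolve cleanly; please correct me if Hudi/Parquet hits a different code path here.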
On Fri, Feb 7, 2020 at 11:53 AM Vinoth Chandar <[email protected]> wrote:

> @Pratyaksh Sharma <[email protected]> Please go ahead :)
>
> @Benoit, you are right about Parquet deletion, I think.
>
> Come to think of it, with an initial schema in place, how would we even
> drop a field? All of the old data would need to be rewritten
> (prohibitively expensive)? So all we will end up doing is simply masking
> the field from queries by mapping old data to the current schema? This
> can get messy pretty quickly if field re-ordering is allowed, for
> example. What we do/advise now is to alternatively embrace a more
> brittle schema management at the write side (no renames, no dropping
> fields, all fields are nullable) and ensure the reader schema is simpler
> to manage. There is probably a middle ground here somewhere.
>
> On Thu, Feb 6, 2020 at 12:10 PM Pratyaksh Sharma <[email protected]>
> wrote:
>
>> @Vinoth Chandar <[email protected]> I would like to drive this.
>>
>> On Fri, Feb 7, 2020 at 1:08 AM Benoit Rousseau <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I think deleting a field is supported with Avro, both backward and
>>> forward, as long as the field is optional and provides a default value.
>>>
>>> A simple example of an Avro optional field defined using a union type
>>> and a default value:
>>> { "name": "foo", "type": ["null", "string"], "default": null }
>>> Readers will use the default value when the field is not present.
>>>
>>> I believe the problem here is Parquet, which does not support field
>>> deletion. One option is to set the Parquet field value to null. Parquet
>>> will use RLE encoding to efficiently encode all the null values in the
>>> "deleted" field.
>>>
>>> Regards,
>>> Benoit
>>>
>>> On 6 Feb 2020, at 17:57, Nishith <[email protected]> wrote:
>>>
>>>> Pratyaksh,
>>>>
>>>> Deleting fields isn't Avro schema backwards compatible. Hudi relies on
>>>> Avro schema evolution rules, which help prevent breaking existing
>>>> queries on such tables - say someone was querying the field that is
>>>> now deleted. You can read more here ->
>>>> https://avro.apache.org/docs/1.8.2/spec.html
>>>> That being said, I'm also looking at how we can support schema
>>>> evolution slightly differently - some things could be more in our
>>>> control and not break reader queries - but that's not in the near
>>>> future.
>>>>
>>>> Thanks
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Vinoth,
>>>>>
>>>>> We do not have any standard documentation for the said approach as it
>>>>> was self thought through. Just logging a conversation from the
>>>>> #general channel for the record -
>>>>>
>>>>> "Hello people, I'm doing a POC to use HUDI in our data pipeline, but
>>>>> I got an error and I didnt find any solution for this... I wrote some
>>>>> parquet files with HUDI using INSERT_OPERATION_OPT_VAL,
>>>>> MOR_STORAGE_TYPE_OPT_VAL and sync with hive and worked perfectly. But
>>>>> after that, I try to wrote another file in the same table (with some
>>>>> schema changes, just delete and add some columns) and got this error
>>>>> Caused by: org.apache.parquet.io.InvalidRecordException:
>>>>> Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone
>>>>> know what to do?"
>>>>>
>>>>> On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> In my experience, you need to follow some rules when evolving the
>>>>>> schema and keep the data backwards compatible. The only other option
>>>>>> is to rewrite the entire dataset :), which is very expensive.
>>>>>>
>>>>>> If you have some pointers to learn more about the approach you are
>>>>>> suggesting, I am happy to read up.
>>>>>>
>>>>>> On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Vinoth,
>>>>>>>
>>>>>>> As you explained above, and as per what is mentioned in this FAQ
>>>>>>> (https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory),
>>>>>>> Hudi is able to handle schema evolution only if the schema is
>>>>>>> *backwards compatible*. What about the case when it is backwards
>>>>>>> incompatible? This might be the case when, for some reason, you are
>>>>>>> unable to enforce things like not deleting fields or not changing
>>>>>>> the order. Ideally we should be foolproof and able to support
>>>>>>> schema evolution in every case possible. In such a case, creating
>>>>>>> an uber schema can be useful. WDYT?
>>>>>>>
>>>>>>> On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Syed,
>>>>>>>>
>>>>>>>> Typically, I have seen the Confluent/Avro schema registry used as
>>>>>>>> the source of truth, and the Hive schema is just a translation.
>>>>>>>> That's how the hudi-hive sync also works.
>>>>>>>> Have you considered making fields optional in the Avro schema so
>>>>>>>> that even if the source data does not have a few of them, there
>>>>>>>> will be nulls?
>>>>>>>> In general, the two places where I have dealt with this both made
>>>>>>>> it work using the schema evolution rules Avro supports, and by
>>>>>>>> enforcing things like not deleting fields, not changing order, etc.
>>>>>>>>
>>>>>>>> Hope that at least helps a bit.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vinoth
>>>>>>>>
>>>>>>>> On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> We pull data from Kafka generated by Debezium. The schema is
>>>>>>>>> maintained in the schema registry by the Confluent framework
>>>>>>>>> during the population of data.
>>>>>>>>>
>>>>>>>>> *Problem Statement Here:*
>>>>>>>>>
>>>>>>>>> All column additions/deletions are maintained in the schema
>>>>>>>>> registry. While running the Hudi pipeline, we have a custom schema
>>>>>>>>> registry that pulls the latest schema from the schema registry as
>>>>>>>>> well as from the Hive metastore, and we create an uber schema (so
>>>>>>>>> that columns missing from the schema registry will be pulled from
>>>>>>>>> the Hive metastore). Is there any better approach to solve this
>>>>>>>>> problem?
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> S SYED ABDUL KATHER
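
PS: regarding Benoit's suggestion in the quoted thread above (optional field with a null default), here is a small self-contained sketch of how I read it (plain Avro API, not Hudi code; the class and field names are made up for illustration). A record written without the optional field is read back with the newer reader schema, and the null default is filled in:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class OptionalFieldDefaultSketch {
  public static void main(String[] args) throws IOException {
    // Old writer schema: no "foo" field at all.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"}]}");
    // Newer reader schema: adds an optional "foo" with a null default.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"},"
        + "{\"name\":\"foo\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Write a record using the old schema.
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("id", 1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Read it back with the new schema; "foo" gets its null default.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(decoded);  // {"id": 1, "foo": null}
  }
}

If I understand the discussion correctly, the open question is really the reverse direction in Parquet, i.e. actually dropping an existing column rather than just nulling it out.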
