[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

Swarnim Kulkarni (JIRA) Sun, 31 Jan 2016 19:15:07 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125676#comment-15125676
 ]


Swarnim Kulkarni commented on HIVE-6147:
----------------------------------------

{quote}
It is pretty common to use schema-less avro objects in HBase.
{quote}

I am not sure that is true (or even possible). As far as I understand, you 
almost always have to provide the exact schema that was used to write the data 
when you deserialize it, and the best way to do that is to store the schema 
alongside the data. Schema evolution is also going to be a mess. Imagine 
writing a billion rows to HBase with one schema, which then evolves, and 
writing another billion rows with the new schema. How do you ensure the first 
billion rows are still correctly readable?

{quote}
(if there are billions of rows with objects of the same type, it is not 
reasonable to store the same schema in all of them) and it is not convenient to 
write a customer schema retriever for each such case.
{quote}

Correct. I agree it is inefficient to store the schema in every single cell. 
IMO, though, that isn't a good excuse for not writing the schema at all. A 
better design in this case is to use some kind of schema registry: use a custom 
serializer that writes the schema to the registry, generates an id of some 
kind, and persists the id along with the data. When you read the data, use the 
id to pull the schema from the registry and deserialize. That is also where a 
custom implementation of an AvroSchemaRetriever makes sense: it would know how 
to read your schema from the schema registry and hand it to Hive, and Hive 
would handle the deserialization from there on.
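
To make the registry idea above concrete, here is a minimal sketch using only 
the JDK. Every name in it (SchemaRegistrySketch, InMemoryRegistry, encodeCell, 
and so on) is hypothetical and not part of Hive, HBase, or Avro. It assumes a 
4-byte id prefix on each cell value, and an in-memory map stands in for a real 
registry service. A real AvroSchemaRetriever subclass would do the equivalent 
of lookup() against that service and return the parsed schema to Hive.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Hive, HBase, or Avro.
final class SchemaRegistrySketch {

    /** In-memory stand-in for an external schema registry (assumption). */
    static class InMemoryRegistry {
        private final Map<String, Integer> idsBySchema = new HashMap<>();
        private final Map<Integer, String> schemasById = new HashMap<>();

        /** Registers a schema once; registering the same text again returns the same id. */
        synchronized int register(String schemaJson) {
            Integer existing = idsBySchema.get(schemaJson);
            if (existing != null) {
                return existing;
            }
            int id = schemasById.size() + 1;
            idsBySchema.put(schemaJson, id);
            schemasById.put(id, schemaJson);
            return id;
        }

        /** Returns the schema text that was registered under the given id. */
        synchronized String lookup(int id) {
            String schema = schemasById.get(id);
            if (schema == null) {
                throw new IllegalArgumentException("Unknown schema id: " + id);
            }
            return schema;
        }
    }

    /** Builds the cell value: 4-byte schema id followed by the serialized record. */
    static byte[] encodeCell(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array();
    }

    /** Reads the schema id back from the first 4 bytes of a cell value. */
    static int schemaIdOf(byte[] cell) {
        return ByteBuffer.wrap(cell).getInt();
    }

    /** Returns the serialized record without the id prefix. */
    static byte[] payloadOf(byte[] cell) {
        return Arrays.copyOfRange(cell, 4, cell.length);
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();

        // Two versions of a schema; the JSON text here is only a placeholder.
        int v1 = registry.register("{\"type\":\"record\",\"name\":\"User\",\"fields\":[]}");
        int v2 = registry.register("{\"type\":\"record\",\"name\":\"User\",\"fields\":"
                + "[{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

        // Placeholder bytes standing in for Avro-encoded records.
        byte[] oldCell = encodeCell(v1, "old-row".getBytes(StandardCharsets.UTF_8));
        byte[] newCell = encodeCell(v2, "new-row".getBytes(StandardCharsets.UTF_8));

        // On read, each cell names the schema it was written with.
        for (byte[] cell : new byte[][] {oldCell, newCell}) {
            String writerSchema = registry.lookup(schemaIdOf(cell));
            System.out.println(schemaIdOf(cell) + " -> " + writerSchema);
        }
    }
}
```

Because each cell carries only a small id, the first billion rows stay readable 
after the schema evolves: the reader looks up the writer schema per cell rather 
than assuming a single schema for the whole column.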

> Support avro data stored in HBase columns
> -----------------------------------------
>
>                 Key: HIVE-6147
>                 URL: https://issues.apache.org/jira/browse/HIVE-6147
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Swarnim Kulkarni
>            Assignee: Swarnim Kulkarni
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt, 
> HIVE-6147.3.patch.txt, HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt, 
> HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
>
> Presently, the HBase Hive integration supports querying only primitive data 
> types in columns. It would be nice to be able to store and query Avro objects 
> in HBase columns by making them visible as structs to Hive. This will allow 
> Hive to perform ad hoc analysis of HBase data which can be deeply structured.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
