Rajat Venkatesh created HIVE-8467:
-------------------------------------

             Summary: Table Copy - Background, incremental data load
                 Key: HIVE-8467
                 URL: https://issues.apache.org/jira/browse/HIVE-8467
             Project: Hive
          Issue Type: New Feature
            Reporter: Rajat Venkatesh


Traditionally, Hive and other tools in the Hadoop ecosystem haven't required a 
load stage. However, with recent developments, Hive is much more performant 
when data is stored in specific formats like ORC, Parquet, Avro etc. 
Technologies like Presto also work much better with certain data formats. At 
the same time, data is generated or obtained from 3rd parties in non-optimal 
formats such as CSV, tab-delimited or JSON. Often, it's not an option to 
change the data format at the source. We've found that users either use 
sub-optimal formats or spend a large amount of effort creating and maintaining 
copies. We want to propose a new construct - Table Copy - to help "load" data 
into an optimal storage format.
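To illustrate the pain point, the hand-maintained copy described above usually looks something like the following HiveQL sketch (the table and column names here are hypothetical, not part of the proposal):

```sql
-- Raw data lands as CSV in an external table (illustrative schema)
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';

-- Hand-maintained copy in an optimized format
CREATE TABLE events_orc STORED AS ORC
AS SELECT * FROM raw_events;

-- Every new batch of raw data must be re-loaded manually;
-- the user is responsible for tracking which rows are new.
INSERT INTO TABLE events_orc
SELECT * FROM raw_events;
```

A Table Copy construct would formalize this pattern so that the system, rather than the user, keeps the optimized copy up to date via background, incremental loads.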

I am going to attach a PDF document with a lot more details, especially 
addressing how this differs from bulk loads in relational DBs and from 
materialized views.

Looking forward to hearing whether others see a similar need to formalize 
conversion of data to different storage formats. If so, are the details in the 
PDF document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
