Rajat Venkatesh created HIVE-8467:
-------------------------------------
Summary: Table Copy - Background, incremental data load
Key: HIVE-8467
URL: https://issues.apache.org/jira/browse/HIVE-8467
Project: Hive
Issue Type: New Feature
Reporter: Rajat Venkatesh
Traditionally, Hive and other tools in the Hadoop ecosystem haven't required a
load stage. However, with recent developments, Hive is much more performant
when data is stored in specific formats such as ORC, Parquet, or Avro.
Technologies like Presto also work much better with certain data formats. At
the same time, data is often generated or obtained from 3rd parties in
non-optimal formats such as CSV, tab-delimited text, or JSON, and changing the
data format at the source is frequently not an option. We've found that users
either live with sub-optimal formats or spend a large amount of effort creating
and maintaining copies. We want to propose a new construct - Table Copy - to
help “load” data into an optimal storage format.
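As a concrete illustration, the manual workaround today is typically a CTAS copy plus hand-rolled incremental loads (the table, column, and variable names below are hypothetical, just to sketch the pattern):

```sql
-- Raw data lands in a text-format table, e.g. from a 3rd-party CSV feed.
CREATE EXTERNAL TABLE clicks_raw (
  user_id BIGINT,
  url     STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incoming/clicks';

-- Users hand-maintain an ORC copy for query performance ...
CREATE TABLE clicks_orc STORED AS ORC
AS SELECT * FROM clicks_raw;

-- ... and must schedule incremental re-loads themselves as new data arrives.
INSERT INTO TABLE clicks_orc
SELECT * FROM clicks_raw WHERE ts > '${last_load_ts}';
```

Table Copy would formalize this pattern so the background, incremental load is managed by Hive rather than by each user's scripts.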
I am going to attach a PDF document with many more details, especially
addressing how this differs from bulk loads in relational DBs or from
materialized views.
Looking forward to hearing whether others see a similar need to formalize
conversion of data to different storage formats. If so, are the details in the
PDF document a good start?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)