[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

Gunther Hagleitner (JIRA) Thu, 16 Oct 2014 00:01:55 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173455#comment-14173455
 ]


Gunther Hagleitner commented on HIVE-8467:
------------------------------------------

Materialized views don't necessarily have to keep the tables in sync, do they? 
Other vendors allow deferred refreshes and let the user specify integrity 
levels. I.e., you can still put the onus on the user, and you don't necessarily 
have to offer a background sync method (you can choose to add additional 
options later).

As far as other engines go, you have the same problem, right? You can expose 
the table copy or view, but the smarts for how and when to rewrite queries have 
to be built into each of those, or left to the user. With materialized views, 
other engines will also know how the tables are derived, which seems beneficial 
(well, if they speak SQL at least). For Pig and MR you will likely have to bake 
assumptions into the scripts/code.

Could you say more about retention policy, max size, and in general how you 
have seen people choose which partitions to add to the table copy? Is it 
typically the newest n partitions? Or the last month of data? That'd be 
interesting, to see whether it can be mapped onto materialized views and how 
hard it'd be for the CBO to handle it.


> Table Copy - Background, incremental data load
> ----------------------------------------------
>
>                 Key: HIVE-8467
>                 URL: https://issues.apache.org/jira/browse/HIVE-8467
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Rajat Venkatesh
>         Attachments: Table Copies.pdf
>
>
> Traditionally, Hive and other tools in the Hadoop ecosystem haven't required 
> a load stage. However, with recent developments, Hive is much more performant 
> when data is stored in specific formats like ORC, Parquet, Avro, etc. 
> Technologies like Presto also work much better with certain data formats. At 
> the same time, data is generated or obtained from third parties in non-optimal 
> formats such as CSV, tab-delimited, or JSON. Many times, it's not an option to 
> change the data format at the source. We've found that users either use 
> sub-optimal formats or spend a large amount of effort creating and 
> maintaining copies. We want to propose a new construct - Table Copy - to help 
> “load” data into an optimal storage format.
> I am going to attach a PDF document with a lot more details, especially 
> addressing how this is different from bulk loads in relational DBs or 
> materialized views.
> Looking forward to hearing whether others see a similar need to formalize 
> conversion of data to different storage formats. If yes, are the details in 
> the PDF document a good start?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
