[ https://issues.apache.org/jira/browse/HIVE-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697368#comment-14697368 ]

Alan Gates commented on HIVE-11500:
-----------------------------------

bq. I think we should use YAGNI principle...
That's fine, but it means on the next interface you want to add you have to 
convince me that you should add another set of calls rather than refactor this 
one to be generic.

bq.  Having many methods on metastore is not really that big of a deal, since 
they do different things.
I disagree.  Having just implemented a new version of RawStore I can tell you 
that it took me a long time to understand the nuances of why there are five 
different ways to fetch partitions.  I'm still not sure I have it all straight.  
We should not just add a new call each time because it's the shortest path to 
get the new thing working.  We need to think about code maintenance and 
understandability for future developers.

> implement file footer / splits cache in HBase metastore
> -------------------------------------------------------
>
>                 Key: HIVE-11500
>                 URL: https://issues.apache.org/jira/browse/HIVE-11500
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HBase metastore split cache.pdf
>
>
> We need to cache file metadata (e.g. ORC file footers) for split generation 
> (which, on FSes that support fileId, will be valid permanently and only needs 
> to be removed lazily when ORC file is erased or compacted), and potentially 
> even some information about splits (e.g. grouping based on location that 
> would be good for some short time), in HBase metastore.
> -It should be queryable by table. Partition predicate pushdown should be 
> supported. If bucket pruning is added, that too.- Given that we cannot cache 
> file lists (we have to check FS for new/changed files anyway), and the 
> difficulty of passing data about partitions/etc. to split generation 
> compared to paths, we will probably just filter by paths and fileIds. It 
> might be different for splits.
> In later phases, it would be nice to save the (first category above) results 
> of expensive work done by jobs, e.g. data size after decompression/decoding 
> per column, etc. to avoid surprises when ORC encoding is very good, or very 
> bad. Perhaps it can even be lazily generated. Here's a pony: 🐴



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
