[ 
https://issues.apache.org/jira/browse/HIVE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904560#comment-13904560
 ] 

Brock Noland commented on HIVE-860:
-----------------------------------

bq. Can we keep a single bundle for the hive internal pieces? I think that's 
orthogonal to the caching effort and seems more efficient to me than breaking 
it all into smaller bits and also lets us shade what needs shading. It also 
doesn't change how we handle these things as drastically.

The issue, as I understand it, is that Hadoop unjars the jar it ships to the 
cluster by default. In Hive's case this is the hive-exec jar. I got this from 
here: 
https://issues.apache.org/jira/browse/PIG-2672?focusedCommentId=13263874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13263874
and here: https://issues.apache.org/jira/browse/HCATALOG-385

This means that having a large hive-exec jar causes a large penalty for each 
query. That is one of the reasons I undid the uber hive-exec jar. The other 
reason is that it's a bad practice to build uber jars as the main artifact for 
a module. One of the issues is that users cannot use their own version of the 
libraries packed into the uber jar and in fact it's often a frustrating ordeal 
to figure out why your version of the library is not taking effect.

For these reasons I think we should make the uber jar as small as possible and 
include only the things we are specifically shading.

bq. Seems in pig they made the caching optional - can we do that too? In case 
someone has issues with caching it in the user directory?

This is a good idea, I will update the patch to do this.
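
To illustrate what "optional" could mean here, below is a minimal Python 
sketch. It is not the patch: it uses the local filesystem in place of HDFS, 
and the names `localize_jar`, `cache_root` and `use_persistent_cache` are 
hypothetical. With the flag on, a jar is stored once under a directory named 
after its md5 and reused; with the flag off, every call makes a private copy, 
which roughly mirrors the old behaviour.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def localize_jar(jar: Path, cache_root: Path, use_persistent_cache: bool) -> Path:
    """Return the path a job should reference for `jar`.

    `use_persistent_cache` is a hypothetical switch standing in for a
    user-facing configuration property.
    """
    if not use_persistent_cache:
        # Caching disabled: copy into a fresh per-job directory every time.
        scratch = Path(tempfile.mkdtemp(prefix="job-", dir=str(cache_root)))
        return Path(shutil.copy2(jar, scratch / jar.name))

    # Caching enabled: the location depends only on content and name,
    # so an identical jar is uploaded once and never overwritten.
    target = cache_root / file_md5(jar) / jar.name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(jar, target)
    return target
```

Keying the directory on the md5 means a rebuilt jar with the same name lands 
in a different location instead of replacing the copy other jobs may be using.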

bq. Finally, a thought for file formats: it would be nice to only pull the 
dependencies when they are actually needed not every time you run a query. That 
way you're not penalized for adding as many as you want and external serdes can 
play too. We could extend the serde API with an optional call to retrieve 
additional jars to be localized.

I agree, this would be ideal. I think it's future work though. This change 
speeds up queries by reusing the majority of the old hive-exec jar on each 
query so I don't want to hold off on "good" for "best".
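
For that future work, a rough Python sketch of the optional serde call might 
look like the following. This is an illustration only: Hive's real SerDe 
interface is Java, and `additional_jars`, the class names and the jar file 
names are all hypothetical. The default returns nothing, so existing serdes 
would be unaffected, and only the serdes a query actually uses contribute jars.

```python
from typing import Iterable, List


class SerDe:
    """Hypothetical base class; the hook is optional and defaults to empty."""

    def additional_jars(self) -> List[str]:
        return []


class TextSerDe(SerDe):
    """Needs nothing beyond the core jar, so it keeps the default."""


class ColumnarSerDe(SerDe):
    """Example external serde that declares its own dependencies."""

    def additional_jars(self) -> List[str]:
        return ["columnar-format.jar", "columnar-codec.jar"]


def jars_for_query(serdes: Iterable[SerDe]) -> List[str]:
    """Collect the jars to localize for one query, de-duplicated in order."""
    seen = set()
    result = []
    for serde in serdes:
        for jar in serde.additional_jars():
            if jar not in seen:
                seen.add(jar)
                result.append(jar)
    return result
```

A query that touches only `TextSerDe` tables would then ship no extra jars at 
all.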

> Persistent distributed cache
> ----------------------------
>
>                 Key: HIVE-860
>                 URL: https://issues.apache.org/jira/browse/HIVE-860
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, 
> HIVE-860.patch, HIVE-860.patch
>
>
> DistributedCache is shared across multiple jobs, if the hdfs file name is the 
> same.
> We need to make sure Hive puts the same file into the same location every 
> time and does not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session 
> will have a single copy in distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single 
> copy in distributed cache.
> A2 has a bigger benefit in sharing but may raise a question on when Hive 
> should clean it up in hdfs.
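
The difference between A1 and A2 can be read as a difference in how the cache 
location is derived. The Python sketch below is only one possible reading of 
the description: `CachedFile`, `cache_key` and the path layout are 
hypothetical and not taken from any patch. Passing a session id scopes the 
entry to that session (A1); omitting it shares the entry across sessions (A2).

```python
from typing import NamedTuple, Optional


class CachedFile(NamedTuple):
    name: str
    timestamp: int  # modification time, e.g. in milliseconds
    md5: str        # hex digest of the file content


def cache_key(f: CachedFile, session_id: Optional[str] = None) -> str:
    """Build a relative cache path from name, timestamp and md5.

    With a session_id the path is private to that session (A1);
    without one it is shared by every session (A2).
    """
    base = f"{f.md5}-{f.timestamp}/{f.name}"
    return base if session_id is None else f"{session_id}/{base}"
```

Under A1 the session directory can be deleted when the session ends, while 
under A2 no single session owns the entry, which is where the cleanup 
question comes from.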



