[ 
https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934924#comment-15934924
 ] 

Robert Kanter commented on OOZIE-2821:
--------------------------------------

Using the local filesystem as the input is a good idea.  We could use that 
instead of uploading directly.

I'm not sure I follow the part about a zip though.  The thing about HAR 
archives is that they're not _really_ archives.  They can be accessed natively 
by Hadoop if you use the {{har://}} scheme.  So if we put the sharelib jars 
into HAR file(s), we could easily add them to the launcher job the same way we 
do today, just changing the path to the corresponding {{har://}} path.  Another 
thing to keep in mind if we do something with actual archives (i.e. zip files) 
is that they have to be extracted when being localized, which may add some 
overhead for larger sharelib dirs (e.g. Spark).
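
For reference, a rough sketch of what creating and reading such a HAR could 
look like (the paths here are illustrative, assuming the sharelib lives under 
{{/user/oozie/share/lib}}; {{hadoop archive}} runs a MapReduce job to build 
the archive):
{noformat}
# Build a HAR from the existing hive sharelib dir (everything under -p is archived)
hadoop archive -archiveName oozie-hive-sharelib.har \
    -p /user/oozie/share/lib/hive /user/oozie/share/lib/hive-har

# The individual jars stay addressable through the har:// scheme,
# so they can be added to the launcher job without extraction
hdfs dfs -ls har:///user/oozie/share/lib/hive-har/oozie-hive-sharelib.har
{noformat}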

However, you are right that doing a HAR file would make it harder to manually 
add extra files to the sharelib.  Given that, I think making a separate HAR for 
each sharelib type makes the most sense.  So we could have:
{noformat}
/oozie/share/lib/hive/oozie-hive-sharelib.har (has all of the Oozie supplied hive jars)
/oozie/share/lib/hive/custom.jar (user manually uploaded this file)
/oozie/share/lib/hive/hive-site.xml (user manually uploaded this file)
{noformat}
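
With that layout, manually adding extra files would work the same way it does 
today; a user would just upload them next to the HAR (paths as in the layout 
above, shown only as a sketch):
{noformat}
hdfs dfs -put custom.jar /oozie/share/lib/hive/
hdfs dfs -put hive-site.xml /oozie/share/lib/hive/
hdfs dfs -ls /oozie/share/lib/hive/
{noformat}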

> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
>                 Key: OOZIE-2821
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2821
>             Project: Oozie
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by 
> Oozie actions. Right now, these jars are uploaded one by one during Oozie 
> ShareLib installation. There can be many hundreds of such jars, and many of 
> them are pretty small, significantly smaller than an HDFS block. Storing a 
> large number of small files in HDFS is inefficient (for example, the 
> NameNode maintains an object in memory for each file, and the blocks 
> containing the small files might be much bigger than the actual files). 
> When an action is executed, these jar files are copied to the distributed 
> cache.
> It would be worth investigating the possibility of using [Hadoop 
> archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html]
>  for handling Oozie ShareLib files, because it could result in better 
> utilisation of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
