[ https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934924#comment-15934924 ]
Robert Kanter commented on OOZIE-2821:
--------------------------------------

That's a good idea, using the local filesystem as the input. We could use this instead of uploading directly. I'm not sure I follow the part about a zip, though. The thing about HAR archives is that they're not _really_ archives: they can be accessed natively by Hadoop via the {{har://}} scheme. So if we put the sharelib jars into HAR file(s), we could easily add them to the launcher job the same way we do today, just changing the path to the corresponding {{har://}} path. Another thing to keep in mind if we do something with actual archives (i.e. zip files) is that they have to be extracted when being localized, which may add some overhead for larger sharelib dirs (e.g. Spark). However, you are right that a HAR file would make it harder to manually add extra files to the sharelib. Given that, I think making a separate HAR for each sharelib type makes the most sense. So we could have:
{noformat}
/oozie/share/lib/hive/oozie-hive-sharelib.har  (has all of the Oozie-supplied Hive jars)
/oozie/share/lib/hive/custom.jar               (user manually uploaded this file)
/oozie/share/lib/hive/hive-site.xml            (user manually uploaded this file)
{noformat}

> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
>                 Key: OOZIE-2821
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2821
>             Project: Oozie
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by Oozie actions. Right now, these jars are uploaded one by one during Oozie ShareLib installation. There can be hundreds of such jars, and many of them are quite small, significantly smaller than an HDFS block.
> Storing a large number of small files in HDFS is inefficient (for example because the NameNode maintains an in-memory object for each file, and the blocks containing the small files can be much bigger than the actual files). When an action is executed, these jar files are copied to the distributed cache.
> It would be worth investigating the possibility of using [Hadoop archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html] for handling Oozie ShareLib files, because it could result in better utilisation of HDFS.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
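The per-sharelib-type layout discussed in the comment above could be produced with the standard {{hadoop archive}} tool. A minimal sketch follows; the staging path is hypothetical, and the target sharelib root depends on the cluster's Oozie configuration. Note that {{hadoop archive}} launches a MapReduce job, so these commands assume a running cluster:

```shell
# Sketch only: build one HAR per sharelib type, matching the layout above.
# /tmp/oozie-sharelib-staging/hive is an assumed local/HDFS staging directory.

# Create the Hive sharelib HAR; -p gives the parent path of the source files,
# and the final argument is the destination directory in HDFS.
hadoop archive -archiveName oozie-hive-sharelib.har \
    -p /tmp/oozie-sharelib-staging/hive \
    /oozie/share/lib/hive

# The archive can then be read natively through the har:// scheme, so
# individual jars inside it are addressable without extracting anything:
hdfs dfs -ls har:///oozie/share/lib/hive/oozie-hive-sharelib.har
```

Because the {{har://}} path resolves to individual files, manually uploaded extras like {{custom.jar}} can still sit next to the HAR in the plain {{/oozie/share/lib/hive/}} directory, as the comment proposes.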