We have a dataset of ~8 million files, about 0.5 to 2 MB each, and we're having trouble getting them analysed after building a har file.
The files are already in a pre-existing directory structure: two nested levels of directories, with 20-100 PDFs in each leaf directory.

    user->hadoop->/all_the_files/*/*/*.pdf

It was trivial to move these to HDFS and to build a har archive. I used the following command to make the archive:

    bin/hadoop archive -archiveName test.har -p /user/hadoop/ all_the_files/*/*/ /user/hadoop/

Listing the contents of the har shows everything as I'd expect:

    bin/hadoop fs -lsr har:///user/hadoop/epc_test.har

When we come to run the Hadoop job with this command, trying to wildcard the archive:

    bin/hadoop jar My.jar har:///user/hadoop/test.har/all_the_files/*/*/ output

it fails with the following exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string

Running the job with the non-archived files is fine, i.e.:

    bin/hadoop jar My.jar all_the_files/*/*/ output

However, this only works for our modest test set of files. Any substantial number of files quickly makes the namenode run out of memory.

Can you use file globs with har archives? Is there a different way to build the archive to include just the files, which I've missed?

I appreciate that a sequence file might be a better fit for this task, but I'd like to know the solution to this issue if there is one.
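
For a rough sense of the namenode memory pressure mentioned above, here is a back-of-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of namenode heap per file or block object, and that each file occupies a single block (plausible, since 0.5 to 2 MB is well below a typical HDFS block size). Neither figure comes from a measurement, directories are ignored, and the function name is purely illustrative.

```shell
#!/bin/sh
# Rough namenode heap estimate for many small, un-archived files.
# ASSUMPTIONS (not measured): ~150 bytes of heap per namenode object,
# and two objects per file (one file entry plus one block).
estimate_namenode_heap_mib() {
    files=$1
    objects_per_file=2
    bytes_per_object=150
    heap_bytes=$((files * objects_per_file * bytes_per_object))
    # Integer division, so the result is rounded down to whole MiB.
    echo $((heap_bytes / 1024 / 1024))
}

# ~8 million files, as in the dataset described above.
estimate_namenode_heap_mib 8000000
```

Under those assumptions the full dataset needs a little over 2 GiB of namenode heap for file and block metadata alone, which is consistent with the namenode struggling once we go beyond the test set.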