Hi! I am working on applying the WordCount example to the entire Wikipedia dump. The entire English Wikipedia is around 200 GB, which I have stored in HDFS on a cluster to which I have access. The problem: the Wikipedia dump contains many directories (a very deep directory structure) of HTML files, but FileInputFormat expects the files to be processed to sit in a single directory.
Can anybody give an idea, or point to something that already exists, for applying WordCount to these HTML files in the directories without changing the directory structure?

Akhil

--
View this message in context: http://www.nabble.com/Processing-files-lying-in-a-directory-structure-tp23875340p23875340.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
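[A common workaround, for what it's worth: FileInputFormat.addInputPath accepts individual files as well as directories, so one option is to walk the tree yourself and add every file as an input path. The sketch below shows only the recursive enumeration step, using java.nio on the local filesystem as a stand-in for recursing with the HDFS FileSystem API; the class name and helper are illustrative, not part of Hadoop.]

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class RecursiveInputLister {

    // Recursively collect all regular files under root. In a real job the
    // same walk would be done with Hadoop's FileSystem.listStatus(), and
    // each resulting path passed to FileInputFormat.addInputPath(job, path).
    static List<Path> collectFiles(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny nested tree to demonstrate the walk.
        Path root = Files.createTempDirectory("wikidump");
        Files.createDirectories(root.resolve("A/AB"));
        Files.createFile(root.resolve("A/page1.html"));
        Files.createFile(root.resolve("A/AB/page2.html"));

        // Every file found, regardless of depth, would become one input path.
        for (Path p : collectFiles(root)) {
            System.out.println(p);
        }
    }
}
```

[The same idea applies unchanged on HDFS: recurse with FileSystem.listStatus, test FileStatus for directories, and add leaf files one by one, so the directory structure never has to be flattened.]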