Hi! 

I am trying to run the WordCount example over the entire Wikipedia dump. The
English Wikipedia dump is around 200 GB, which I have stored in HDFS on a
cluster I have access to.
The problem: the dump consists of HTML files spread across a very large,
deeply nested directory structure, but (as far as I understand)
FileInputFormat expects all the files to be processed to sit in a single
directory.
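For context, the kind of job setup I have in mind looks roughly like the
sketch below (old `mapred` API; the paths, job name, and mapper/reducer
classes are made up). Input paths can contain glob patterns, but each glob
only matches one fixed level of nesting, which is why a deep directory tree
is awkward:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WikiWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WikiWordCount.class);
        conf.setJobName("wiki-wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // (Mapper/Reducer classes omitted here for brevity.)

        // A glob expands to matching paths at one fixed depth, so
        // every level of the dump's tree would need its own pattern
        // (paths here are hypothetical):
        FileInputFormat.addInputPath(conf, new Path("/wikipedia/*/*.html"));
        FileInputFormat.addInputPath(conf, new Path("/wikipedia/*/*/*.html"));

        FileOutputFormat.setOutputPath(conf, new Path("/wikipedia-counts"));

        JobClient.runJob(conf);
    }
}
```

Enumerating one glob per directory depth works only if I know the maximum
depth in advance, which is what I am hoping to avoid.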

Does anyone have an idea, or know of something that already exists, for
running WordCount on the HTML files in these directories without changing
the directory structure?

Akhil
-- 
View this message in context: 
http://www.nabble.com/Processing-files-lying-in-a-directory-structure-tp23875340p23875340.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
