Hi, I am a CS undergraduate working with Hadoop. I wrote a library to process logs; my input directory has the following structure:
logs_hourly
├── dt=2013-02-15
│   ├── ts=1360887451
│   │   └── syslog-2013-02-15-1360887451.gz
│   └── ts=1360891051
│       └── syslog-2013-02-15-1360891051.gz
└── dt=2013-02-14
    ├── ts=1360801050
    │   └── syslog-2013-02-14-1360801050.gz
    └── ts=1360804651
        └── syslog-2013-02-14-1360804651.gz

Here dt is the day and ts is the hour (a Unix timestamp) at which the log was created. Currently, the code takes an input directory (or a range of input directories) such as dt=2013-02-15 and loops over every file in every subdirectory sequentially. This is slow, and I think processing the files in parallel would be more efficient. Is there any way to run Hadoop's MapReduce on a directory such as dt=2013-02-15 and get the same directory structure back as output?

Thanks,
Max Lebedev

--
View this message in context: http://hadoop.6.n7.nabble.com/Running-hadoop-on-directory-structure-tp67904.html
Sent from the common-user mailing list archive at Nabble.com.
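[Editor's note: for illustration only, here is a minimal sketch (in Python, independent of Hadoop) of the input-to-output path mapping such a job would have to preserve: each output file keeps the same dt=.../ts=... relative path as its input. The names mirrored_output_path and logs_out are hypothetical, not from the original post; in Hadoop itself this kind of structured output is typically achieved with the MultipleOutputs class, whose write() call accepts a base output path containing subdirectories.]

```python
import os

def mirrored_output_path(input_path, input_root, output_root):
    """Map an input file under input_root to the same relative
    dt=.../ts=... path under output_root (hypothetical helper)."""
    rel = os.path.relpath(input_path, input_root)  # e.g. dt=.../ts=.../file.gz
    return os.path.join(output_root, rel)

# Example using a path from the directory listing above:
print(mirrored_output_path(
    "logs_hourly/dt=2013-02-15/ts=1360887451/syslog-2013-02-15-1360887451.gz",
    "logs_hourly",
    "logs_out"))
```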