Hi, I am a CS undergraduate working with Hadoop. I wrote a library to process
logs; my input directory has the following structure:

logs_hourly
├── dt=2013-02-15
│   ├── ts=1360887451
│   │   └── syslog-2013-02-15-1360887451.gz
│   └── ts=1360891051
│       └── syslog-2013-02-15-1360891051.gz
├── dt=2013-02-14
│   ├── ts=1360801050
│   │   └── syslog-2013-02-14-1360801050.gz
│   └── ts=1360804651
│       └── syslog-2013-02-14-1360804651.gz

Here dt is the date and ts is the Unix timestamp of the hour when the log
was created.

Currently, the code takes an input directory (or a range of input
directories), such as dt=2013-02-15, and loops over every file in every
subdirectory sequentially. This process is slow, and I think that
processing the files in parallel would be more efficient. Is there
any way I could run Hadoop's MapReduce over a directory such as
dt=2013-02-15 and receive the same directory structure as output?
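For concreteness, here is a simplified sketch of the current sequential
pass (the class name and the process() helper are illustrative, not my
exact code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialLogPass {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path day = new Path(args[0]);  // e.g. logs_hourly/dt=2013-02-15

        // Visit each ts=<epoch> subdirectory and read its .gz file,
        // one file at a time.
        for (FileStatus hour : fs.listStatus(day)) {
            for (FileStatus file : fs.listStatus(hour.getPath())) {
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(fs.open(file.getPath()))));
                String line;
                while ((line = in.readLine()) != null) {
                    process(line);  // stand-in for the real per-line logic
                }
                in.close();
            }
        }
    }

    private static void process(String line) {
        // ... existing per-line processing ...
    }
}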

Thanks,
Max Lebedev


