Hey Aaron,
I had a similar problem. I have log files arranged in the following
fashion:
/logs/<hostname>/<date>.log
I want to analyze a range of dates for all hosts, so I wrote a
subroutine into my driver class that descends through the HDFS
filesystem starting at /logs, builds a list of input files, and
feeds that list to the framework.
Example code below.
Brian
FileSystem fs = FileSystem.get(conf);
// Matches names like "...datanode-<host>.log.<yyyy-mm-dd>"
Pattern fileNamePattern =
    Pattern.compile(".*datanode-(.*).log.([0-9]+-[0-9]+-[0-9]+)");

// Walk /logs/<hostname>/, adding every log file whose date falls
// within [startDate, endDate) as a job input path.
for (FileStatus status : fs.listStatus(base)) {
    Path pathname = status.getPath();  // one subdirectory per host
    for (FileStatus logfile : fs.listStatus(pathname)) {
        Path logFilePath = logfile.getPath();
        Matcher m = fileNamePattern.matcher(logFilePath.getName());
        if (m.matches()) {
            String dateString = m.group(2);
            Date logDate = df.parse(dateString);
            if ((logDate.equals(startDate) || logDate.after(startDate))
                    && logDate.before(endDate)) {
                FileInputFormat.addInputPath(conf, logFilePath);
            } else {
                //System.out.println("Ignoring file: " + logFilePath.getName());
                //System.out.println("Start Date: " + startDate
                //    + ", End Date: " + endDate + ", Log date: " + logDate);
            }
        } else {
            System.out.println("Ignoring file: " + logFilePath.getName());
        }
    }
}
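(The snippet assumes a few things set up earlier in the driver: the
JobConf, the base path, the date format, and the date bounds. Roughly
something like this, though the class name and date format below are
just placeholders:)

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Placeholder setup; adjust the names and the date format to taste.
JobConf conf = new JobConf(LogAnalyzer.class);       // hypothetical driver class
Path base = new Path("/logs");                       // root of the per-host log tree
DateFormat df = new SimpleDateFormat("yyyy-MM-dd");  // must match the file-name dates
Date startDate = df.parse("2009-05-01");             // inclusive (parse throws ParseException)
Date endDate = df.parse("2009-06-01");               // exclusive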
On Jun 2, 2009, at 6:22 PM, Aaron Kimball wrote:
There is no technical limit that prevents Hadoop from operating in
this fashion; it's simply that the included InputFormat
implementations do not do so. This behavior has been in place for a
long time, so it's unlikely to change soon, as changing it might
break existing applications.
But you can write your own subclass of TextInputFormat or
SequenceFileInputFormat that overrides the getSplits() method to
recursively
descend through directories and search for files.
- Aaron
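A rough sketch of the subclass Aaron describes, written against the
old org.apache.hadoop.mapred API (the class name
RecursiveTextInputFormat and the collectFiles() helper are
illustrative, not part of Hadoop); rather than recomputing splits
itself, it expands the configured input paths into a flat list of
files and then defers to the stock getSplits():

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class RecursiveTextInputFormat extends TextInputFormat {
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits)
            throws IOException {
        // Expand each configured input path into the full list of
        // files beneath it, recursing into subdirectories.
        List<Path> files = new ArrayList<Path>();
        for (Path dir : FileInputFormat.getInputPaths(job)) {
            collectFiles(dir.getFileSystem(job), dir, files);
        }
        // Replace the input paths with the flattened file list, then
        // let the stock FileInputFormat logic compute the splits.
        FileInputFormat.setInputPaths(job,
            files.toArray(new Path[files.size()]));
        return super.getSplits(job, numSplits);
    }

    private static void collectFiles(FileSystem fs, Path path,
            List<Path> out) throws IOException {
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isDir()) {
                collectFiles(fs, status.getPath(), out);
            } else {
                out.add(status.getPath());
            }
        }
    }
}

To use it, register it on the job with
conf.setInputFormat(RecursiveTextInputFormat.class).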
On Tue, Jun 2, 2009 at 1:22 PM, David Rosenstrauch
<dar...@darose.net> wrote:
As per a previous list question (
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e)
it looks as though it's not possible for Hadoop to traverse input
directories recursively in order to discover input files.
Just wondering a) if there's any particular reason why this
functionality
doesn't exist, and b) if not, if there's any workaround/hack to
make it
possible.
Like the OP, I was thinking it would be helpful to partition my
input data by year, month, and day. I figured this would enable me
to run jobs against specific date ranges of input data, and thereby
speed up the execution of my jobs since they wouldn't have to
process every single record.
Any way to make this happen? (Or am I totally going about this the
wrong
way for what I'm trying to achieve?)
TIA,
DR