Re: Subdirectory question revisited

2009-06-04 Thread Ian Soboroff
Here's how I solved the problem using a custom InputFormat... the key part is in listStatus(), where we traverse the directory tree. Since HDFS doesn't have links, this code is probably safe, but if you have a filesystem with cycles you will get trapped.
Ian
import java.io.IOException; import
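Ian's code is cut off in this archive view, so the following is not his implementation. It is a minimal sketch of the same idea (descend the directory tree, collect files, and avoid getting trapped by cycles) written against java.nio.file on a local filesystem so that it runs without a cluster. The class name RecursiveLister and its methods are hypothetical; in a real custom InputFormat the same logic would sit in listStatus() and use Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: recursive listing with a cycle guard.
final class RecursiveLister {

    /** Returns every regular file under root, descending into subdirectories. */
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        Set<Path> visited = new HashSet<>();
        walk(root, files, visited);
        return files;
    }

    private static void walk(Path dir, List<Path> files, Set<Path> visited)
            throws IOException {
        // toRealPath() resolves symbolic links, so a link that points back up
        // the tree maps to a directory that has already been recorded.
        Path real = dir.toRealPath();
        if (!visited.add(real)) {
            return; // already traversed: skip it instead of looping forever
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    walk(entry, files, visited);
                } else if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
    }
}
```

The visited set is what addresses the cycle concern: each directory is keyed by its resolved path, so it is traversed at most once even when a link leads back to it.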

Re: Subdirectory question revisited

2009-06-03 Thread David Rosenstrauch
OK, thanks for the pointer. If I wind up rolling our own code to handle this I'll make sure to contribute it.
DR
Aaron Kimball wrote:
There is no technical limit that prevents Hadoop from operating in this fashion; it's simply the case that the included InputFormat implementations do not do

Subdirectory question revisited

2009-06-02 Thread David Rosenstrauch
As per a previous list question (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e), it looks as though it's not possible for Hadoop to traverse input directories recursively in order to discover input files.

Re: Subdirectory question revisited

2009-06-02 Thread Aaron Kimball
There is no technical limit that prevents Hadoop from operating in this fashion; it's simply the case that the included InputFormat implementations do not do so. This behavior has been set in this fashion for a long time, so it's unlikely that it will change soon, as that might break existing

Re: Subdirectory question revisited

2009-06-02 Thread Brian Bockelman
Hey Aaron, I had a similar problem. I have log files arranged in the following fashion: /logs/hostname/date.log. I want to analyze a range of dates for all hosts. What I did was write into my driver class a subroutine that descends through the HDFS file system starting at /logs and
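Brian's message is truncated here, so what his subroutine does after descending from /logs is not shown. As a rough illustration of the selection step such a driver might perform, the sketch below filters already-discovered paths of the form /logs/hostname/date.log down to a date range. The class name LogPathSelector is hypothetical, and the ISO yyyy-MM-dd file name format is an assumption; the thread does not say how the dates are written.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks dated log files within an inclusive date range.
final class LogPathSelector {

    /**
     * Keeps paths shaped like /logs/<hostname>/<date>.log whose date lies in
     * [from, to]. Assumes <date> is ISO yyyy-MM-dd, e.g. 2009-06-02.log.
     */
    static List<String> select(List<String> paths, LocalDate from, LocalDate to) {
        List<String> selected = new ArrayList<>();
        for (String path : paths) {
            // "/logs/host/date.log" splits into ["", "logs", "host", "date.log"].
            String[] parts = path.split("/");
            if (parts.length != 4 || !parts[0].isEmpty()
                    || !parts[1].equals("logs") || !parts[3].endsWith(".log")) {
                continue;
            }
            String stem = parts[3].substring(0, parts[3].length() - ".log".length());
            try {
                LocalDate date = LocalDate.parse(stem);
                if (!date.isBefore(from) && !date.isAfter(to)) {
                    selected.add(path);
                }
            } catch (DateTimeParseException e) {
                // File name is not a date in the assumed format; skip it.
            }
        }
        return selected;
    }
}
```

The selected paths would then be handed to the job as its input paths, which sidesteps the need for the InputFormat itself to recurse.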
