Here's how I solved the problem using a custom InputFormat... the key
part is in listStatus(), where we traverse the directory tree. Since
HDFS doesn't have links, this code is probably safe; but on a
filesystem with cycles the traversal would get trapped in a loop.
Ian
import java.io.IOException;
import ...
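Ian's code is cut off in the archive just after the imports. A minimal
sketch of the approach he describes, a FileInputFormat subclass whose
listStatus() walks the directory tree depth-first, might look like the
following. It is written against the newer org.apache.hadoop.mapreduce
API, and the class and method names are illustrative, not Ian's
originals:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecursiveTextInputFormat extends TextInputFormat {

  // Replace the flat, one-level listing with a depth-first walk.
  @Override
  protected List<FileStatus> listStatus(JobContext job) throws IOException {
    List<FileStatus> files = new ArrayList<FileStatus>();
    for (FileStatus status : super.listStatus(job)) {
      FileSystem fs = status.getPath().getFileSystem(job.getConfiguration());
      addRecursively(fs, status, files);
    }
    return files;
  }

  // Directories are expanded, plain files collected. Safe on HDFS, which
  // has no links; a filesystem with cycles would trap this in a loop.
  private void addRecursively(FileSystem fs, FileStatus status,
      List<FileStatus> files) throws IOException {
    if (status.isDir()) {
      for (FileStatus child : fs.listStatus(status.getPath())) {
        addRecursively(fs, child, files);
      }
    } else {
      files.add(status);
    }
  }
}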
OK, thanks for the pointer.
If I wind up rolling our own code to handle this I'll make sure to
contribute it.
DR
Aaron Kimball wrote:
There is no technical limit that prevents Hadoop from operating in this
fashion; it's simply the case that the included InputFormat implementations
do not do so.
As per a previous list question
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e)
it looks as though it's not possible for Hadoop to traverse input
directories recursively in order to discover input files.
There is no technical limit that prevents Hadoop from operating in this
fashion; it's simply the case that the included InputFormat implementations
do not do so. This behavior has been set in this fashion for a long time, so
it's unlikely that it will change soon, as that might break existing
applications that rely on it.
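To make the distinction concrete, here is a hypothetical job setup (the
path and job name are invented for illustration) showing what the stock,
non-recursive listing implies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class StockBehavior {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "example");
    // Only files directly under /input become job inputs; nested files
    // such as /input/sub/part.log are not discovered by the stock
    // listing (in some versions a subdirectory even fails the job).
    FileInputFormat.addInputPath(job, new Path("/input"));
  }
}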
Hey Aaron,
I had a similar problem. I have log files arranged in the following
fashion:
/logs/hostname/date.log
I want to analyze a range of dates for all hosts. What I did was
write into my driver class a subroutine that descends through the HDFS
file system starting at /logs and adds each log file in the requested
date range as an input path.
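A minimal sketch of that kind of driver-side walk, under stated
assumptions (the addLogs name, the lexicographic date filter, and
ISO-style file names such as 2008-04-01.log are mine, not the poster's):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LogInputs {

  // Descend from /logs/hostname/date.log and register every log whose
  // date falls inside [start, end] as an input path for the job.
  static void addLogs(FileSystem fs, Path dir, String start, String end,
      Job job) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDir()) {
        addLogs(fs, status.getPath(), start, end, job); // per-host dir
      } else {
        // ISO-style names (e.g. 2008-04-01.log) sort lexicographically,
        // so plain string comparison doubles as a date-range check.
        String name = status.getPath().getName();
        if (name.compareTo(start + ".log") >= 0
            && name.compareTo(end + ".log") <= 0) {
          FileInputFormat.addInputPath(job, status.getPath());
        }
      }
    }
  }
}

The driver would call it once before submitting, e.g.
addLogs(fs, new Path("/logs"), "2008-04-01", "2008-04-07", job).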