I asked a similar question before. Please see this thread: http://mail-archives.apache.org/mod_mbox/pig-user/201103.mbox/%[email protected]%3E
Shawn

On Tue, May 31, 2011 at 11:08 AM, Jonathan Coveney <[email protected]> wrote:
> Context: I have a bunch of files living in HDFS, and I think my jobs are
> failing on one of them. I want to log which files the job is failing on.
>
> I thought I could write my own LoadFunc that followed the same methodology
> as PigStorage, but caught exceptions and logged the file it was given.
> This isn't working, however. I tried returning loadLocation, but that is
> the globbed input, not the input to the mapper. I also tried reading
> mapreduce.map.input.file and map.input.file from the Job passed to
> setLocation, but both were null. I think this is where some of my
> ignorance of Pig's internals comes into play, as I'm not sure when files
> are deglobbed and the splits are actually read. I tried using
> getLocations() from the PigSplit passed to prepareToRead, but that was
> just the glob as well.
>
> My next thought would be to make a RecordReader that reports the file
> associated with its splits (as I assume it must know the specific files
> it is processing), but I thought I'd ask whether there is a cleaner way
> before doing that.
>
> Thanks!
> Jon
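For anyone landing on this thread later: one way to recover the per-mapper file is to unwrap the Hadoop split inside prepareToRead rather than inspecting loadLocation or getLocations(), which both reflect the original glob. This is a minimal sketch, not a tested implementation: it assumes the wrapped split is a FileSplit (true for PigStorage-style file-based input), and the class name and error message are illustrative.

```java
// Sketch: a PigStorage subclass that remembers which file its split
// came from and names that file when a read fails.
// Assumes PigSplit wraps a FileSplit (file-based input formats).
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class FileTrackingStorage extends PigStorage {
    private String currentFile = "(unknown)";

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        super.prepareToRead(reader, split);
        // getLocations() only reports the glob; the concrete file lives
        // on the wrapped Hadoop split.
        InputSplit wrapped = split.getWrappedSplit();
        if (wrapped instanceof FileSplit) {
            currentFile = ((FileSplit) wrapped).getPath().toString();
        }
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            return super.getNext();
        } catch (Exception e) {
            // Surface the offending file in the task's failure message.
            throw new IOException("Failed while reading " + currentFile, e);
        }
    }
}
```

This sidesteps the map.input.file / mapreduce.map.input.file config keys entirely, which are only populated in the task-side configuration at read time, never in the Job seen by setLocation on the frontend.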
