I remember we did something similar before. FileSplit.getPath() does give you the file name.

Here is some sample code:

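import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

// PigStorage variant that appends the source file path to every tuple.
public class PigStorageWithInputPath extends PigStorage {
    private Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // Called once per split; the wrapped FileSplit knows which file it covers.
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        if (myTuple != null) {
            myTuple.append(path.toString());
        }
        return myTuple;
    }
}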
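
The loader above appends the full path. If you only want the date, you could trim the path down before appending it. Here is a rough sketch, assuming every file is named like your 20101031.dat example (eight digits followed by ".dat"). FileNameDate and dateFromPath are names I made up for illustration; they are not part of Pig or Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): extracts the date from a path string.
class FileNameDate {
    // Assumption: the file name is exactly eight digits followed by ".dat".
    private static final Pattern DATE_FILE =
            Pattern.compile("(?:^|/)(\\d{8})\\.dat$");

    // Returns the eight-digit date, or null when the name does not match.
    static String dateFromPath(String path) {
        Matcher m = DATE_FILE.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints 20101031
        System.out.println(dateFromPath("hdfs://namenode/test/data/20101031.dat"));
    }
}
```

In getNext() you would then append FileNameDate.dateFromPath(path.toString()) instead of path.toString(). Returning null for a non-matching name is just one choice; you may prefer to throw so that an unexpected file does not slip through unnoticed.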


Does it solve your problem?

Daniel

Sangchul Song wrote:
Hi all,

Our dataset consists of multiple files. The name of each file reflects
the creation date of the file. (e.g. 20101031.dat, 20101101.dat, etc)
We need this date information for all relations inside the file, but
there is no date field.

We first considered the possibility of accessing the file name through
a UDF that implements LoadFunc, but it doesn't appear to be possible.
In particular, 'location' in setLocation(String location, PigSplit
split) only gives the original glob expression used in LOAD (such as
'/test/data/*.dat'), and 'reader' in prepareToRead(RecordReader
reader, PigSplit split) doesn't expose a method for file name access.

Before we individually add the date field to every single file (which
we want to leave as the last resort, considering the number of files
we deal with), we were wondering if there's any way to access the file
name within a pig script (including UDFs) especially when you load
multiple files at the same time. Any help would be greatly
appreciated.

FYI, we are on Pig 0.7.0 running on top of Hadoop 0.20.2

Thanks,

Sang
