Re: Access to file name?

Daniel Dai Tue, 23 Nov 2010 18:22:48 -0800

Sure.

Thanks


Dmitriy Ryaboy wrote:

Daniel,
Can you drop this on the wiki?

-D

On Tue, Nov 23, 2010 at 10:27 AM, Daniel Dai <jiany...@yahoo-inc.com> wrote:

I remember we did something similar before. FileSplit.getPath() returns the
path of the file being read, which includes the file name.

Here is some sample code:

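import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
  // Path of the file behind the split currently being read.
  private Path path = null;

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) {
      super.prepareToRead(reader, split);
      // Assumes a file-based input format, so the wrapped split is a FileSplit.
      path = ((FileSplit)split.getWrappedSplit()).getPath();
  }

  @Override
  public Tuple getNext() throws IOException {
      Tuple myTuple = super.getNext();
      // Append the input path as an extra trailing field; null means end of input.
      if (myTuple != null) {
          myTuple.append(path.toString());
      }
      return myTuple;
  }
}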
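
Since the file names in the original question encode the creation date
(e.g. 20101031.dat), the path string appended above could then be reduced to
just the date part. A minimal sketch, assuming names of the form
<8 digits>.<extension>; the class FileNameDate and its method fromPath are
hypothetical helpers for illustration and are not part of Pig or Hadoop:

```java
// Hypothetical helper (not part of Pig): extracts the yyyyMMdd stem
// from a path whose last component looks like "20101031.dat".
final class FileNameDate {

    private FileNameDate() {}

    static String fromPath(String path) {
        // Keep only the last path component, e.g. "20101031.dat".
        String name = path.substring(path.lastIndexOf('/') + 1);

        // Drop the extension, if there is one.
        int dot = name.indexOf('.');
        String stem = (dot >= 0) ? name.substring(0, dot) : name;

        // Assumption: the stem is exactly eight digits (yyyyMMdd).
        if (!stem.matches("\\d{8}")) {
            throw new IllegalArgumentException(
                "File name does not start with an 8-digit date: " + name);
        }
        return stem;
    }
}
```

In getNext(), one could then append FileNameDate.fromPath(path.toString())
instead of the full path, so each tuple carries only the date string.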


Does it solve your problem?

Daniel


Sangchul Song wrote:

Hi all,

Our dataset consists of multiple files. The name of each file reflects
the creation date of the file. (e.g. 20101031.dat, 20101101.dat, etc)
We need this date information for all relations inside the file, but
there is no date field.

We first considered the possibility of accessing the file name through
a UDF that implements LoadFunc, but it doesn't appear to be possible.
In particular, 'location' in setLocation(String location, PigSplit
split) only gives the original glob expression used in LOAD (such as
'/test/data/*.dat'), and 'reader' in prepareToRead(RecordReader
reader, PigSplit split) doesn't expose a method for file name access.

Before we individually add the date field to every single file (which
we want to leave as the last resort, considering the number of files
we deal with), we were wondering if there's any way to access the file
name within a pig script (including UDFs) especially when you load
multiple files at the same time. Any help would be greatly
appreciated.

FYI, we are on Pig 0.7.0 running on top of Hadoop 0.20.2

Thanks,

Sang
