Do you use different load functions for different files?
You can check out the AllLoader load function, which was added to piggybank
in the 0.9 release -
http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/piggybank/storage/AllLoader.html
It can dynamically associate load functions with file extensions; maybe
it can do that for file names as well.
It is also capable of getting information from the
directory names, which might be useful in your case.
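For the 10-minute batching itself, one option that does not depend on AllLoader is to build the paths in a small driver script and hand them to Pig with the -param command-line option, since LOAD accepts Hadoop glob patterns such as {00,01,02}. Below is a rough Python sketch of that idea, not a tested setup. The function names, the $input/$output parameter names, the abc.pig script name and the kfs_csv output root are all made up for illustration. It also assumes month and minute are zero-padded and day and hour are not, as in your kfs/2011/05/1/9/01 example.

```python
def batch_paths(root, year, month, day, hour, start_minute, name, size=10):
    """Return (input_glob, output_dir) for one batch of `size` minutes.

    Assumed layout (from the example paths): root/yyyy/mm/d/h/mm/name.xml,
    with month and minute zero-padded, day and hour unpadded.
    """
    # Minute directories covered by this batch, e.g. "00,01,...,09".
    minutes = ",".join("%02d" % m for m in range(start_minute, start_minute + size))
    hour_dir = "%d/%02d/%d/%d" % (year, month, day, hour)
    # Hadoop glob: {a,b,c} matches any one of the listed directory names.
    input_glob = "%s/%s/{%s}/%s.xml" % (root, hour_dir, minutes, name)
    # Hypothetical output layout: one directory per batch and file type.
    output_dir = "%s_csv/%s/%02d/%s" % (root, hour_dir, start_minute, name)
    return input_glob, output_dir


def pig_command(script, input_glob, output_dir):
    """Build the argument list for one Pig run (script name is hypothetical)."""
    return ["pig",
            "-param", "input=" + input_glob,
            "-param", "output=" + output_dir,
            script]


if __name__ == "__main__":
    # Print the six commands for hour 9 of 2011/05/1 and the abc.xml files.
    for start in range(0, 60, 10):
        inp, out = batch_paths("kfs", 2011, 5, 1, 9, start, "abc")
        print(" ".join(pig_command("abc.pig", inp, out)))
```

The Pig script would then use LOAD '$input' and STORE ... INTO '$output', and each command list could be run with subprocess.call from the same driver, once per XML type.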
Thanks,
Thejas
On 8/10/11 4:06 PM, SRINIVAS SURASANI wrote:
Hi,
We are getting data into HDFS in the following layout:
yyyy/mm/dd/hh/mm/abc.xml
/bcd.xml
..
.. 4 more
Each XML file has a different schema.
e.g.: kfs/2011/05/1/9/01/abc.xml
/bcd.xml .... 4more
kfs/2011/05/1/9/02/abc.xml
/bcd.xml .. 4more
For each minute we get 6 different kinds of XML data files, for a total of 5 hours. I
have written 6 Pig scripts to process these XML files (to convert them into CSV format).
Processing one minute of data is straightforward (I just have to specify one
input path and one output path in the Pig scripts).
We are looking to process, say, 10 minutes of data as a batch, then the next
10 minutes, and so on until the last 10 minutes of the day. I was wondering how to
specify the input path and output path for the Pig script dynamically for each
batch (each 10 minutes) of data. Manual parameter substitution will work, but I am
looking for a method in which the input and output paths are changed
dynamically.
Any help is greatly appreciated.
Thanks,
Srinivas