Hi everyone,
I have written a custom parser, and since my files are small I am using a
sequence file for efficiency. Each file in the sequence file holds info about
one user. I parse that file and would like to get a bag of tuples for every
user/file. In my Parser class I have implemented the exec function, which is
called once for each file/user. I then gather the info and package it as
tuples. Each user generates multiple tuples, since the file is quite rich and
complex. Is it correct to assume that the relation AU will contain one bag
per user?
When I execute the following script, I get the following error. Any help with
this would be great!
ERROR 1031: Incompatable field schema: declared is
"bag_0:bag{:tuple(id:int,class:chararray,name:chararray,begin:int,end:int,probone:chararray,probtwo:chararray)}",
infered is ":Unknown"
Java UDF code snippet
// Copies every parsed item into the member bag as a 7-field tuple.
private void PopulateBag() throws IOException
{
    for (MyItems item : items)
    {
        Tuple output = TupleFactory.getInstance().newTuple(7);
        output.set(0, item.getId());
        output.set(1, item.getClass());
        output.set(2, item.getName());
        output.set(3, item.Begin());
        output.set(4, item.End());
        output.set(5, item.Probabilityone());
        output.set(6, item.Probtwo());
        m_defaultDataBag.add(output);
    }
}

// Called once per input record; the first field is the file to parse.
public DefaultDataBag exec(Tuple input) throws IOException {
    try
    {
        this.ParseFile((String) input.get(0));
        this.PopulateBag();
        return m_defaultDataBag;
    } catch (Exception e) {
        System.err.println("Failed to process the input\n");
        return null;
    }
}
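
To make the one-bag-per-user question concrete, here is a minimal plain-Java
sketch of the shape I expect AU to have. It does not use any Pig classes:
lists stand in for bags and Object[] for tuples, and the names
BagPerUserSketch, exec and foreachGenerate, as well as the toy
"name,class;name,class" input format, are made up for illustration. It also
creates a fresh bag on every call, on the assumption that a bag kept in a
member field (like m_defaultDataBag above) would otherwise carry tuples over
from one user to the next unless it is cleared somewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the Pig data model; no Pig classes are used here.
class BagPerUserSketch {

    // Hypothetical stand-in for Parser.exec(): one call per input record (one user/file).
    // A fresh list is created on every call so tuples from earlier users cannot leak in.
    static List<Object[]> exec(String fileContents) {
        List<Object[]> bag = new ArrayList<>();
        int id = 0;
        // Assumed toy format: items separated by ';', fields separated by ','.
        for (String item : fileContents.split(";")) {
            String[] fields = item.split(",");
            bag.add(new Object[] { id++, fields[0], fields[1] });
        }
        return bag;
    }

    // Stand-in for FOREACH A GENERATE Parser(...): one output bag per input record.
    static List<List<Object[]>> foreachGenerate(List<String> records) {
        List<List<Object[]>> relation = new ArrayList<>();
        for (String record : records) {
            relation.add(exec(record));
        }
        return relation;
    }

    public static void main(String[] args) {
        // Two records, i.e. two users: the first has three items, the second has one.
        List<List<Object[]>> au = foreachGenerate(Arrays.asList("A,x;B,y;C,z", "D,w"));
        for (List<Object[]> bag : au) {
            System.out.println("bag with " + bag.size() + " tuples");
        }
    }
}
```

With the two sample records this prints "bag with 3 tuples" and then
"bag with 1 tuples", i.e. one bag per user, which is the layout I am hoping
AU will have.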
Pig Script
REGISTER /users/p529444/software/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
REGISTER /users/p529444/software/pig-0.11.1/parser.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
A = LOAD '/scratch/file.seq' USING SequenceFileLoader AS (key: chararray,
value: chararray);
DESCRIBE A;
STORE A into '/scratch/A';
AU = FOREACH A GENERATE parser.Parser(key) AS {(id: int, class: chararray,
name: chararray, begin: int, end: int, probone: chararray, probtwo: chararray)};