Hello everybody,
I am working on Terrier (www.terrier.org), an IR toolkit that uses
Hadoop to index large amounts of data (i.e. documents). I am working
both locally, with a small subset of the dataset, and on Amazon EC2
with the full dataset. I am seeing a weird (at least to me) exception
that always occurs at 66% of the map phase. The log is at
http://pastebin.com/XtUkHFYE and I really have no idea where the
problem could be.
From the original Terrier 3.5 I have only modified the input format
used to read the document collection: I use a custom
SequenceFileInputFormat to process a custom sequence file made up of
all the tiny documents of the TREC collection (a standard document
collection used in IR).
I guess the problem is not there, since I get the same error even with
an unmodified version of Terrier. In that case, however, the job does
not fail, maybe because the authors of Terrier use MultiFileCollection.

I'd love to hear from somebody, since when I run the indexing job on
the whole dataset the job fails because this error happens more than
once. In pseudo-distributed mode the job still completes after a
failure; on the cloud it doesn't.

Thanks for your time

Marco Didonna

PS: I use the latest version of Cloudera's Distribution for Hadoop
both locally and on the cloud.
