Hello everybody,

I am working on Terrier (www.terrier.org), an IR toolkit that leverages Hadoop for indexing large amounts of data (i.e. documents). I am working both locally, with a small subset of the dataset, and on Amazon EC2, with the full-size dataset. I am experiencing a weird (at least to me) exception which always occurs at 66% of the map phase. Here's the log: http://pastebin.com/XtUkHFYE. I really have no idea where the problem could be.

From the original Terrier 3.5 I have only modified the input format used to read the document collection: I use a custom SequenceFileInputFormat in order to process a custom sequence file made up of all the tiny documents of the TREC collection (a standard document collection used in IR). I guess the problem is not there, since I get the same error even with the unmodified version of Terrier. In that case, however, there is no failure, maybe because the authors of Terrier use MultiFileCollection.
I'd love to hear from somebody, since when I run the indexing job on the whole dataset the job fails because this error happens more than once. In pseudo-distributed mode the job still completes after a failure; on the cloud it does not.

Thanks for your time,
Marco Didonna

PS: Both locally and on the cloud I use the latest version of the Cloudera distribution for Hadoop.