Dear Eric, thanks for your answer, but if the problem were a mismatch with the hadoop distribution used, hadoop would have complained in a very specific way, saying "protocol mismatch, ...[..]": been there :). The problem was more subtle, and thanks to the extremely kind Vinod Kumar we figured out (over IRC chat) that the bug TR-111 was causing that NPE. The bug is marked as not important and trivial, but to me it was extremely important :)
Marco

On 28 October 2011 14:47, Eric Fiala <e...@fiala.ca> wrote:
> Marco,
> I'm not familiar with terrier - however, I do notice that the download
> package includes [ hadoop-0.20.2+228-core.jar ] - try changing that out for
> the jar provided in the distribution.
> If that doesn't fix it, look into the other jars provided (or make sure
> the ones from your hadoop distro are being sourced prior to those) - your
> error on pastebin feels a lot like a slight version mismatch.
>
> hth
>
> EF
>
> On Thu, Oct 27, 2011 at 10:43 AM, Marco Didonna <m.didonn...@gmail.com> wrote:
>
>> Hello everybody,
>> I am working on Terrier (www.terrier.org), an IR toolkit that leverages
>> hadoop for indexing large amounts of data (i.e. documents). I am working
>> both locally with a small subset of the whole dataset and on Amazon EC2
>> with the full-size dataset. I am experiencing a weird (at least to me)
>> exception which always occurs at 66% of the map phase. Here's the log
>> http://pastebin.com/XtUkHFYE. I really have no idea where the problem
>> could be.
>> From the original Terrier3.5 I've only modified the inputformat which
>> is used to read the collection of documents: I use a custom
>> sequencefileinputformat in order to process a custom sequence file
>> made up of all the tiny documents of the trec collection (a standard
>> document collection used in IR).
>> I guess the problem is not here since even using the unmodified version of
>> terrier I get the same error. In that case, however, there is no
>> failure, maybe because the authors of terrier use MultiFileCollection.
>>
>> I'd love to hear from somebody since when running the indexing job on
>> the whole dataset the job fails because this error happens more than
>> once. In pseudo mode, after a failure the job is completed ... on the
>> cloud it isn't.
>>
>> Thanks for your time
>>
>> Marco Didonna
>>
>> PS: I use both locally and on the cloud the latest version of the cloudera
>> distribution for hadoop
>>
>
> --
> *Eric Fiala*
> *Fiala Consulting*
> T: 403.828.1117
> E: e...@fiala.ca
> http://www.fiala.ca