Sebastian! I thought about that too, since I do sometimes use class variables in some parse plugins, such as storing the Parse object. However, I assumed the plugins were already in a thread-safe environment because each FetcherThread instance has its own instance of ParseUtil.
I'll modify the plugins and see if it helps ;)

Thanks,
Markus

-----Original message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wed 16-Jan-2013 18:38
> To: user@nutch.apache.org
> Subject: Re: Wrong ParseData in segment
>
> Hi Markus,
>
> right now I have seen this problem in a small test set of 20 documents:
> - various document types (HTML, PDF, XLS, zip, doc, ods)
> - small and quite large docs (up to 12 MB)
> - local docs via protocol-file
> - fetcher.parse = true
> - Nutch 1.4, local mode
>
> Somehow metadata from one doc slipped into another doc:
> - extracted by a custom HtmlParseFilter plugin (author, keywords, description)
> - reproducible, though not easily (3-5 trials to get one, rarely two
>   wrong meta fields)
> - wrong parsemeta is definitely in the segment
>
> After adding more and more debug logs the "stupid" answer is:
> the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-):
> the same instance of a plugin may process two documents in parallel.
> I also found this thread (and NUTCH-496):
> http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
> I didn't find any hint in the wiki (e.g. in
> http://wiki.apache.org/nutch/WritingPluginExample),
> but I'll add one.
>
> Cheers,
> Sebastian
>
>
> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> > Hi
> >
> > In our case it is really in the segment, and ends up in the index. Are
> > there any known issues with parse filters? In that filter we do set the
> > Parse object as class attribute but we reset it with the new Parse object
> > right after filter() is called.
> >
> > I also cannot think of the custom Tika ContentHandler being the issue: a
> > new ContentHandler is created for each parse and passed to the
> > TeeContentHandler, just like all other ContentHandlers.
> >
> > I assume an individual parse is completely isolated from another because
> > all those objects are created new for each record.
> >
> > Does anyone have a clue, however slight? Or any general tips on this, or
> > how to attempt to reproduce it?
> >
> >
> > Thanks
> >
> > -----Original message-----
> >> From: Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: Fri 30-Nov-2012 21:04
> >> To: user@nutch.apache.org
> >> Subject: Re: Wrong ParseData in segment
> >>
> >> Hi Markus,
> >>
> >> sounds somewhat similar to NUTCH-1252 but that was rather trivial
> >> and easy to reproduce.
> >>
> >> Sebastian
> >>
> >> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> >> > Hi,
> >> >
> >> > We've got an issue where one in a few thousand records partially
> >> > contains another record's ParseMeta data. To be specific, record A ends
> >> > up with the ParseMeta data of record B that is added by one of our
> >> > custom parse plugins. I'm unsure as to where the problem really is
> >> > because the parse plugin receives data from a modified parser plugin
> >> > that in turn adds a custom Tika ContentHandler.
> >> >
> >> > Because I'm unable to reproduce this I had to inspect the code for
> >> > places where an object is reused but an attribute is not reset. To me,
> >> > that would be the most obvious problem, but until now I've been
> >> > unsuccessful in finding the issue!
> >> >
> >> > Regardless of how remote the chance is of someone having had some
> >> > similar issue: does anyone have some ideas to share?
> >> >
> >> > Thanks,
> >> > Markus
> >> >
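
For anyone finding this thread later, the failure mode described above (one plugin instance processing two documents in parallel while keeping per-document state in a class attribute) can be sketched roughly as follows. This is only an illustration under assumptions: MetaFilter, FieldStateFilter and LocalStateFilter are hypothetical stand-ins and not the Nutch HtmlParseFilter API, and the CyclicBarrier exists purely to force the unlucky interleaving every time instead of once in a few thousand records.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;

class SharedFilterDemo {

    /** Hypothetical stand-in for a parse filter; NOT the real Nutch interface. */
    interface MetaFilter {
        String filter(String url, String author) throws Exception;
    }

    /** Unsafe: per-document state is stored in a field of the shared instance. */
    static class FieldStateFilter implements MetaFilter {
        private final CyclicBarrier pause;      // only here to force the interleaving
        private volatile String currentAuthor;  // shared by every thread using this instance

        FieldStateFilter(CyclicBarrier pause) { this.pause = pause; }

        public String filter(String url, String author) throws Exception {
            currentAuthor = author;  // "reset" the attribute for this document
            pause.await();           // wait until the other thread has also written it
            return url + " author=" + currentAuthor;  // may now belong to the other doc
        }
    }

    /** Safe: per-document state stays in parameters and local variables. */
    static class LocalStateFilter implements MetaFilter {
        private final CyclicBarrier pause;

        LocalStateFilter(CyclicBarrier pause) { this.pause = pause; }

        public String filter(String url, String author) throws Exception {
            String localAuthor = author;  // lives on this thread's stack only
            pause.await();                // same interleaving as the unsafe variant
            return url + " author=" + localAuthor;
        }
    }

    /** Runs two documents through ONE filter instance on two threads. */
    static Map<String, String> runTwoDocs(boolean safe) {
        CyclicBarrier barrier = new CyclicBarrier(2);
        MetaFilter shared = safe ? new LocalStateFilter(barrier)
                                 : new FieldStateFilter(barrier);
        Map<String, String> out = new ConcurrentHashMap<>();
        Thread a = worker(shared, out, "docA", "Alice");
        Thread b = worker(shared, out, "docB", "Bob");
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return out;
    }

    /** Builds a thread that filters one document and records the result by URL. */
    static Thread worker(MetaFilter f, Map<String, String> out, String url, String author) {
        return new Thread(() -> {
            try {
                out.put(url, f.filter(url, author));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("field state: " + runTwoDocs(false));
        System.out.println("local state: " + runTwoDocs(true));
    }
}
```

With the field-based variant both documents end up reporting the same author (which one depends on which thread wrote last), so one record carries the other record's metadata. With the local-variable variant each document keeps its own author. Resetting the attribute at the start of filter() does not help, because the other thread can overwrite it before it is read again.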