Hi Markus,

I have just seen this problem in a small test set of 20 documents:
- various document types (HTML, PDF, XLS, zip, doc, ods)
- small and quite large docs (up to 12 MB)
- local docs via protocol-file
- fetcher.parse = true
- Nutch 1.4, local mode

Somehow metadata from one doc slipped into another doc:
- extracted by a custom HtmlParseFilter plugin (author, keywords, description)
- reproducible, though not easily (3-5 trials to get one, rarely two wrong meta fields)
- wrong parsemeta is definitely in the segment

After adding more and more debug logs the "stupid" answer is: the custom plugin
was not 100% thread-safe. Yes, it wasn't clear to me ;-): the same instance of a
plugin may process two documents in parallel.

I also found this thread (and NUTCH-496):
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html

I didn't find any hint in the wiki
(e.g. in http://wiki.apache.org/nutch/WritingPluginExample), but I'll add one.

Cheers,
Sebastian

2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> Hi
>
> In our case it is really in the segment, and ends up in the index. Are there
> any known issues with parse filters? In that filter we do set the Parse
> object as class attribute but we reset it with the new Parse object right
> after filter() is called.
>
> I also cannot think of the custom Tika ContentHandler to be the issue, a new
> ContentHandler is created for each parse and passed to the TeeContentHandler,
> just all other ContentHandlers.
>
> I assume an individual parse is completely isolated from another because all
> those objects are created new for each record.
>
> Does anyone have a clue, however slight? Or any general tips on this, or how
> to attempt to reproduce it?
>
>
> Thanks
>
> -----Original message-----
>> From:Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: Fri 30-Nov-2012 21:04
>> To: user@nutch.apache.org
>> Subject: Re: Wrong ParseData in segment
>>
>> Hi Markus,
>>
>> sounds somewhat similar to NUTCH-1252 but that was rather trivial
>> and easy to reproduce.
>>
>> Sebastian
>>
>> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
>> > Hi,
>> >
>> > We've got an issue where one in a few thousand records partially contains
>> > another record's ParseMeta data. To be specific, record A ends up with the
>> > ParseMeta data of record B that is added by one of our custom parse
>> > plugins. I'm unsure as to where the problem really is because the parse
>> > plugin receives data from a modified parser plugin that in turn adds a
>> > custom Tika ContentHandler.
>> >
>> > Because i'm unable to reproduce this i had to inspect the code for places
>> > where an object is reused but an attribute is not reset. To me, that would
>> > be the most obvious problem, but until now i've been unsuccessful in
>> > finding the issue!
>> >
>> > Regardless of how remote the chance is of someone having had some similar
>> > issue: does anyone have some ideas to share?
>> >
>> > Thanks,
>> > Markus
>>
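
To illustrate the thread-safety problem described at the top of this message
(one plugin instance processing two documents in parallel), here is a minimal
sketch in plain Java (8 or later). It does not use the Nutch API and it is not
the plugin discussed in this thread: MetaFilter, UnsafeFilter, SafeFilter and
countMismatches are hypothetical names made up for the example. It only
contrasts per-document state kept in an instance field with state kept in
local variables.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FilterThreadSafetySketch {

    /** Hypothetical stand-in for a parse filter; NOT the Nutch HtmlParseFilter interface. */
    interface MetaFilter {
        String filter(String docId);
    }

    /** Unsafe: per-document state lives in a field shared by all calling threads. */
    static class UnsafeFilter implements MetaFilter {
        private String currentDoc; // may be overwritten by another thread mid-call

        public String filter(String docId) {
            currentDoc = docId;
            Thread.yield(); // widen the race window for the demonstration
            return "author-of-" + currentDoc;
        }
    }

    /** Safe: per-document state lives only in local variables. */
    static class SafeFilter implements MetaFilter {
        public String filter(String docId) {
            String doc = docId;
            Thread.yield();
            return "author-of-" + doc;
        }
    }

    /** Runs one shared filter instance from several threads and counts wrong results. */
    static int countMismatches(MetaFilter filter, int threads, int docsPerThread)
            throws InterruptedException {
        AtomicInteger mismatches = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int threadNo = t;
            pool.submit(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    String docId = "doc-" + threadNo + "-" + i;
                    if (!filter.filter(docId).equals("author-of-" + docId)) {
                        mismatches.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return mismatches.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // The unsafe count is timing-dependent: typically nonzero, but not guaranteed.
        System.out.println("unsafe mismatches: "
                + countMismatches(new UnsafeFilter(), 4, 100_000));
        System.out.println("safe mismatches:   "
                + countMismatches(new SafeFilter(), 4, 100_000));
    }
}
```

With a single thread both variants behave the same, which would fit the
observation that such a bug is hard to reproduce. In a real plugin the
equivalent change would be to keep everything derived from the current
document in local variables or method parameters rather than in fields.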