Hi Markus,

I have just seen this problem in a small test set of 20 documents:
- various document types (HTML, PDF, XLS, zip, doc, ods)
- small and quite large docs (up to 12 MB)
- local docs via protocol-file
- fetcher.parse = true
- Nutch 1.4, local mode

Somehow metadata from one doc slipped into another doc:
- extracted by a custom HtmlParseFilter plugin (author, keywords, description)
- reproducible, though not easily (3-5 trials to get one, rarely two
wrong meta fields)
- wrong parsemeta is definitely in the segment

After adding more and more debug logging, the "stupid" answer is:
the custom plugin was not 100% thread-safe. What wasn't clear to me ;-):
the same instance of a plugin may process two documents in parallel.
I also found this thread (and NUTCH-496):
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
I didn't find any hint in the wiki (e.g. in
http://wiki.apache.org/nutch/WritingPluginExample),
but I'll add one.
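
To illustrate the pitfall, here is a minimal sketch in plain Java without any
Nutch classes. The names MetaFilter, UnsafeFilter and SafeFilter are made up
for this example and are not the real HtmlParseFilter API; the only assumption
carried over is that one plugin instance is shared by several threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParseFilterThreadSafetyDemo {

    /** Hypothetical stand-in for a parse filter; NOT the Nutch interface. */
    interface MetaFilter {
        String filter(String doc);
    }

    /** Not thread-safe: per-document state is kept in an instance field. */
    static class UnsafeFilter implements MetaFilter {
        private String author; // shared by every thread using this instance

        public String filter(String doc) {
            author = "author-of-" + doc; // thread A writes its value ...
            Thread.yield();              // ... thread B may overwrite it here ...
            return author;               // ... so A may return B's value
        }
    }

    /** Thread-safe: per-document state is kept in local variables only. */
    static class SafeFilter implements MetaFilter {
        public String filter(String doc) {
            String author = "author-of-" + doc;
            Thread.yield();
            return author;
        }
    }

    /** Runs ONE shared filter instance on many docs in parallel and counts wrong results. */
    static int countMismatches(MetaFilter filter, int threads, int docs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < docs; i++) {
                String doc = "doc" + i;
                Callable<Boolean> task =
                    () -> !filter.filter(doc).equals("author-of-" + doc);
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (Future<Boolean> wrong : results) {
                if (wrong.get()) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The unsafe count varies from run to run and may well be zero.
        System.out.println("unsafe: " + countMismatches(new UnsafeFilter(), 8, 20000));
        System.out.println("safe:   " + countMismatches(new SafeFilter(), 8, 20000));
    }
}
```

The unsafe variant fails only now and then, which matches the "3-5 trials"
it took to reproduce the problem. Keeping all per-document state in local
variables (or passing it as method arguments) avoids the race without any
locking.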

Cheers,
Sebastian


2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> Hi
>
> In our case it is really in the segment, and ends up in the index. Are there 
> any known issues with parse filters? In that filter we do set the Parse 
> object as class attribute but we reset it with the new Parse object right 
> after filter() is called.
>
> I also cannot imagine the custom Tika ContentHandler being the issue: a new 
> ContentHandler is created for each parse and passed to the TeeContentHandler, 
> just like all other ContentHandlers.
>
> I assume an individual parse is completely isolated from another because all 
> those objects are created new for each record.
>
> Does anyone have a clue, however slight? Or any general tips on this, or how 
> to attempt to reproduce it?
>
>
> Thanks
>
> -----Original message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: Fri 30-Nov-2012 21:04
>> To: user@nutch.apache.org
>> Subject: Re: Wrong ParseData in segment
>>
>> Hi Markus,
>>
>> sounds somewhat similar to NUTCH-1252 but that was rather trivial
>> and easy to reproduce.
>>
>> Sebastian
>>
>> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
>> > Hi,
>> >
>> > We've got an issue where one in a few thousand records partially contains 
>> > another record's ParseMeta data. To be specific, record A ends up with the 
>> > ParseMeta data of record B that is added by one of our custom parse 
>> > plugins. I'm unsure as to where the problem really is because the parse 
>> > plugin receives data from a modified parser plugin that in turn adds a 
>> > custom Tika ContentHandler.
>> >
>> > Because I'm unable to reproduce this I had to inspect the code for places 
>> > where an object is reused but an attribute is not reset. To me, that would 
>> > be the most obvious problem, but until now I've been unsuccessful in 
>> > finding the issue!
>> >
>> > Regardless of how remote the chance is of someone having had some similar 
>> > issue: does anyone have some ideas to share?
>> >
>> > Thanks,
>> > Markus
>>
