Hi Sebastian,

Makes sense, I'll be sure to modify the parser plugins. Perhaps it would be 
worth trying to make sure a single thread uses a single instance. I don't know 
why it works the way it does. Judging from the thread you pointed to, it's 
intended behaviour.
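
To illustrate the "single thread, single instance" idea, here is a minimal sketch using a plain Java ThreadLocal. StatefulFilter and PerThreadFilterSketch are hypothetical names made up for this example; this is not how Nutch's plugin repository or ParseUtil hands out plugin instances, only one way such isolation could look.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadFilterSketch {

    /** Hypothetical filter with mutable per-document state (unsafe if shared across threads). */
    static final class StatefulFilter {
        private String currentAuthor; // instance field: the kind of state that can leak

        String filter(String author) {
            currentAuthor = author;
            return "author=" + currentAuthor;
        }
    }

    // One filter instance per thread, created lazily on the thread's first call to get().
    private static final ThreadLocal<StatefulFilter> FILTER =
            ThreadLocal.withInitial(StatefulFilter::new);

    static StatefulFilter filterForCurrentThread() {
        return FILTER.get();
    }

    /** Starts the given number of threads and counts the distinct filter instances they saw. */
    static int distinctInstances(int threads) throws InterruptedException {
        // StatefulFilter does not override equals(), so the set compares by identity.
        Set<StatefulFilter> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                StatefulFilter f = filterForCurrentThread();
                f.filter("someone");
                seen.add(f);
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return seen.size();
    }
}
```

With this arrangement the instance field is still there, but no two threads ever touch the same object.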

On the other hand, reusing parser plugins the way it's done now doesn't make 
too much sense. There's usually not a huge amount of data involved per single 
instance, so conserving heap space doesn't seem a reasonable justification.
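
As long as instances are shared, the plugin-side fix is to keep all per-document state in local variables instead of instance fields. A rough sketch of that pattern follows; SafeAuthorFilter and its filter(String, String) method are hypothetical stand-ins for illustration, not the Nutch HtmlParseFilter interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedFilterSketch {

    /** Hypothetical stand-in for a parse filter; not the Nutch HtmlParseFilter API. */
    static final class SafeAuthorFilter {
        // No per-document instance fields: all state lives on the calling thread's stack.
        String filter(String url, String authorMeta) {
            String author = authorMeta.trim(); // local variable, invisible to other threads
            return url + " author=" + author;
        }
    }

    /** Runs one shared filter from several threads and counts results that got mixed up. */
    static int countMismatches(int threads, int docsPerThread) throws Exception {
        final SafeAuthorFilter shared = new SafeAuthorFilter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            results.add(pool.submit(() -> {
                int bad = 0;
                for (int i = 0; i < docsPerThread; i++) {
                    String url = "doc-" + id + "-" + i;
                    String expected = url + " author=author-" + id;
                    if (!expected.equals(shared.filter(url, " author-" + id + " "))) {
                        bad++;
                    }
                }
                return bad;
            }));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }
}
```

Because filter() only reads its arguments and writes locals, a single instance can serve any number of threads without one document's metadata ending up in another.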

Thanks,
Markus

 
 
-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wed 16-Jan-2013 22:04
> To: user@nutch.apache.org
> Subject: Re: Wrong ParseData in segment
> 
> Hi Markus,
> 
> > However, I assumed the plugins were already in a thread-safe environment 
> > because each
> > FetcherThread instance has its own instance of ParseUtil.
> I had similar assumptions, but the debug output I added to investigate my 
> problem is straightforward
> (the numbers are object hash codes):
> 
> 2013-01-16 17:04:29,386 DEBUG parse.CustomParseFilter (instance=1639291161): 
> parsing file:.../1.xls
> 2013-01-16 17:04:29,452 DEBUG parse.CustomParseFilter (instance=1639291161): 
> parsing file:.../2.doc
> 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: 
> node meta elem = 598132191
> 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: 
> author=Christina Maier
> 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: 
> node meta elem = 598132191
> 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: 
> author=Christina Maier
> 
> The same parse filter instance processes two documents in parallel. The 
> plugin does a lot
> (extracting metadata, pruning content) and the documents are large and take 
> some time to process.
> Via a shared instance variable, references to DOM nodes slipped from one call 
> of filter() to the other.
> 
> Is there a possibility to ensure that every instance of ParseUtil has its 
> own plugin instances?
> It would be worth checking.
> 
> Cheers,
> Sebastian
> 
> 
> On 01/16/2013 06:55 PM, Markus Jelsma wrote:
> > Sebastian!
> > 
> > I thought about that too, since I do sometimes use class variables in some 
> > parse plugins, such as storing the Parse object. However, I assumed the 
> > plugins were already in a thread-safe environment because each 
> > FetcherThread instance has its own instance of ParseUtil. 
> > 
> > I'll modify the plugins and see if it helps ;)
> > 
> > Thanks,
> > Markus 
> >  
> > -----Original message-----
> >> From:Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: Wed 16-Jan-2013 18:38
> >> To: user@nutch.apache.org
> >> Subject: Re: Wrong ParseData in segment
> >>
> >> Hi Markus,
> >>
> >> I have just seen this problem in a small test set of 20 documents:
> >> - various document types (HTML, PDF, XLS, zip, doc, ods)
> >> - small and quite large docs (up to 12 MB)
> >> - local docs via protocol-file
> >> - fetcher.parse = true
> >> - Nutch 1.4, local mode
> >>
> >> Somehow metadata from one doc slipped into another doc:
> >> - extracted by a custom HtmlParseFilter plugin (author, keywords, 
> >> description)
> >> - reproducible, though not easily (3-5 trials to get one, rarely two
> >> wrong meta fields)
> >> - wrong parsemeta is definitely in the segment
> >>
> >> After adding more and more debug logs the "stupid" answer is:
> >> the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-):
> >> the same instance of a plugin may process two documents in parallel.
> >> I also found this thread (and NUTCH-496):
> >>   
> >> http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
> >> I didn't find any hint in the wiki (e.g. in
> >> http://wiki.apache.org/nutch/WritingPluginExample),
> >> but I'll add one.
> >>
> >> Cheers,
> >> Sebastian
> >>
> >>
> >> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> >>> Hi
> >>>
> >>> In our case it is really in the segment, and ends up in the index. Are 
> >>> there any known issues with parse filters? In that filter we do set the 
> >>> Parse object as a class attribute, but we reset it with the new Parse 
> >>> object right after filter() is called.
> >>>
> >>> I also don't think the custom Tika ContentHandler is the issue: a new 
> >>> ContentHandler is created for each parse and passed to the 
> >>> TeeContentHandler, just like all other ContentHandlers.
> >>>
> >>> I assume an individual parse is completely isolated from another because 
> >>> all those objects are created new for each record.
> >>>
> >>> Does anyone have a clue, however slight? Or any general tips on this, or 
> >>> how to attempt to reproduce it?
> >>>
> >>>
> >>> Thanks
> >>>
> >>> -----Original message-----
> >>>> From:Sebastian Nagel <wastl.na...@googlemail.com>
> >>>> Sent: Fri 30-Nov-2012 21:04
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: Wrong ParseData in segment
> >>>>
> >>>> Hi Markus,
> >>>>
> >>>> sounds somewhat similar to NUTCH-1252 but that was rather trivial
> >>>> and easy to reproduce.
> >>>>
> >>>> Sebastian
> >>>>
> >>>> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> >>>>> Hi,
> >>>>>
> >>>>> We've got an issue where one in a few thousand records partially 
> >>>>> contains another record's ParseMeta data. To be specific, record A ends 
> >>>>> up with the ParseMeta data of record B that is added by one of our 
> >>>>> custom parse plugins. I'm unsure as to where the problem really is 
> >>>>> because the parse plugin receives data from a modified parser plugin 
> >>>>> that in turn adds a custom Tika ContentHandler.
> >>>>>
> >>>>> Because I'm unable to reproduce this, I had to inspect the code for 
> >>>>> places where an object is reused but an attribute is not reset. To me, 
> >>>>> that would be the most obvious problem, but until now I've been 
> >>>>> unsuccessful in finding the issue!
> >>>>>
> >>>>> Regardless of how remote the chance is of someone having had some 
> >>>>> similar issue: does anyone have some ideas to share?
> >>>>>
> >>>>> Thanks,
> >>>>> Markus
> >>>>
> >>
> 
> 
