Sebastian!

I thought about that too, since I do sometimes use class variables in some parse
plugins, such as storing the Parse object. However, I assumed the plugins were
already in a thread-safe environment because each FetcherThread instance has
its own instance of ParseUtil.

I'll modify the plugins and see if it helps ;)
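
A minimal sketch of the kind of change I have in mind, in plain Java so it runs without Nutch. All names here (FilterSketch, MetaFilter, UnsafeFilter, SafeFilter, countMixups) are hypothetical stand-ins and not the real HtmlParseFilter interface; the only assumption taken from this thread is that one plugin instance can be called by several threads at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FilterSketch {

    /** Hypothetical stand-in for a parse filter; NOT the real Nutch API. */
    interface MetaFilter {
        String filter(String docId);
    }

    /** Simulates parse work and widens the window in which threads can interleave. */
    static void simulateWork() {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Not thread-safe: per-document state is kept in a field of the shared instance. */
    static class UnsafeFilter implements MetaFilter {
        private String current; // visible to every thread that uses this instance

        public String filter(String docId) {
            current = docId;
            simulateWork();
            return "author-of-" + current; // may now refer to another thread's document
        }
    }

    /** Thread-safe: per-document state lives only in local variables. */
    static class SafeFilter implements MetaFilter {
        public String filter(String docId) {
            String current = docId; // local, so each call has its own copy
            simulateWork();
            return "author-of-" + current;
        }
    }

    /** Runs one shared filter instance from several threads and counts wrong results. */
    static int countMixups(final MetaFilter filter, int docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < docs; i++) {
                final String docId = "doc" + i;
                Callable<Boolean> task =
                        () -> !filter.filter(docId).equals("author-of-" + docId);
                results.add(pool.submit(task));
            }
            int mixups = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    mixups++;
                }
            }
            return mixups;
        } catch (Exception e) {
            throw new IllegalStateException("filter task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The unsafe count varies from run to run; the safe count should stay at 0.
        System.out.println("unsafe mix-ups: " + countMixups(new UnsafeFilter(), 200, 8));
        System.out.println("safe mix-ups:   " + countMixups(new SafeFilter(), 200, 8));
    }
}
```

The unsafe variant is a race, so it will not necessarily fail on every run, which would fit the "one in a few thousand records" pattern. Keeping the Parse object in a local variable (or passing it as a method argument) avoids the shared field entirely.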

Thanks,
Markus 
 
-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Wed 16-Jan-2013 18:38
> To: user@nutch.apache.org
> Subject: Re: Wrong ParseData in segment
> 
> Hi Markus,
> 
> right now I have seen this problem in a small test set of 20 documents:
> - various document types (HTML, PDF, XLS, zip, doc, ods)
> - small and quite large docs (up to 12 MB)
> - local docs via protocol-file
> - fetcher.parse = true
> - Nutch 1.4, local mode
> 
> Somehow metadata from one doc slipped into another doc:
> - extracted by a custom HtmlParseFilter plugin (author, keywords, description)
> - reproducible, though not easily (3-5 trials to get one, rarely two
> wrong meta fields)
> - wrong parsemeta is definitely in the segment
> 
> After adding more and more debug logs the "stupid" answer is:
> the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-):
> the same instance of a plugin may process two documents in parallel.
> I also found this thread (and NUTCH-496):
> http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
> I didn't find any hint in the wiki (e.g. in
> http://wiki.apache.org/nutch/WritingPluginExample),
> but I'll add one.
> 
> Cheers,
> Sebastian
> 
> 
> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> > Hi
> >
> > In our case it is really in the segment, and ends up in the index. Are
> > there any known issues with parse filters? In that filter we do set the
> > Parse object as a class attribute, but we reset it with the new Parse object
> > right after filter() is called.
> >
> > I also don't think the custom Tika ContentHandler is the issue: a
> > new ContentHandler is created for each parse and passed to the
> > TeeContentHandler, just like all other ContentHandlers.
> >
> > I assume an individual parse is completely isolated from another because 
> > all those objects are created new for each record.
> >
> > Does anyone have a clue, however slight? Or any general tips on this, or 
> > how to attempt to reproduce it?
> >
> >
> > Thanks
> >
> > -----Original message-----
> >> From:Sebastian Nagel <wastl.na...@googlemail.com>
> >> Sent: Fri 30-Nov-2012 21:04
> >> To: user@nutch.apache.org
> >> Subject: Re: Wrong ParseData in segment
> >>
> >> Hi Markus,
> >>
> >> sounds somewhat similar to NUTCH-1252 but that was rather trivial
> >> and easy to reproduce.
> >>
> >> Sebastian
> >>
> >> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> >> > Hi,
> >> >
> >> > We've got an issue where one in a few thousand records partially 
> >> > contains another record's ParseMeta data. To be specific, record A ends 
> >> > up with the ParseMeta data of record B that is added by one of our 
> >> > custom parse plugins. I'm unsure as to where the problem really is 
> >> > because the parse plugin receives data from a modified parser plugin 
> >> > that in turn adds a custom Tika ContentHandler.
> >> >
> >> > Because I'm unable to reproduce this, I had to inspect the code for
> >> > places where an object is reused but an attribute is not reset. To me,
> >> > that would be the most obvious problem, but until now I've been
> >> > unsuccessful in finding the issue!
> >> >
> >> > Regardless of how remote the chance is of someone having had some 
> >> > similar issue: does anyone have some ideas to share?
> >> >
> >> > Thanks,
> >> > Markus
> >>
> 
