Hi

In our case the wrong data really is in the segment, and it ends up in the 
index. Are there any known issues with parse filters? In that filter we do 
store the Parse object as a class attribute, but we overwrite it with the new 
Parse object right after filter() is called.
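
To make the pattern concrete, here is a minimal sketch of it. All names
(FakeParse, FieldFilter, LocalFilter) are hypothetical stand-ins and not the
Nutch API or our actual filter. It assumes the same filter instance could be
called from more than one thread, which is not confirmed for this setup.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedFieldSketch {

    /** Hypothetical stand-in for a parse result; NOT the Nutch Parse class. */
    static final class FakeParse {
        final String url;
        final Map<String, String> meta = new HashMap<>();

        FakeParse(String url) {
            this.url = url;
        }
    }

    /** Risky pattern: the record being processed lives in an instance field. */
    static final class FieldFilter {
        private FakeParse current;

        FakeParse filter(FakeParse parse, String extracted) {
            current = parse;
            // If another thread calls filter() on this same instance right
            // here, 'current' may already point at that thread's record, and
            // the value below would be written into the wrong record.
            current.meta.put("custom", extracted);
            return parse;
        }
    }

    /** Safer pattern: the record is only reached through the parameter. */
    static final class LocalFilter {
        FakeParse filter(FakeParse parse, String extracted) {
            parse.meta.put("custom", extracted);
            return parse;
        }
    }

    public static void main(String[] args) {
        LocalFilter filter = new LocalFilter();
        FakeParse a = filter.filter(new FakeParse("http://example.org/a"), "value-a");
        FakeParse b = filter.filter(new FakeParse("http://example.org/b"), "value-b");
        System.out.println(a.url + " -> " + a.meta.get("custom"));
        System.out.println(b.url + " -> " + b.meta.get("custom"));
    }
}
```

Used from a single thread both variants behave the same. The field-based one
only misbehaves if two calls on one instance overlap, which would also explain
why such a problem shows up rarely.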

I also cannot see how the custom Tika ContentHandler could be the issue: a new 
ContentHandler is created for each parse and passed to the TeeContentHandler, 
just like all other ContentHandlers.

I assume an individual parse is completely isolated from another because all 
those objects are created new for each record.

Does anyone have a clue, however slight? Or any general tips on this, or how to 
attempt to reproduce it?
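
One possible way to attempt a reproduction, sketched under assumptions: drive
a single shared filter instance from several threads and count the records
whose metadata does not match their own id. The classes below (FakeRecord,
Filter, FieldFilter, LocalFilter) are hypothetical and do not use the Nutch
API; the real filter would have to be plugged in behind the Filter interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CrossRecordStress {

    /** Hypothetical record; NOT a Nutch class. */
    static final class FakeRecord {
        final String id;
        final Map<String, String> meta = new ConcurrentHashMap<>();

        FakeRecord(String id) {
            this.id = id;
        }
    }

    interface Filter {
        void filter(FakeRecord record);
    }

    /** Keeps the current record in a field, so overlapping calls can clash. */
    static final class FieldFilter implements Filter {
        // volatile so the field is really re-read after the assignment
        private volatile FakeRecord current;

        public void filter(FakeRecord record) {
            current = record;
            current.meta.put("owner", record.id);
        }
    }

    /** Keeps no state between calls. */
    static final class LocalFilter implements Filter {
        public void filter(FakeRecord record) {
            record.meta.put("owner", record.id);
        }
    }

    /** Runs 'shared' from several threads; returns how many records are wrong. */
    static int countMismatches(Filter shared, int threads, int perThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<FakeRecord>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int threadId = t;
                Callable<List<FakeRecord>> task = () -> {
                    List<FakeRecord> done = new ArrayList<>();
                    for (int i = 0; i < perThread; i++) {
                        FakeRecord record = new FakeRecord(threadId + "-" + i);
                        shared.filter(record);
                        done.add(record);
                    }
                    return done;
                };
                futures.add(pool.submit(task));
            }
            // Collect everything first, so no thread is still writing
            // while the records are being checked.
            List<FakeRecord> all = new ArrayList<>();
            for (Future<List<FakeRecord>> future : futures) {
                all.addAll(future.get());
            }
            int mismatches = 0;
            for (FakeRecord record : all) {
                if (!record.id.equals(record.meta.get("owner"))) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("field-based: " + countMismatches(new FieldFilter(), 8, 100000));
        System.out.println("stateless:   " + countMismatches(new LocalFilter(), 8, 100000));
    }
}
```

The stateless variant should always report zero mismatches. The field-based
count depends on timing and may well be zero on a given run, so a clean run
does not prove the absence of a race.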


Thanks 
 
-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Fri 30-Nov-2012 21:04
> To: user@nutch.apache.org
> Subject: Re: Wrong ParseData in segment
> 
> Hi Markus,
> 
> sounds somewhat similar to NUTCH-1252 but that was rather trivial
> and easy to reproduce.
> 
> Sebastian
> 
> 2012/11/30 Markus Jelsma <markus.jel...@openindex.io>:
> > Hi,
> >
> > We've got an issue where one in a few thousand records partially contains 
> > another record's ParseMeta data. To be specific, record A ends up with the 
> > ParseMeta data of record B that is added by one of our custom parse 
> > plugins. I'm unsure as to where the problem really is because the parse 
> > plugin receives data from a modified parser plugin that in turn adds a 
> > custom Tika ContentHandler.
> >
> > Because i'm unable to reproduce this i had to inspect the code for places 
> > where an object is reused but an attribute is not reset. To me, that would 
> > be the most obvious problem, but until now i've been unsuccessful in 
> > finding the issue!
> >
> > Regardless of how remote the chance is of someone having had some similar 
> > issue: does anyone have some ideas to share?
> >
> > Thanks,
> > Markus
> 
