RE: Wrong ParseData in segment

2012-11-30 Thread Markus Jelsma
Hi, in our case it really is in the segment, and it ends up in the index. Are there any known issues with parse filters? In that filter we do set the Parse object as a class attribute, but we reset it with the new Parse object right after filter() is called. I also cannot think of the custom Tika Con
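The pattern described above (per-record data held in a class attribute of the filter) can be sketched in plain Java. This is only an illustration under assumptions that the thread does not confirm: that a single filter instance is called from more than one thread, and that this has anything to do with the reported mix-up. The names SharedStateFilter, LocalStateFilter and FilterDemo are hypothetical and are not Nutch classes; the real filter() signature is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical contrast: per-call data kept in an instance field.
// If one instance is called from several threads, one call can overwrite
// this field while another call is still reading it.
class SharedStateFilter {
    private String current;                 // shared across all calls

    String filter(String url, String content) {
        this.current = content;             // may be overwritten by another thread
        return url + " -> " + this.current; // may read another record's data
    }
}

// Hypothetical alternative: no mutable instance fields. Each call works only
// on its own arguments and local variables.
class LocalStateFilter {
    String filter(String url, String content) {
        String current = content;           // per-call state stays local
        return url + " -> " + current;
    }
}

class FilterDemo {
    // Calls one LocalStateFilter instance from a small thread pool and
    // returns the results in submission order.
    static List<String> runConcurrently(int records) {
        LocalStateFilter filter = new LocalStateFilter();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < records; i++) {
                final int id = i;
                futures.add(pool.submit(() -> filter.filter("url" + id, "meta" + id)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the local-variable version, every string returned by FilterDemo.runConcurrently is built only from the arguments of its own call. Whether the actual plugin is shared between threads in this way would have to be checked against the plugin and the parser configuration.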

Re: Wrong ParseData in segment

2012-11-30 Thread Sebastian Nagel
Hi Markus,

sounds somewhat similar to NUTCH-1252 but that was rather trivial and easy to reproduce.

Sebastian

2012/11/30 Markus Jelsma:
> Hi,
>
> We've got an issue where one in a few thousand records partially contains
> another record's ParseMeta data. To be specific, record A ends up with t

Re: Access crawled content or parsed data of previous crawled url

2012-11-30 Thread Jorge Luis Betancourt Gonzalez
- Original message -
From: "Markus Jelsma"
To: user@nutch.apache.org
Sent: Friday, 30 November 2012 11:28:23
Subject: RE: Access crawled content or parsed data of previous crawled url

-Original message-
> From: Jorge Luis Betancourt Gonzalez
> Sent: Fri 30-Nov-2012 17:2

Wrong ParseData in segment

2012-11-30 Thread Markus Jelsma
Hi, We've got an issue where one in a few thousand records partially contains another record's ParseMeta data. To be specific, record A ends up with the ParseMeta data of record B that is added by one of our custom parse plugins. I'm unsure as to where the problem really is because the parse pl

RE: Access crawled content or parsed data of previous crawled url

2012-11-30 Thread Markus Jelsma
-Original message-
> From: Jorge Luis Betancourt Gonzalez
> Sent: Fri 30-Nov-2012 17:22
> To: user@nutch.apache.org
> Subject: Re: Access crawled content or parsed data of previous crawled url
>
> I was thinking in this a lot since yesterday and I realize that I really
> don't

Re: Access crawled content or parsed data of previous crawled url

2012-11-30 Thread Jorge Luis Betancourt Gonzalez
- Original message -
From: "Markus Jelsma"
To: user@nutch.apache.org
Sent: Thursday, 29 November 2012 17:13:32
Subject: RE: Access crawled content or parsed data of previous crawled url

-Original message-
> From: Jorge Luis Betancourt Gonzalez
> Sent: Thu 29-Nov-2012 22:51

Re: Fetch content inside nutch parse

2012-11-30 Thread Jorge Luis Betancourt Gonzalez
Thanks Markus, this helps a lot!

- Original message -
From: "Markus Jelsma"
To: user@nutch.apache.org
Sent: Friday, 30 November 2012 10:50:55
Subject: RE: Fetch content inside nutch parse

See how the indexchecker fetches URLs: http://svn.apache.org/viewvc/nutch/trunk/src/java/org

RE: Fetch content inside nutch parse

2012-11-30 Thread Markus Jelsma
See how the indexchecker fetches URLs:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java?view=markup

-Original message-
> From: Jorge Luis Betancourt Gonzalez
> Sent: Fri 30-Nov-2012 16:46
> To: user@nutch.apache.org
> Subject: Fetch