> > IndexingFilters and ParseFilters work on individual documents and cannot
> > return a collection, which may be a good idea.

It can be done in the parsing: getParse() returns a ParseResult object,
which is a map<Text, Parse>. This is used e.g. by FeedParser to generate
X docs from a single piece of content. Michael - you could have a custom
HTMLParseFilter that gets the sentences from the text using an NLP library
(e.g. OpenNLP) and generates one doc per sentence. You'd need to find a
way to filter out the original document so that it does not get indexed.

HTH

Julien

On 31 October 2011 16:29, Markus Jelsma <[email protected]> wrote:
> Hi,
>
> IndexingFilters and ParseFilters work on individual documents and cannot
> return a collection, which may be a good idea.
>
> Assuming you have some sentence extractor at hand you could hack
> SolrWriter.java.
>
> Cheers,
>
> On Monday 31 October 2011 17:22:16 Michael Camilleri wrote:
> > Hi all,
> >
> > Is it possible to get Nutch to split the crawl results into sentences
> > so that each document contains only one sentence rather than a web
> > page? I need this so that when I use Solr to index the crawl db it
> > takes in a sentence at a time - the final result I want is to get a
> > list of sentences that match a query instead of a list of web pages
> > when doing a search.
> >
> > Thanks,
> > Michael
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
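[Editor's note] The sentence-extraction step Julien describes could be sketched as below. This is a minimal, self-contained illustration using the JDK's java.text.BreakIterator rather than OpenNLP (whose SentenceDetectorME would be the more robust choice in a real plugin); the class name SentenceSplitter is hypothetical, and wiring it into an HTMLParseFilter that adds one entry per sentence to the ParseResult map is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split extracted page text into sentences. A Nutch HTMLParseFilter
    // could call this on the parse text and add one Parse per sentence
    // to the ParseResult, keyed by a synthetic URL (e.g. url#sent-N).
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Nutch crawled this page. Each sentence becomes "
                + "its own document. Solr indexes them separately.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

As Markus notes, you would still need to keep the original whole-page document out of the index, e.g. by filtering it in an IndexingFilter.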

