> > IndexingFilters and ParseFilters work on individual documents and cannot
> > return a collection, which may be a good idea.

It can be done in the parsing: getParse() returns a ParseResult object,
which is a map<Text, Parse>. This is used e.g. by FeedParser to generate
X docs from a single piece of content. Michael - you could have a custom
HTMLParseFilter that gets the sentences from the text using an NLP library
(e.g. OpenNLP) and generates one doc per sentence. You'd need to find a
way to filter out the original document so that it does not get indexed.

HTH

Julien

On 31 October 2011 16:29, Markus Jelsma <[email protected]> wrote:
> Hi,
>
> IndexingFilters and ParseFilters work on individual documents and cannot
> return a collection, which may be a good idea.
>
> Assuming you have some sentence extractor at hand you could hack
> SolrWriter.java.
>
> Cheers,
>
> On Monday 31 October 2011 17:22:16 Michael Camilleri wrote:
> > Hi all,
> >
> > Is it possible to get Nutch to split the crawl results into sentences
> > so that each document contains only one sentence rather than a web
> > page? I need this so that when I use Solr to index the crawl db it
> > takes in a sentence at a time - the final result I want is to get a
> > list of sentences that match a query instead of a list of web pages
> > when doing a search.
> >
> > Thanks,
> > Michael
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
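[Editor's note] The sentence-extraction step Julien describes could be sketched as below. This is a minimal, self-contained illustration using the JDK's java.text.BreakIterator rather than OpenNLP (whose SentenceDetectorME would be the more robust choice in a real plugin); the class name SentenceSplitter is hypothetical, and wiring it into an HTMLParseFilter that adds one entry per sentence to the ParseResult map is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split extracted page text into sentences. A Nutch HTMLParseFilter
    // could call this on the parse text and add one Parse per sentence
    // to the ParseResult, keyed by a synthetic URL (e.g. url#sent-N).
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Nutch crawled this page. Each sentence becomes "
                + "its own document. Solr indexes them separately.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

As Markus notes, you would still need to keep the original whole-page document out of the index, e.g. by filtering it in an IndexingFilter.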

