Hi Ramdev,

>>Is there other functionality (experimental or otherwise) within ES that 
can help me do this?

I'd recommend splitting HTML files that are clearly referencing multiple 
diverse news stories into multiple ES documents based on title headings or 
whatever indicates the start/end of each news item.
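
As a rough illustration of that splitting step, here is a minimal Python 
sketch using only the standard library. It assumes that h1/h2/h3 headings 
mark the start of each news item, which may not hold for your pages; the 
names HeadingSplitter and split_news_items are made up for this example.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collects one {"title", "body"} dict per heading-delimited section."""

    # Assumption: these tags mark the start of a news item.
    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            # A heading opens a new news item.
            self.items.append({"title": "", "body": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.items:
            return  # text before the first heading is ignored
        key = "title" if self._in_heading else "body"
        self.items[-1][key] += data


def split_news_items(html):
    """Return a list of dicts, one per news item, with whitespace collapsed."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [
        {
            "title": " ".join(item["title"].split()),
            "body": " ".join(item["body"].split()),
        }
        for item in parser.items
    ]
```

Each returned dict could then be indexed as its own ES document. Note that 
this sketch does not skip script or style content, so real pages would 
likely need extra filtering.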

For boilerplate-removal I have previously used this analyzer on an earlier 
incarnation of the significant_terms algo: 
 https://issues.apache.org/jira/browse/LUCENE-725

Cheers
Mark
