I’ve done something similar, not with iframes but with other custom 
elements, and the same logic applies. Implement a custom HtmlParseFilter and an 
IndexingFilter; that way you control how the data gets indexed. The parse 
filter sees the parsed DOM, so it can pick out the iframes and store their URLs 
in the parse metadata, and the indexing filter can then copy them into a field. 
You’re on the right track, but rather than overriding parse-html, I would 
implement a new plugin just for your logic.
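
Here is a rough sketch of the DOM-walking part, in case it helps. It uses only 
the standard org.w3c.dom API, so it is not a Nutch plugin by itself; the class 
name YoutubeIframeExtractor and the URL patterns it matches are my own 
assumptions, not anything from Nutch. The idea would be to call extract() on 
the DocumentFragment that your HtmlParseFilter receives, add each URL to the 
parse metadata under a key of your choosing, and have the IndexingFilter read 
that key and add it to the NutchDocument. Check the 1.8 javadoc for the exact 
filter signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: finds YouTube embeds in a parsed HTML DOM. */
public class YoutubeIframeExtractor {

    /** Returns the src of every iframe under root that looks like a YouTube embed. */
    public static List<String> extract(Node root) {
        List<String> urls = new ArrayList<String>();
        collect(root, urls);
        return urls;
    }

    // Depth-first walk. Element names are compared case-insensitively
    // because HTML parsers may report them in upper case (IFRAME).
    private static void collect(Node node, List<String> urls) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns "" when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (isYoutubeEmbed(src)) {
                urls.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), urls);
        }
    }

    // Assumption: embeds use the /embed/ path on youtube.com or
    // youtube-nocookie.com. Adjust to whatever your pages really contain.
    static boolean isYoutubeEmbed(String src) {
        String s = src.toLowerCase(Locale.ROOT);
        return s.contains("youtube.com/embed/")
                || s.contains("youtube-nocookie.com/embed/");
    }
}
```

This way the "content" field stays clean and you don't need special marker 
tags or a second pass to pull the URLs back out of the text.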

Greetings!

On May 27, 2014, at 9:46 AM, Alan Francis <[email protected]> wrote:

> I have a use case in which we want to single out pages that have an iframe
> embed tag from YouTube and add the embed URL as an additional field for
> indexing.
> 
> I am using Apache Nutch 1.8 with Solr 4.8.
> 
> What I have done so far is to override the "parse-html" plugin,
> identify iframe tags with YouTube URLs in ComContentUtils.getTextHelper(),
> and append them to "content" wrapped in some special tags.
> 
> I then receive the content in a custom indexing filter plugin, which extracts
> the URLs from the content and adds them as a new field in the NutchDocument.
> 
> Is there a better way to do this?
> 
> 
> 
> -- 
> -Alan Francis

