I have a use case in which we want to identify pages that contain a YouTube
iframe embed tag and add the embedded URL as an additional field for indexing.

I am using Apache Nutch 1.8 with Solr 4.8.

What I have done so far is to override the "parse-html" plugin, identify
iframe tags with YouTube URLs in DOMContentUtils.getTextHelper(), and
append those URLs to "content" wrapped in some special tags.
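
For illustration, the iframe detection step could look roughly like the
sketch below. It is not the actual plugin code: the class name
YoutubeIframeFinder, the helper isYoutubeUrl() and the list of accepted hosts
are all assumptions, and it uses only the standard org.w3c.dom API rather than
anything Nutch-specific.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Nutch.
class YoutubeIframeFinder {

    // Assumption: only /embed/ URLs on these two hosts count as YouTube embeds.
    static boolean isYoutubeUrl(String src) {
        if (src == null) {
            return false;
        }
        String s = src.trim().toLowerCase(Locale.ROOT);
        // Drop "http://", "https://" or a protocol-relative "//" prefix.
        s = s.replaceFirst("^(https?:)?//", "");
        s = s.replaceFirst("^www\\.", "");
        return s.startsWith("youtube.com/embed/")
            || s.startsWith("youtube-nocookie.com/embed/");
    }

    // Recursively walks the DOM and collects the src of every YouTube iframe.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src");
            if (isYoutubeUrl(src)) {
                out.add(src.trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    static List<String> find(Node root) {
        List<String> urls = new ArrayList<>();
        collect(root, urls);
        return urls;
    }
}
```

Calling YoutubeIframeFinder.find(root) on the parsed document root returns
the src values in document order.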

I then receive the content in a custom indexing filter plugin, extract
the URLs from the content, and add them as a new field in NutchDocument.
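
The extraction step on the indexing side can be sketched in plain Java as
follows. The marker strings "[[yt-embed]]" and "[[/yt-embed]]" and the class
name MarkedUrlExtractor are hypothetical stand-ins for the "special tags"
mentioned above; the real markers and the surrounding indexing filter code
are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Nutch.
class MarkedUrlExtractor {

    // Assumption: the parse step wrapped each URL in these markers.
    static final String OPEN = "[[yt-embed]]";
    static final String CLOSE = "[[/yt-embed]]";

    // Non-greedy match so that several marked URLs in one text stay separate.
    private static final Pattern MARKED = Pattern.compile(
        Pattern.quote(OPEN) + "(.*?)" + Pattern.quote(CLOSE), Pattern.DOTALL);

    // Returns every non-empty URL found between the markers, in order.
    static List<String> extractUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = MARKED.matcher(content);
        while (m.find()) {
            String url = m.group(1).trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Removes the marked spans so they do not pollute the indexed content.
    static String stripMarkers(String content) {
        return MARKED.matcher(content).replaceAll("").trim();
    }
}
```

Inside the indexing filter, each string returned by extractUrls() would then
be added to the new field, and stripMarkers() could be used to clean the
"content" field before it is indexed.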

Is there a better way to do this?



-- 
-Alan Francis
