I have a use case in which we want to single out pages that contain an iframe embed tag from YouTube and add the embed URL as an additional field for indexing.
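To make the goal concrete, here is a minimal sketch of the kind of extraction involved. It is plain Java with no Nutch dependencies. The class name YoutubeEmbedExtractor and the regular expression are illustrative assumptions, not part of Nutch or of the plugin code described here, and the pattern only covers the common /embed/ URL form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): collects the src URLs of
// YouTube iframe embeds found in a raw HTML string.
class YoutubeEmbedExtractor {

    // Matches <iframe ... src="...youtube.com/embed/..."> with http, https,
    // or protocol-relative URLs; also accepts youtube-nocookie.com.
    private static final Pattern YOUTUBE_IFRAME = Pattern.compile(
        "<iframe\\b[^>]*?\\bsrc\\s*=\\s*[\"']"
            + "((?:https?:)?//(?:www\\.)?youtube(?:-nocookie)?\\.com/embed/[^\"']+)"
            + "[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns every matching embed URL in document order (empty if none).
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = YOUTUBE_IFRAME.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

A regex over raw HTML is only an approximation. Inside a parse plugin the same check would more naturally be made on the src attribute of iframe nodes while walking the DOM.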
I am using Apache Nutch 1.8 with Solr 4.8. What I have done so far is override the "parse-html" plugin to identify iframe tags with YouTube URLs in DOMContentUtils.getTextHelper() and append them to "content", wrapped in some special tags. I then receive the content in a custom indexing filter plugin, extract the URLs from it, and add them as a new field in NutchDocument.

Is there a better way to do this?

--
Alan Francis

