I've got the HtmlParseFilter and the IndexingFilter working. I added the video links to the meta tags and extracted them so that they are added to the NutchDocument as a new field.
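For reference, the copy step looks roughly like the sketch below. It is only an illustration under stated assumptions: plain Java maps stand in for the Nutch parse metadata and for the NutchDocument fields so that it runs on its own, and the key name "video_links" and the class and method names are made up for the example. In the real plugin this logic would sit inside the IndexingFilter's filter() method and read from the Parse object instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: the maps below are NOT Nutch classes. They only model
// "multi-valued metadata" (source) and "document fields" (target).
class MetadataToFieldsSketch {

    // Copies every non-blank value of the selected keys into the document fields.
    static void copyFields(Map<String, String[]> parseMeta,
                           Map<String, List<String>> docFields,
                           List<String> keys) {
        for (String key : keys) {
            String[] values = parseMeta.get(key);
            if (values == null) {
                continue; // this page has no such metadata entry
            }
            for (String value : values) {
                if (value == null || value.trim().isEmpty()) {
                    continue; // skip empty values instead of indexing them
                }
                docFields.computeIfAbsent(key, k -> new ArrayList<>()).add(value.trim());
            }
        }
    }

    public static void main(String[] args) {
        // "video_links" is a hypothetical key set earlier by the parse filter.
        Map<String, String[]> parseMeta = new HashMap<>();
        parseMeta.put("video_links", new String[] {
            "https://www.youtube.com/embed/abc123", "  " });

        Map<String, List<String>> docFields = new LinkedHashMap<>();
        copyFields(parseMeta, docFields, Arrays.asList("video_links", "description"));

        System.out.println(docFields);
    }
}
```

Only the keys listed explicitly are copied, so the document does not pick up every piece of metadata by accident.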
I have to call an API (not HTTP) to push the data to Solr, so I need to write an IndexWriter plugin for this. But I noticed that the IndexWriter plugin only takes a NutchDocument as input. Does that mean that, if I want to index meta tags, I have to copy the content metadata and the parse metadata from the Parse object into the NutchDocument in the IndexingFilter? Or is there another way to do this?

On Wed, May 28, 2014 at 12:14 AM, Jorge Luis Betancourt Gonzalez <[email protected]> wrote:

> I’ve done something similar, not with iframes but with other custom
> elements I needed, and the same logic applies. Implement a custom
> HtmlParseFilter and an IndexingFilter; this way you can control how the
> data is indexed. You’re on the right track, though rather than overriding
> parse-html I would implement a new plugin just for your logic.
>
> Greetings!
>
> On May 27, 2014, at 9:46 AM, Alan Francis <[email protected]> wrote:
>
> > I have a use case in which we want to single out pages that contain an
> > iframe embed tag from YouTube and add the embedded URL as an additional
> > field for indexing.
> >
> > I am using Apache Nutch 1.8 with Solr 4.8.
> >
> > What I have done so far is to override the "parse-html" plugin,
> > identify iframe tags with YouTube URLs in DOMContentUtils.getTextHelper(),
> > and append them to "content" wrapped in some special tags.
> >
> > I then receive the content in a custom IndexingFilter plugin, which
> > extracts the URLs from the content and adds them as a new field in the
> > NutchDocument.
> >
> > Is there a better way to do this?
> >
> > --
> > -Alan Francis

--
-Alan Francis

