Hi - I think I would implement a custom parse filter that looks for the specific
tags and attributes and adds them to the parse metadata. Using the index-metatags
plugin, I would then have those newly added fields indexed.
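
To make the idea a bit more concrete, here is a rough sketch of only the tag-matching part, written against the plain org.w3c.dom API so it runs without Nutch on the classpath. In a real plugin this logic would sit inside the filter method of an HtmlParseFilter implementation, which is handed the parsed DOM, and the collected URLs would then be added to the parse metadata under a key of your choosing. The class name, the method names and the URL patterns below are all made up for illustration, not anything Nutch ships.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the name is an assumption, not part of Nutch.
class YoutubeIframeFinder {

    // Returns the src attribute of every <iframe> under root that points at YouTube.
    static List<String> findYoutubeEmbeds(Node root) {
        List<String> urls = new ArrayList<>();
        collect(root, urls);
        return urls;
    }

    // Depth-first walk over the DOM tree.
    private static void collect(Node node, List<String> urls) {
        // Compare case-insensitively, since HTML parsers may upper-case element names.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src"); // "" when absent
            if (isYoutube(src)) {
                urls.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), urls);
        }
    }

    // Assumed URL patterns for YouTube embeds; extend as needed.
    static boolean isYoutube(String src) {
        String s = src.toLowerCase(Locale.ROOT);
        return s.contains("youtube.com/embed/")
                || s.contains("youtube-nocookie.com/embed/");
    }

    // Test helper only: parses a well-formed XHTML string into a DOM tree.
    static Node parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Node doc = parseXhtml("<html><body>"
                + "<iframe src=\"https://www.youtube.com/embed/abc123\"></iframe>"
                + "<iframe src=\"https://example.com/player\"></iframe>"
                + "</body></html>");
        for (String url : findYoutubeEmbeds(doc)) {
            System.out.println(url);
        }
    }
}
```

Doing the matching on the DOM in a parse filter means the "content" field stays untouched, and the indexing side only has to copy a metadata value into a field instead of parsing special tags back out of the text.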

 
Markus

 
-----Original message-----
From:Alan Francis <[email protected]>
Sent:Tue 27-05-2014 15:47
Subject:Identifying Video Links in Pages
To:[email protected]; 
I have a use case in which we want to separate out pages that have an iframe
embed tag from YouTube, and add it as an additional field for indexing.

I am using Apache Nutch 1.8 with Solr 4.8.

What I have done so far is to override the "parse-html" plugin, identify
iframe tags with YouTube URLs in DOMContentUtils.getTextHelper(), and
append them to "content" with some special tags.

I then receive the content in a custom indexing filter plugin, extract
the URLs from it, and add them as a new field in the NutchDocument.

Is there a better way to do this?



-- 
-Alan Francis
