Hi,

Do I understand correctly that you want all iframe src attributes on a given 
page stored in the iframe field?

The src attributes are not extracted, and there is currently no facility to do 
so. You should create your own HtmlParseFilter that loops through the document 
looking for iframe tags and collects their src attributes, then adds those as 
parse metadata. You can then index them with the index-metadata plugin. I'm not 
sure it supports multi-valued metadata fields in Nutch 1.6, but it certainly 
will in 1.7.
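
To illustrate the "loop through the document" part, here is a rough sketch of 
the DOM walk using only JDK classes, so it runs without Nutch on the classpath. 
The class name IframeSrcCollector and the collectFromXhtml helper are made up 
for this example. In a real plugin the walk would start from the 
DocumentFragment that Nutch passes to HtmlParseFilter.filter(), and each value 
would be added to the parse metadata instead of a list. Treat it as a starting 
point under those assumptions, not a finished plugin.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of Nutch.
class IframeSrcCollector {

    /** Walks the DOM below root and returns the src of every iframe element. */
    public static List<String> collect(Node root) {
        List<String> srcs = new ArrayList<>();
        walk(root, srcs);
        return srcs;
    }

    private static void walk(Node node, List<String> srcs) {
        // HTML parsers may report tag names in upper case, so compare case-insensitively.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src").trim();
            if (!src.isEmpty()) {
                // In a Nutch plugin (assumption), this is where the value would be
                // added to the parse metadata under a key such as "iframe".
                srcs.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, srcs);
        }
    }

    /** Demo-only helper: parses well-formed XHTML with the JDK parser and collects srcs. */
    public static List<String> collectFromXhtml(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return collect(doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<iframe src=\"http://a.example/x\"></iframe>"
                + "<p><IFRAME src=\"http://b.example/y\"/></p>"
                + "<iframe/>"
                + "</body></html>";
        System.out.println(collectFromXhtml(page));
    }
}
```

If I remember correctly, the index-metadata plugin picks the parse metadata 
keys to index from the index.parse.md property in nutch-site.xml, so the key 
you choose in the filter (for example "iframe") would have to be listed there.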

Use the bin/nutch parsechecker and indexchecker tools to check if your plugin 
works.

Cheers

 
 
-----Original message-----
> From:Amit Sela <am...@infolinks.com>
> Sent: Tuesday 25th June 2013 16:26
> To: user@nutch.apache.org
> Subject: Fetch iframe from HTML (if exists)
> 
> Hi all,
> 
> I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> src field into Solr.
> i.e.,
> <iframe src="something" scrolling="" frameborder="".......>
> So i want to fetch the iframe and index it as iframe so that I could find
> URLS by iframe src.
> 
> I'm crawling with no depth over a seed list, and I don't want to crawl to
> the iframe src, just to index and store it.
> 
> I tried adding
> <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> 
> and
> <field name="iframe" type="text_general" stored="true" indexed="true"
> multiValued="true"/> to schema.xml
> 
> and
> <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> 
> What am I missing ?
> 
> Thanks,
> 
> Amit.
> 
