Unable to crawl flash based webpages(SWF) in Nutch1.x

jagadeesh9.k Tue, 13 Aug 2013 07:20:41 -0700

I am unable to crawl the SWF content of this page:
<http://www.drivehq.com/help/swf/onlinebackup/OnlineBackupdemo.htm>. No
exception is thrown during the crawl, but only part of the SWF content is
crawled (that is, .... <str name="content">DriveHQ Online Backup - Live Demo</str>....).
*Please find below the Solr output:*


<http://lucene.472066.n3.nabble.com/file/n4084301/SWFCrawlIssue.bmp> 
................................


*1. nutch-site.xml:*

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
 <property>

    <name>plugin.folders</name>

    <value>D:/POC/trunk/build/plugins</value>

   </property>

  <property>

    <name>http.agent.name</name>

    <value>Your Nutch Spider</value>

  </property>
  
  <property>
  <name>http.accept</name>
 
<value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/javascript,application/x-shockwave-flash</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>
  
  <property>
  <name>plugin.includes</name>
 
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|swf|js)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

 
 <property>
  <name>metatags.names</name>
   <value>description;keywords</value>
   <description> Names of the metatags to extract, separated by ';'.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
  </description>
 </property>
 
 <property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate
  fields. Can be used e.g. for 'description' or 'keywords' provided that these
  values are generated by a parser (see parse-metatags plugin)
  </description>
</property>

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated
  documents. By default this property is activated due to extremely high levels
  of CPU which parsing can sometimes take.
  </description>
</property>

<property>
<name>http.content.limit</name> 
<value>5097152</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>5097152</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>30.0</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server.</description>
</property>
</configuration>
..............................................

*2. I have added the swf plugin to plugin.includes in the Nutch configuration file (see above).*
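
As a quick sanity check on point 2, here is a minimal sketch in Python (not part of Nutch) of how the plugin.includes expression posted above selects plugin directory names. It assumes the expression has to match the whole directory name rather than a prefix; the helper name `is_included` is hypothetical and only for illustration.

```python
import re

# plugin.includes value copied from the nutch-site.xml in this message.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags|swf|js)|"
    "index-(basic|anchor|metadata)|indexer-solr|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the expression.

    Assumption: the expression must match the entire name, not just a prefix.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Print which of a few plugin directory names the expression selects.
    for plugin_id in ("parse-swf", "parse-tika", "protocol-httpclient"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption, parse-swf is selected by the expression, so the partial output is probably not caused by the plugin being left out of plugin.includes.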

When I crawl a simple SWF page with no animation, such as
<http://www.swftools.org/flash/textsnapshot.html>, the crawled output is
correct.
I don't know which configuration setting I am missing. Any suggestions are
appreciated. Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-crawl-flash-based-webpages-SWF-in-Nutch1-x-tp4084301.html
Sent from the Nutch - User mailing list archive at Nabble.com.
