I am unable to crawl the SWF content from this page: <http://www.drivehq.com/help/swf/onlinebackup/OnlineBackupdemo.htm>. Even though there was no exception while crawling, only a small part of the SWF content is crawled (that is, .... <str name="content">DriveHQ Online Backup - Live Demo </str> ....). *PFB the Solr output:*
<http://lucene.472066.n3.nabble.com/file/n4084301/SWFCrawlIssue.bmp>

*1. nutch-site.xml:*

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>plugin.folders</name>
  <value>D:/POC/trunk/build/plugins</value>
</property>

<property>
  <name>http.agent.name</name>
  <value>Your Nutch Spider</value>
</property>

<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/javascript,application/x-shockwave-flash</value>
  <description>Value of the "Accept" request header field.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|swf|js)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
  <description>Names of the metatags to extract, separated by ';'. Use '*' to extract all metatags. Prefixes the names with 'metatag.' in the parse-metadata. For instance, to index description and keywords, you need to activate the plugin index-metadata and set the value of the parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see the parse-metatags plugin).</description>
</property>

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to the extremely high levels of CPU which parsing can sometimes take.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>5097152</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>5097152</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>30.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

</configuration>

*2. I have added the swf plugin in the Nutch configuration file.* When I try to crawl a simple SWF page with no animation, such as <http://www.swftools.org/flash/textsnapshot.html>, I am able to view the proper crawled output.

I don't know what configuration is missing. Any suggestions are appreciated.

thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-crawl-flash-based-webpages-SWF-in-Nutch1-x-tp4084301.html
Sent from the Nutch - User mailing list archive at Nabble.com.
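A note on the plugin.includes setting in the config above: both parse-tika and parse-swf are enabled, and in Nutch 1.x the parser actually chosen for a given MIME type is governed by conf/parse-plugins.xml, so it is worth checking that application/x-shockwave-flash is mapped to parse-swf rather than to Tika. A minimal sketch of the relevant mapping, following the stock parse-plugins.xml format (the extension-id shown is the SWFParser class name and should be checked against your build):

```xml
<!-- conf/parse-plugins.xml: route SWF content to parse-swf instead of Tika -->
<parse-plugins>

  <mimeType name="application/x-shockwave-flash">
    <plugin id="parse-swf" />
  </mimeType>

  <!-- map the plugin id to its parser extension -->
  <aliases>
    <alias name="parse-swf"
           extension-id="org.apache.nutch.parse.swf.SWFParser" />
  </aliases>

</parse-plugins>
```

With that in place, running the parse checker tool against the SWF URL (in Nutch 1.x, something like `bin/nutch parsechecker -dumpText <url>`) can show exactly what text the selected parser extracts, independently of the Solr indexing step.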

