Hi Guys,
I've configured my Nutch engine to crawl all dynamic links but exclude all
JS, CSS and image files. I can see in my logs that the Suffix URL Filter
plugin (urlfilter-suffix) is loaded, as shown below:
Registered Plugins:
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Suffix URL Filter (urlfilter-suffix)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
However, when I checked the logs I saw that it kept trying to fetch URLs
with JS and CSS extensions, such as:
2007-05-03 15:52:25,586 INFO fetcher.Fetcher - fetching http://www.toto.com/css/media.css?8
2007-05-03 15:56:12,224 INFO fetcher.Fetcher - fetching http://www.toto.com/ros/form.js?jashka8
It should not do that, as I've explicitly configured my URL filter to
exclude those files. I looked at the code, and I think the plugin doesn't
correctly handle dynamic URLs that have a "?" and parameters after the
file extension.
Could you please help me with this and confirm whether I'm right?
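To illustrate what I suspect is happening, here is a minimal sketch. It is
NOT the actual urlfilter-suffix code; the class name, method names and
suffix list are made up for illustration. It only shows why a plain
"ends with" test on the whole URL string would miss the two URLs from my
log, while the same test on the path alone (query string removed) would
catch them.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Nutch plugin: compares a suffix test on the
// full URL string with a suffix test on the URL path only.
public class SuffixCheckSketch {

    // Assumed exclusion list, for illustration only.
    static final List<String> EXCLUDED = Arrays.asList(".js", ".css", ".gif", ".jpg", ".png");

    // Returns true if the given string ends with one of the excluded suffixes.
    static boolean endsWithExcluded(String s) {
        String lower = s.toLowerCase();
        for (String suffix : EXCLUDED) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Naive check: looks at the whole URL, so "?8" hides the ".css" suffix.
    static boolean excludedByFullUrl(String url) {
        return endsWithExcluded(url);
    }

    // Path-based check: drops the query string before testing the suffix.
    static boolean excludedByPath(String url) {
        String path = URI.create(url).getPath();
        return path != null && endsWithExcluded(path);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.toto.com/css/media.css?8",
            "http://www.toto.com/ros/form.js?jashka8"
        };
        for (String url : urls) {
            // Prints "full-url=false path-only=true" for both URLs.
            System.out.println(url + " full-url=" + excludedByFullUrl(url)
                    + " path-only=" + excludedByPath(url));
        }
    }
}
```

If the plugin really behaves like the first check, that would explain why
these URLs still reach the fetcher.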
Thanks
E
Regards,
~E~
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general