Hi guys,

I've configured my Nutch engine to crawl all dynamic links but to exclude all
JS, CSS and image files. I can see in my logs that the Suffix URL Filter
plugin (urlfilter-suffix) is loaded, as shown below:
Registered Plugins:
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  CyberNeko HTML Parser (lib-nekohtml)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Site Query Filter (query-site)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Basic URL Normalizer (urlnormalizer-basic)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Html Parse Plug-in (parse-html)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Pass-through URL Normalizer (urlnormalizer-pass)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Regex URL Filter Framework (lib-regex-filter)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Basic Indexing Filter (index-basic)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Basic Summarizer Plug-in (summary-basic)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Text Parse Plug-in (parse-text)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  JavaScript Parser (parse-js)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Regex URL Filter (urlfilter-regex)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  Basic Query Filter (query-basic)
2007-05-03 15:48:08,277 INFO  plugin.PluginRepository -  HTTP Framework (lib-http)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  URL Query Filter (query-url)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  Regex URL Normalizer (urlnormalizer-regex)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  Http Protocol Plug-in (protocol-http)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  the nutch core extension points (nutch-extensionpoints)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  Suffix URL Filter (urlfilter-suffix)
2007-05-03 15:48:08,293 INFO  plugin.PluginRepository -  OPIC Scoring Plug-in (scoring-opic)


However, when I checked the logs I saw that it kept trying to fetch URLs with
JS and CSS extensions, such as:
2007-05-03 15:52:25,586 INFO  fetcher.Fetcher - fetching http://www.toto.com/css/media.css?8
2007-05-03 15:56:12,224 INFO  fetcher.Fetcher - fetching http://www.toto.com/ros/form.js?jashka8

It should not do that, as I've clearly specified in my urlfilter that those
files are to be excluded. I looked at the code and I think the plugin doesn't
correctly handle dynamic URLs that have a "?" and parameters after the file
extension.
Could you please help me with this and confirm whether I'm right?
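
To make my suspicion concrete, here is a minimal Java sketch of the behaviour
I think I'm seeing. It is not the plugin's actual code: the class name, the
method names and the suffix list are all hypothetical, and I'm only assuming
that the suffix is compared against the whole URL string. Under that
assumption, a URL such as http://www.toto.com/css/media.css?8 ends with "?8"
rather than ".css", so it is never rejected; checking the suffix after
cutting off the query string would catch it.

```java
import java.util.Locale;

// Illustration only; none of these names come from Nutch.
class SuffixFilterSketch {

    // Hypothetical exclusion list, hard-coded for the example.
    static final String[] EXCLUDED = {".js", ".css", ".gif", ".jpg", ".png"};

    // Assumed current behaviour: test the suffix against the full URL string.
    static boolean rejectedNaive(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : EXCLUDED) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Cut the URL at the first '?' or '#', whichever comes first.
    static String withoutQuery(String url) {
        int cut = url.length();
        int q = url.indexOf('?');
        if (q >= 0) {
            cut = q;
        }
        int h = url.indexOf('#');
        if (h >= 0 && h < cut) {
            cut = h;
        }
        return url.substring(0, cut);
    }

    // Possible fix: test the suffix only on the part before the query string.
    static boolean rejectedByPath(String url) {
        return rejectedNaive(withoutQuery(url));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.toto.com/css/media.css?8",
            "http://www.toto.com/ros/form.js?jashka8",
            "http://www.toto.com/css/media.css"
        };
        for (String url : urls) {
            System.out.println(url
                + " naive=" + rejectedNaive(url)
                + " byPath=" + rejectedByPath(url));
        }
    }
}
```

With the two URLs from my fetcher log, the naive check lets both through,
while the path-only check rejects them.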

Thanks
E

Regards,
~E~
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general