Following Andrzej's explanation: who is using the suffix-urlfilter, and how
do you use it?
From my point of view, and looking at the list of suffixes defined in
the conf file, it makes more sense to match a suffix against the end of the
URL path component (.../css/media.css).
What do you think?
Emmanuel JOKE wrote:
Hi Andrzej,
I don't want to bother you, but I don't understand what you meant by "the
former".
I don't want to crawl any links that point to CSS or JS. I've
configured the urlfilter to exclude the corresponding suffixes, but my
crawler still goes through the following links:
http://www.toto.com/css/media.css?8
http://www.toto.com/ros/form.js?jashka8
I don't understand why this should be normal. Could you help me
understand?
What do we call suffixes?
That's the key question here. If you define a suffix as the end of the URL
path component (.../css/media.css), then you are right that the current
urlfilter doesn't support it. However, if you define a suffix as the end
of the complete URL (.../css/media.css?8), then the current urlfilter works OK.
In your case the first definition would work best, and the second
definition works badly. The question for the Nutch community is which
behaviour is generally more useful. If we come to the conclusion that your
definition is more useful, then the suffix-urlfilter should be fixed in
the public code base. If we come to the opposite conclusion, then you
can still fix your local version of Nutch to follow the first definition.
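
To make the two definitions concrete, here is a rough sketch in Python. It
is not the Nutch suffix-urlfilter code; the function names and the suffix
list are made up for illustration, and it only contrasts matching against
the complete URL with matching against the path component.

```python
from urllib.parse import urlsplit

# Hypothetical suffix list for illustration only; not taken from the Nutch conf file.
BLOCKED_SUFFIXES = (".css", ".js")


def blocked_by_full_url(url, suffixes=BLOCKED_SUFFIXES):
    """Second definition: a suffix is the end of the complete URL."""
    return url.lower().endswith(suffixes)


def blocked_by_path(url, suffixes=BLOCKED_SUFFIXES):
    """First definition: a suffix is the end of the URL path component,
    so any query string (?8, ?jashka8) is ignored before matching."""
    return urlsplit(url).path.lower().endswith(suffixes)


if __name__ == "__main__":
    for url in (
        "http://www.toto.com/css/media.css?8",
        "http://www.toto.com/ros/form.js?jashka8",
    ):
        print(url, blocked_by_full_url(url), blocked_by_path(url))
```

With the two URLs from the report above, the full-URL check lets both
through because of the trailing query string, while the path check rejects
both.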
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general