Elwin wrote:
 For example: <a href="javascript:customCss(6017162)"
 id="customCssMenu" >test</a>. Can Nutch actually get content from
 this kind of URL?



Not without some drastic changes... I have an early implementation of a fetcher that uses the HttpUnit library to actually interpret the JavaScript and mimic a browser's behavior. The problem is that it's very slow: the current fetcher implementation is stateless, whereas one that supports JavaScript needs to be stateful, and it needs to retrieve multiple resources in one go (e.g. CSS, frames, script files, the main body, etc.). Then, discovering all outlinks requires a simulated "click" on every active element, which in turn requires executing all scripts associated with all current windows. If the scripts are not idempotent, you need to simulate the "Back" button, or drop and reload everything, to restore the previous state ...
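
To make the idempotency problem concrete, here is a toy sketch in plain Java. It is not the HttpUnit-based fetcher and makes no HttpUnit calls; the Page class, the load() method and the handler names are all hypothetical stand-ins for a loaded page and its script state. It only illustrates the "drop/reload everything" strategy: every simulated click runs against a freshly loaded page, so a handler that changes state cannot affect the next click.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class ClickCrawlSketch {
    /** Toy page: some script state plus click handlers that yield a target URL. */
    static class Page {
        int counter = 0; // stands in for state kept by the page's scripts
        final Map<String, Function<Page, String>> handlers = new LinkedHashMap<>();
    }

    /** Hypothetical "fetch and initialise" step; a real fetcher would hit the network. */
    static Page load() {
        Page p = new Page();
        // Idempotent handler: always resolves to the same target.
        p.handlers.put("customCssMenu", pg -> "/css/6017162");
        // Non-idempotent handler: every click advances the script state.
        p.handlers.put("next", pg -> "/page/" + (++pg.counter));
        return p;
    }

    /** Click every active element, reloading the page before each click. */
    static Set<String> discoverOutlinks() {
        Set<String> outlinks = new LinkedHashSet<>();
        for (String id : load().handlers.keySet()) {
            Page fresh = load(); // drop/reload to restore the initial state
            outlinks.add(fresh.handlers.get(id).apply(fresh));
        }
        return outlinks;
    }

    public static void main(String[] args) {
        System.out.println(discoverOutlinks()); // prints [/css/6017162, /page/1]
    }
}
```

Even in this toy version the page is loaded once per active element plus once to enumerate them, which hints at why the real thing is slow.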

So, it's not easy. Your best bet would be to use a separate fetcher to fetch these problematic sites, and use the standard fetcher to fetch everything else.
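
One way to decide which pages belong to the "problematic" group is to look for javascript: pseudo-URLs in already-fetched HTML. The following is only a rough sketch under that assumption; FetcherRouter, javascriptHrefs and chooseFetcher are hypothetical names, not part of Nutch, and a regex is a crude substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FetcherRouter {
    // Matches href="javascript:..." or href='javascript:...', case-insensitively.
    private static final Pattern JS_HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])\\s*javascript:(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the script expressions found in javascript: hrefs. */
    static List<String> javascriptHrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = JS_HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(2).trim());
        }
        return found;
    }

    /** Labels a page for the (hypothetical) JavaScript-capable or standard fetcher. */
    static String chooseFetcher(String html) {
        return javascriptHrefs(html).isEmpty() ? "standard" : "javascript-capable";
    }

    public static void main(String[] args) {
        String page = "<a href=\"javascript:customCss(6017162)\" id=\"customCssMenu\" >test</a>";
        System.out.println(javascriptHrefs(page));   // prints [customCss(6017162)]
        System.out.println(chooseFetcher(page));     // prints javascript-capable
        System.out.println(chooseFetcher("<a href=\"http://example.com/\">x</a>")); // prints standard
    }
}
```

Sites flagged this way could then be listed separately and handed to the slower fetcher, while everything else stays on the standard one.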

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

