Hi all,

I am new to Nutch. I want to crawl a specific website that has lots of
JavaScript URLs, such as <a href=*javascript*(1);>, so I wondered whether
Nutch can follow JavaScript URLs. After searching the mailing list, it
seems that Nutch does not support them.

So I decided to solve it in a simple way: replace
"<a href=*javascript*(1);>" with "<a href='www.site.com/servlet?parameter=1'>"
so that Nutch can recognize the link.

Is this approach correct? I think the replacement code needs to run before
Nutch parses the content, but how do I patch Nutch to do that? Does anyone
here know the details?
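
To make the idea concrete, here is a rough sketch of the rewriting step on
its own, in plain Java. It is not Nutch code and does not show where such a
step would hook into Nutch. The class name JsLinkRewriter and the method
rewrite are made up for illustration, and the regular expression assumes the
real markup looks like href=javascript(N); with a numeric argument N, which
may not hold for every link on the site.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class JsLinkRewriter {
    // Assumed link shape: href=javascript(N); with optional quotes around it.
    private static final Pattern JS_HREF = Pattern.compile(
        "href\\s*=\\s*[\"']?javascript\\((\\d+)\\);?[\"']?",
        Pattern.CASE_INSENSITIVE);

    // Target URL prefix taken from the example in this post.
    private static final String TARGET = "www.site.com/servlet?parameter=";

    // Replaces every matching javascript href with a plain servlet URL
    // and leaves all other markup unchanged.
    public static String rewrite(String html) {
        Matcher m = JS_HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "href='" + TARGET + m.group(1) + "'";
            // quoteReplacement guards against '$' or '\' in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<a href=javascript(1);>page 1</a>"));
    }
}
```

The numeric argument is captured and carried over into the new URL, so each
javascript link maps to its own servlet URL. Note that a target without a
scheme, such as www.site.com/..., is normally treated as a relative link, so
a real rewrite would probably need an absolute URL.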

Yu
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
