Doğacan Güney wrote:
> On Fri, Jul 17, 2009 at 22:48, reinhard schwab<[email protected]> wrote:
>
>> when i crawl a domain such as
>>
>> http://www.weissenkirchen.at/
>>
>> nutch extracts these outlinks.
>> do they come from some heuristics?
>>
>
> These are probably coming from the parse-js plugin. The JavaScript parser
> makes a best effort to extract outlinks, but many of the outlinks
> will be broken.
>

I have looked at JSParseFilter. The heuristic is:

    private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
    // A simple pattern. This allows also invalid URL characters.
    private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
    // Alternative pattern, which limits valid url characters.

If the two patterns match, and the constructed URL is accepted by the URL
constructor without a MalformedURLException, the "url" is collected.
If I understand it correctly, the second pattern matches any run of
non-whitespace characters that contains a dot or a slash.
In the URLs below I see HTML code and parts of arithmetic expressions.
Maybe the heuristic can be improved by checking for both cases.

I would also appreciate some test code. Heuristics in particular need to be
tested. Until now there is only a main method to test it.

>
>> they seem obvious to be wrong and have status db_gone in crawldb.
>>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+((110-pesp)/100)+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</A>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</A></TD>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</DIV>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</FONT>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iarw>0
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iarw+2):
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+i.ids+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iicw>0
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iicw+2):
>> URL:: http://www.weissenkirchen.at/kirchenwirt/kirchenwirt.js
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER></ILAYER></FONT></TD>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER></LAYER>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.height+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.width+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+m.maln+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+m.mei+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/)+(nVER>=5.5?(pehd!=
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(nVER<5.5?psds:0)+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/:p.efhd+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.isst
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.mei+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.plmw+2):
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppad+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppi+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.prmw+2):
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.pspc+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver?ssiz:
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(s?p.efsh+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/+stgme(i).mbnk+
>> URL:: http://www.weissenkirchen.at/kirchenwirt/)+stittx(i)+(p.pver
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</STYLE>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/<STYLE>\n.st_tbcss,.st_tdcss,.st_divcss,.st_ftcss{border:none;padding:0px;margin:0px;}\n</STYLE>
>> URL:: http://www.weissenkirchen.at/kirchenwirt/</TABLE>
>>
>> more than 10 % of the tried pages have status db_gone
>> and many of them are from wrongly extracted outlinks.
>>
>> reinh...@thord:>bin/dump
>> crawl/dump
>> CrawlDb statistics start: crawl/crawldb
>> Statistics for CrawlDb: crawl/crawldb
>> TOTAL urls: 7199
>> retry 0: 7048
>> retry 1: 67
>> retry 10: 1
>> retry 12: 1
>> retry 15: 3
>> retry 17: 2
>> retry 18: 2
>> retry 19: 1
>> retry 2: 56
>> retry 4: 1
>> retry 7: 14
>> retry 9: 3
>> min score: 0.0
>> avg score: 0.014402139
>> max score: 2.513
>> status 1 (db_unfetched): 38
>> status 2 (db_fetched): 6250
>> status 3 (db_gone): 737
>> status 4 (db_redir_temp): 148
>> status 5 (db_redir_perm): 25
>> status 6 (db_notmodified): 1
>> CrawlDb statistics: done
>>
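
As a starting point for test code, here is a minimal sketch of the two-pattern heuristic quoted from JSParseFilter. It is not the plugin's code. It assumes java.util.regex in place of whatever regex library the plugin uses, and the class name JsOutlinkSketch, the methods matchesUriPattern, looksLikeCodeFragment and extract, and the sample JavaScript string are all made up for illustration. looksLikeCodeFragment is only one rough guess at "checking for both cases" (HTML tags and expression fragments). It would also reject the rare legitimate URL that starts with "+".

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the two-pattern outlink heuristic (hypothetical class, not Nutch code). */
public class JsOutlinkSketch {

  // Both pattern strings are copied verbatim from the message.
  static final Pattern STRING_PATTERN =
      Pattern.compile("(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)");
  static final Pattern URI_PATTERN =
      Pattern.compile("(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)");

  /** True if the whole candidate string matches URI_PATTERN. */
  public static boolean matchesUriPattern(String candidate) {
    return URI_PATTERN.matcher(candidate).matches();
  }

  /**
   * Hypothetical extra check: flags candidates that contain '<' or '>'
   * (HTML residue, comparisons) or that begin or end with a character
   * typical of a string-concatenation expression.
   */
  public static boolean looksLikeCodeFragment(String candidate) {
    if (candidate.isEmpty()) {
      return false;
    }
    if (candidate.indexOf('<') >= 0 || candidate.indexOf('>') >= 0) {
      return true;
    }
    char first = candidate.charAt(0);
    char last = candidate.charAt(candidate.length() - 1);
    return "+):".indexOf(first) >= 0 || "+(:".indexOf(last) >= 0;
  }

  /**
   * Collects quoted strings from js that match URI_PATTERN and that the
   * URL constructor accepts relative to base. With strict == true the
   * hypothetical code-fragment check is applied as well.
   */
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static List<String> extract(String js, String base, boolean strict) {
    URL baseUrl;
    try {
      baseUrl = new URL(base);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("bad base URL: " + base, e);
    }
    List<String> outlinks = new ArrayList<>();
    Matcher m = STRING_PATTERN.matcher(js);
    while (m.find()) {
      String candidate = m.group(2); // text between the matching quotes
      if (!matchesUriPattern(candidate)) {
        continue;
      }
      if (strict && looksLikeCodeFragment(candidate)) {
        continue;
      }
      try {
        outlinks.add(new URL(baseUrl, candidate).toString());
      } catch (MalformedURLException e) {
        // Rejected by the URL constructor: not collected.
      }
    }
    return outlinks;
  }

  public static void main(String[] args) {
    // Made-up JavaScript containing one real link and two non-links.
    String js = "var s = \"kirchenwirt.js\"; w = '+i.ids+'; t = \"</A>\";";
    String base = "http://www.weissenkirchen.at/kirchenwirt/";
    System.out.println("loose:  " + extract(js, base, false));
    System.out.println("strict: " + extract(js, base, true));
  }
}
```

Without the extra check, all three quoted strings in the sample pass both patterns and the URL constructor, which is the same kind of result as the crawldb entries listed above. With strict set to true, only kirchenwirt.js is kept.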
