reinhard schwab wrote:
> Doğacan Güney wrote:
>
>> On Fri, Jul 17, 2009 at 22:48, reinhard schwab<[email protected]> wrote:
>>
>>
>>> when i crawl a domain such as
>>>
>>> http://www.weissenkirchen.at/
>>>
>>> nutch extracts these outlinks.
>>> do they come from some heuristics?
>>>
>>>
>> These probably come from the parse-js plugin. The JavaScript parser
>> makes a best effort to extract outlinks, but many of the outlinks it
>> finds will be broken.
>>
>>
> i have looked at JSParseFilter.
> the heuristic is
>
> private static final String STRING_PATTERN =
> "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
> // A simple pattern. This allows also invalid URL characters.
> private static final String URI_PATTERN =
> "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
> // Alternative pattern, which limits valid url characters.
>
> if both patterns match, and if the url constructor accepts the
> constructed url without throwing MalformedURLException,
> the "url" is collected.
>
> if i understand it correctly,
> the second pattern matches any run of non-whitespace characters that
> contains a dot or a slash.
> in the urls below i see html code and parts of arithmetic expressions.
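
To make the behaviour described above concrete, here is a minimal sketch that applies the two quoted patterns with java.util.regex and then resolves each candidate against a base URL. The class and method names are made up for illustration, and the use of java.util.regex is an assumption; the plugin itself may use a different matcher and may do extra steps not shown here.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the real JSParseFilter.
class JsLinkHeuristicSketch {
  // Both patterns are copied from the JSParseFilter snippet quoted above.
  private static final Pattern STRING_PATTERN =
      Pattern.compile("(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)");
  private static final Pattern URI_PATTERN =
      Pattern.compile("(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)");

  // True if the whole candidate matches the loose URI pattern.
  static boolean looksLikeUri(String candidate) {
    return URI_PATTERN.matcher(candidate).matches();
  }

  // Collects every quoted string that passes the URI pattern and that the
  // URL constructor accepts when resolved against the base.
  static List<String> extractLinks(String js, String base) {
    URL baseUrl;
    try {
      baseUrl = new URL(base);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("bad base url: " + base, e);
    }
    List<String> links = new ArrayList<>();
    Matcher m = STRING_PATTERN.matcher(js);
    while (m.find()) {
      String candidate = m.group(2); // text between the matching quotes
      if (!looksLikeUri(candidate)) {
        continue;
      }
      try {
        links.add(new URL(baseUrl, candidate).toString());
      } catch (MalformedURLException e) {
        // Not accepted by the URL constructor: skip this candidate.
      }
    }
    return links;
  }

  public static void main(String[] args) {
    String js = "var f = \"kirchenwirt.js\";"
        + " document.write(\"<TD width=\"+i.ids+\">\"+t+\"</TD>\");";
    for (String link : extractLinks(js, "http://www.weissenkirchen.at/kirchenwirt/")) {
      System.out.println(link);
    }
  }
}
```

In this sample the string pattern pairs the closing quote of `"<TD width="` with the opening quote of `">"`, so the expression `+i.ids+` between them is taken as a string literal. It contains a dot, passes the URI pattern, and resolves without an exception, which would explain entries like the ones listed below.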
>
the html code may come from document.write statements.
> maybe the heuristic can be improved by checking for both cases.
> i would also appreciate some test code. heuristics especially need to
> be tested.
> so far there is only a main method to test it.
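
One possible shape for such a check, as a rough sketch only: reject candidates that contain markup brackets or characters typical of JavaScript expressions. The class name, method name, and character set are assumptions chosen for illustration; nothing like this exists in JSParseFilter as far as the quoted code shows.

```java
import java.util.regex.Pattern;

// Hypothetical extra filter, not part of JSParseFilter.
class StricterCandidateCheck {
  // '<' and '>' hint at html written by document.write;
  // '+', '(', ')', '[' and ']' hint at pieces of javascript expressions.
  private static final Pattern SUSPICIOUS = Pattern.compile("[<>+()\\[\\]]");

  // True if the candidate contains none of the suspicious characters.
  static boolean isPlausible(String candidate) {
    return !SUSPICIOUS.matcher(candidate).find();
  }

  public static void main(String[] args) {
    String[] samples = {"kirchenwirt.js", "+i.ids+", "</A></TD>", ":p.efhd+"};
    for (String s : samples) {
      System.out.println(s + " -> " + isPlausible(s));
    }
  }
}
```

The trade-off is that `+` and parentheses are legal in real urls, so a filter like this would also drop some valid links. Whether that is acceptable would have to be checked against a larger set of test pages.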
>
>
>>
>>
>>> they are obviously wrong and have status db_gone in crawldb.
>>>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+((110-pesp)/100)+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</A>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</A></TD>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</DIV>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</FONT>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iarw>0
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iarw+2):
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+i.ids+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iicw>0
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iicw+2):
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/kirchenwirt.js
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER></ILAYER></FONT></TD>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</LAYER></LAYER>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.height+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.width+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+m.maln+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+m.mei+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/)+(nVER>=5.5?(pehd!=
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(nVER<5.5?psds:0)+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/:p.efhd+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.isst
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.mei+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.plmw+2):
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppad+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppi+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.prmw+2):
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+p.pspc+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver?ssiz:
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+(s?p.efsh+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/+stgme(i).mbnk+
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/)+stittx(i)+(p.pver
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</STYLE>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/<STYLE>\n.st_tbcss,.st_tdcss,.st_divcss,.st_ftcss{border:none;padding:0px;margin:0px;}\n</STYLE>
>>> URL:: http://www.weissenkirchen.at/kirchenwirt/</TABLE>
>>>
>>> more than 10% of the pages tried have status db_gone,
>>> and many of them come from wrongly extracted outlinks.
>>>
>>> reinh...@thord:>bin/dump
>>> crawl/dump
>>> CrawlDb statistics start: crawl/crawldb
>>> Statistics for CrawlDb: crawl/crawldb
>>> TOTAL urls: 7199
>>> retry 0: 7048
>>> retry 1: 67
>>> retry 10: 1
>>> retry 12: 1
>>> retry 15: 3
>>> retry 17: 2
>>> retry 18: 2
>>> retry 19: 1
>>> retry 2: 56
>>> retry 4: 1
>>> retry 7: 14
>>> retry 9: 3
>>> min score: 0.0
>>> avg score: 0.014402139
>>> max score: 2.513
>>> status 1 (db_unfetched): 38
>>> status 2 (db_fetched): 6250
>>> status 3 (db_gone): 737
>>> status 4 (db_redir_temp): 148
>>> status 5 (db_redir_perm): 25
>>> status 6 (db_notmodified): 1
>>> CrawlDb statistics: done
>>>