Hi all, I seem to have found a problem with the Nutch 1.3 parser: it is currently generating weird outlinks. The example I will use is http://id.wikipedia.org/wiki/Halaman_Utama. First, I used the ParserChecker class to see how many outlinks the page has, and found that it generates some weird outlinks like:
--------- Url ---------------
http://id.wikipedia.org/wiki/Halaman_Utama
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title: Wikipedia bahasa Indonesia, ensiklopedia bebas
Outlinks: 369
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.startup anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.user anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.util anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.ready anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.wikibits anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.ajax anchor:
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.mwsuggest anchor:
  outlink: toUrl: *http://id.wikipedia.org/wiki/ext.flaggedRevs.advanced* anchor:
  ...

Then I checked the page in the browser and tried to find the links above, but none of them could be found. However, when I searched for "ext.flaggedRevs.advanced" on the page, this is what I found:

... ...

I have set the "parser.html.outlinks.ignore_tags" property in "nutch-site.xml" to ignore the script tag. I have also looked at other threads, such as:

http://lucene.472066.n3.nabble.com/Consider-relative-outlinks-conditionally-as-absolute-URL-td3350098.html
lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html

I even applied this patch to my Nutch:

https://issues.apache.org/jira/browse/NUTCH-1115

However, it still shows those links, which lead to empty/non-existent Wikipedia articles. Can anyone shed some light on how to set up the Nutch 1.3 parser to exclude these kinds of links?
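
In case it helps with diagnosis: my guess (an assumption, I have not traced this in the Nutch source) is that these module names are being picked up as relative links and resolved against the page URL. A minimal Java sketch using only java.net.URI shows that such a string would resolve to exactly the kind of URL listed above. The names RelativeOutlinkDemo and resolve are made up for illustration and are not part of Nutch.

```java
import java.net.URI;

// Illustration only: plain java.net.URI resolution, not Nutch's parser code.
class RelativeOutlinkDemo {

    // Resolve a candidate link string against the URL of the page it was found on.
    static String resolve(String pageUrl, String candidate) {
        return URI.create(pageUrl).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        String page = "http://id.wikipedia.org/wiki/Halaman_Utama";
        // Strings that appear on the page but are not real links (taken from the output above).
        String[] candidates = { "mediawiki.page.startup", "ext.flaggedRevs.advanced" };
        for (String c : candidates) {
            System.out.println(c + " -> " + resolve(page, c));
        }
    }
}
```

If that is what happens, the question becomes where in the page these strings are being extracted from, since ignoring the script tag did not remove them.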
-- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-1-3-Parser-generating-weird-outlinks-tp3416347p3416347.html Sent from the Nutch - User mailing list archive at Nabble.com.

