Strange, I get none of the http://id.wikipedia.org/wiki/mediawiki.. URLs as
outlinks using either parse-html or parse-tika on 1.4-dev. I also tried branch
1.3 but don't see any outlinks with mediawiki substrings, even when I disabled
parser.html.outlinks.ignore_tags.
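
For reference, a minimal sketch of how that property might be set in nutch-site.xml. The value shown is an assumption based on the original message, which only says the script tag is ignored; leaving the value empty means no tags are ignored.

```xml
<!-- nutch-site.xml, inside the <configuration> element.
     Sketch only: the value "script" is assumed, not quoted from the thread. -->
<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value>script</value>
</property>
```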

On Wednesday 12 October 2011 19:27:16 Michael.Sulistijo wrote:
> Hi all,
> 
> I seem to have found a problem with the Nutch 1.3 parser: it is currently
> generating weird outlinks. The example I will use is
> http://id.wikipedia.org/wiki/Halaman_Utama. First, I used the
> ParserChecker class to see how many outlinks the page has, and found
> that it generates some weird outlinks like:
> 
> ---------
> Url
> ---------------
> http://id.wikipedia.org/wiki/Halaman_Utama---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Wikipedia bahasa Indonesia, ensiklopedia bebas
> Outlinks: 369
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.startup anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.user anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.util anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.ready anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.wikibits anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.ajax anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.mwsuggest anchor:
>   outlink: toUrl: *http://id.wikipedia.org/wiki/ext.flaggedRevs.advanced* anchor:
> ...
> 
> Then I checked the page in the browser and tried to find the links above,
> but found none. However, when I searched for
> "ext.flaggedRevs.advanced" on the page, this is what I found:
> ...
> 
> ...
> 
> I have set the "parser.html.outlinks.ignore_tags" property in
> "nutch-site.xml" to ignore the script tag. I have also looked at other
> threads like:
> http://lucene.472066.n3.nabble.com/Consider-relative-outlinks-conditionally-as-absolute-URL-td3350098.html
> http://lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html
> 
> I even applied this patch to my Nutch:
> https://issues.apache.org/jira/browse/NUTCH-1115
> 
> However, it still shows those links, which lead to empty or non-existent
> Wikipedia articles. Can anyone shed light on how to set up the Nutch 1.3
> parser to exclude these kinds of links?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-3-Parser-generating-weird-outlinks-tp3416347p3416347.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
