Hi all,

I seem to have found a problem with the Nutch 1.3 parser: it is currently
generating weird outlinks. The example I will use is
http://id.wikipedia.org/wiki/Halaman_Utama . First, I used the ParserChecker
class to see how many outlinks the page has, and I found that it generates
some weird outlinks like:

---------
Url
---------------
http://id.wikipedia.org/wiki/Halaman_Utama
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Wikipedia bahasa Indonesia, ensiklopedia bebas
Outlinks: 369
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.startup anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.user anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.util anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.ready anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.wikibits anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.ajax anchor: 
  outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.mwsuggest anchor: 
  outlink: toUrl: *http://id.wikipedia.org/wiki/ext.flaggedRevs.advanced* anchor: 
...

Then I checked the page in the browser and tried to find the links above,
but none of them appear there. However, when I searched the page for
"ext.flaggedRevs.advanced", this is what I found:
...

...
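
The reported outlinks have the shape you would get if a bare token such as "ext.flaggedRevs.advanced" were treated as a relative link and resolved against the page address. The sketch below only illustrates that resolution step with the standard java.net.URI class. It is not Nutch's code, the class name RelativeOutlinkDemo is hypothetical, and whether this is what actually happens inside the parser is an assumption.

```java
import java.net.URI;

public class RelativeOutlinkDemo {
    // Resolve a candidate string against the page URL, the same way a
    // relative href would be resolved (illustrative helper, not a Nutch API).
    public static String resolve(String pageUrl, String candidate) {
        return URI.create(pageUrl).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        String page = "http://id.wikipedia.org/wiki/Halaman_Utama";
        // Prints http://id.wikipedia.org/wiki/ext.flaggedRevs.advanced
        System.out.println(resolve(page, "ext.flaggedRevs.advanced"));
        // Prints http://id.wikipedia.org/wiki/mediawiki.page.startup
        System.out.println(resolve(page, "mediawiki.page.startup"));
    }
}
```

Both results match entries in the ParserChecker listing above, which would be consistent with the tokens being picked up from somewhere in the page source rather than from real anchors.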

I have set the "parser.html.outlinks.ignore_tags" property in
"nutch-site.xml" to ignore the script tag. I have also looked at other
threads such as:
http://lucene.472066.n3.nabble.com/Consider-relative-outlinks-conditionally-as-absolute-URL-td3350098.html
lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html 

I even applied the patch from this issue to my Nutch installation:
https://issues.apache.org/jira/browse/NUTCH-1115

However, the parser still produces those links, which lead to empty or
non-existent Wikipedia articles. Can anyone shed some light on how to set up
the Nutch 1.3 parser to exclude these kinds of links?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-1-3-Parser-generating-weird-outlinks-tp3416347p3416347.html
Sent from the Nutch - User mailing list archive at Nabble.com.
