I am using Nutch to parse a university website, and I found that the HTML
parser does not work properly. Here is the problem.

Consider these two student webpages:

1. http://kdd.csd.uwo.ca/doku.php/people/yan_luo
2. http://kdd.csd.uwo.ca/doku.php/people/xiao_li

I ran the commands "nutch parsechecker -dumpText
http://kdd.csd.uwo.ca/doku.php/people/yan_luo"
and "nutch parsechecker -dumpText
http://kdd.csd.uwo.ca/doku.php/people/xiao_li" to extract the text.
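
To reproduce this, the two runs can be scripted side by side. The following
is only a rough sketch: it assumes the "nutch" launcher is on the PATH, and
the helper names build_command, run_parsechecker and main are made up for
illustration. Only the "parsechecker -dumpText <url>" invocation is taken
from the commands quoted above.

```python
import subprocess

# The two student pages from this post.
URLS = [
    "http://kdd.csd.uwo.ca/doku.php/people/yan_luo",
    "http://kdd.csd.uwo.ca/doku.php/people/xiao_li",
]


def build_command(url, nutch="nutch"):
    """Return the argument list for one parsechecker run (hypothetical helper)."""
    return [nutch, "parsechecker", "-dumpText", url]


def run_parsechecker(url):
    """Run parsechecker on url and return whatever it wrote to stdout."""
    result = subprocess.run(build_command(url), capture_output=True, text=True)
    return result.stdout


def main():
    for url in URLS:
        try:
            output = run_parsechecker(url)
        except FileNotFoundError:
            # Assumption: "nutch" is expected on the PATH; stop if it is not.
            print("nutch launcher not found on PATH")
            return
        # Report only the size of the output so the two pages are easy to compare.
        print(f"{url}: {len(output)} characters of output")


if __name__ == "__main__":
    main()
```

Printing only the output length for each URL makes the difference between
the two pages visible at a glance.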

However, Nutch extracts text only from the first webpage and gives no
results for the second one. I do not understand why, since the two webpages
are generated from the same template.
