So, after some research, I think one of the two issues I reported earlier
can be fixed.

To recap, the error in question is:
> fetch okay, but can't parse
> http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> reason: Content-Type not application/pdf:

The problem is that this page does not specify its content type in the
header, and the PDF plugin loads first and has "" for its path suffix. The
same goes for the parse-html plugin. So when the fetcher cannot determine
the content type of a page (i.e. the page does not specify one), the PDF
plugin gets called.

Now, this looked easy to fix by setting the pathsuffix of parser-pdf to
"PDF"... until I read this in Matt Kangas' documentation of the HTML
plugin (Wiki):

"This entry looks a bit strange with the empty pathSuffix value. But that
just means that this plugin doesn't match any pathSuffix value. So,
parse-html is only used when we fetch remote URLs, not anything residing on
the local filesystem."

Focusing on the sentence "So, ... filesystem": does this mean it's best to
leave the pathsuffix blank if we want the plugin invoked for remote URLs?
This was a bit confusing.

***IS IT OKAY TO ADD "PDF" as the pathsuffix?



And lastly, I think there may be a bug in getSuffix() in
ParseFactory.java.

We use full URLs including the query string, and at times the query string
may contain "/" or ".". Also, an anchor ("#") can be followed by any
characters in the URL.

Thus, to account for this, the function should be changed as follows:

- newurl = substring of url up to the first "#"
- newurl = substring of newurl up to the first "?"
        (this should give us a string that will be the "root" url)
- now look for the last "." in newurl and return from there to the end
  of the string.
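
The steps above can be sketched in Java roughly as follows. This is only an
illustration, not the actual ParseFactory code: the class name SuffixSketch
is made up, returning the suffix without the leading "." is an assumption,
and the check that the last "." comes after the last "/" is an extra guard
that is not in the list above.

```java
// Hypothetical stand-alone sketch; not the real ParseFactory implementation.
class SuffixSketch {

    /**
     * Returns the text after the last "." of the URL once the anchor and
     * the query string have been removed, or "" if no suffix is found.
     */
    static String getSuffix(String url) {
        // Step 1: drop the anchor, i.e. everything from the first "#".
        int hash = url.indexOf('#');
        String newUrl = (hash >= 0) ? url.substring(0, hash) : url;

        // Step 2: drop the query string, i.e. everything from the first "?".
        int query = newUrl.indexOf('?');
        if (query >= 0) {
            newUrl = newUrl.substring(0, query);
        }

        // Step 3: take whatever follows the last "." in the "root" URL.
        int dot = newUrl.lastIndexOf('.');
        int slash = newUrl.lastIndexOf('/');

        // Assumption (not in the proposal): a "." that sits before the last
        // "/" belongs to the host or a directory name, so it is not a suffix.
        if (dot < 0 || dot < slash) {
            return "";
        }
        return newUrl.substring(dot + 1);
    }
}
```

With this sketch, a "." or "/" inside the query string or the anchor no
longer affects the result, since both are cut off before the last "." is
looked up.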



 



