So, after some research I think one of the two issues I reported earlier can be fixed.
To refresh, the error in question is:

> fetch okay, but can't parse http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html, reason: Content-Type not application/pdf:

The problem is that this page did not specify its content type in the header, and the PDF plugin loads first and has "" for its path suffix. The same goes for the parse-html plugin. Therefore, when the fetcher cannot get the content type of a page (i.e. the page does not specify one), the PDF plugin gets called.

Now, it's easy to fix this by putting "PDF" for the pathSuffix of parse-pdf... until I read this in Matt Kangas' documentation of the HTML plugin (Wiki):

"This entry looks a bit strange with the empty pathSuffix value. But that just means that this plugin doesn't match any pathSuffix value. So, parse-html is only used when we fetch remote URLs, not anything residing on the local filesystem."

Focusing on the sentence "So,.....filesystem": does this mean it's best to leave the pathSuffix blank if we want the plugin invoked for remote URLs? This was a bit confusing. ***IS IT OKAY TO ADD PDF for the pathSuffix?

And lastly, I think there may be a bug in getSuffix() in ParseFactory.java. We use full URLs including the query string, which at times may contain "/" or ".". Also, an anchor "#" can be followed by any characters. To account for this, the function should be changed as follows:

- newurl = substring of url up to the first "#"
- newurl = substring of newurl up to the "?" (this should give us a string that will be the "root" url)
- now look for the last "." and return from there to the end of the string.
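
The three steps proposed for getSuffix() could look roughly like the sketch below. This is not the actual ParseFactory code: the class name SuffixSketch is made up for illustration, and two details are my own assumptions rather than part of the proposal, namely that the suffix is returned without the leading "." and that a "." in the host name or in a directory name does not count as a suffix.

```java
public class SuffixSketch {

    /**
     * Hypothetical stand-in for ParseFactory.getSuffix().
     * Returns the file extension of the URL path, or "" if there is none.
     */
    public static String getSuffix(String url) {
        String s = url;

        // Step 1: drop the anchor, i.e. everything from the first "#".
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }

        // Step 2: drop the query string, i.e. everything from the first "?".
        int query = s.indexOf('?');
        if (query >= 0) {
            s = s.substring(0, query);
        }

        // Assumption: skip "scheme://host" so that a dot in the host name
        // (e.g. "example.com") is never mistaken for a suffix.
        int schemeEnd = s.indexOf("://");
        int pathStart = s.indexOf('/', schemeEnd >= 0 ? schemeEnd + 3 : 0);
        if (pathStart < 0) {
            return "";
        }
        String path = s.substring(pathStart);

        // Step 3: take what follows the last ".", but only if that dot
        // sits in the final path segment (assumption, see lead-in).
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < slash) {
            return "";
        }
        return path.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(getSuffix(
            "http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html"));
        System.out.println(getSuffix(
            "http://example.com/docs/report.pdf?v=1.2#page.3"));
    }
}
```

With this, dots and slashes inside the query string or the anchor no longer affect the result, which is the case the steps above are meant to cover.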
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
