Hi, just a short note. A few weeks ago I described a problem that is closely related to yours:
Please take a look at http://sourceforge.net/mailarchive/message.php?msg_id=10249708. Maybe my considerations can help you find a working solution.

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chirag Chaman
> Sent: Tuesday, 28 December 2004 20:40
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] Fetch / Parse errors and a Bug
>
> So, after some research I think one of the 2 issues I
> reported earlier can get fixed.
>
> To refresh, the error in question is:
>
> fetch okay, but can't parse
> http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> reason: Content-Type not application/pdf:
>
> The problem is that this page did not specify its content
> type in the header, and the PDF plugin loads first and has a
> "" for its path suffix. The same goes for the parse-HTML plugin.
> Therefore, when the fetcher cannot get the content type of a
> page (i.e. the page does not specify the content type), the
> PDF plugin gets called.
>
> Now, it's easy to fix this by putting "PDF" for the pathsuffix
> for the parser-pdf...until I read this in Matt Kangas'
> documentation of the HTML plugin (Wiki):
>
> "This entry looks a bit strange with the empty pathSuffix
> value. But that just means that this plugin doesn't match any
> pathSuffix value. So, parse-html is only used when we fetch
> remote URLs, not anything residing on the local filesystem."
>
> Focusing on the sentence "So,.....filesystem": does this
> mean it's best to leave the pathsuffix blank if we want this
> invoked for remote URLs? This was a bit confusing.
>
> ***IS IT OKAY TO ADD PDF for the pathsuffix?
>
> And lastly, I think there may be a bug in the getSuffix() in
> ParseFactory.java.
>
> We use full URLs including the query string -- at times they may
> contain "/" or ".". Also, an anchor ("#") can be followed by
> any characters in the URL.
>
> Thus, to account for this, the function should be changed as follows:
>
> - newurl = substring of url up to the first "#"
> - newurl = substring of newurl up to "?"
>   (this should give us a string that will be the "root" url)
> - now look for the last "." and return from there to the end of the string.
>
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
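
The three steps proposed in the quoted message could be sketched in Java roughly as follows. This is only an illustration, not the actual ParseFactory code: the class name UrlSuffixSketch and the method signature are assumptions, and so are two choices the message leaves open. The suffix is returned without the leading ".", and a "." is only honored in the last path segment, so that the dots in a host name or in a directory such as 1.4.2 are not mistaken for a file extension.

```java
/**
 * Hypothetical sketch of the suffix extraction described in the quoted
 * message. Not taken from Nutch; names and edge-case handling are assumptions.
 */
final class UrlSuffixSketch {

    private UrlSuffixSketch() {}

    /** Returns the path extension without the dot, or "" if there is none. */
    static String getSuffix(String url) {
        // Step 1: drop the anchor, i.e. everything from the first '#'.
        int hash = url.indexOf('#');
        String root = (hash >= 0) ? url.substring(0, hash) : url;

        // Step 2: drop the query string, i.e. everything from the first '?'.
        int query = root.indexOf('?');
        if (query >= 0) {
            root = root.substring(0, query);
        }

        // Assumption: if the URL has a scheme, skip "scheme://host" so that
        // dots in the host name are never treated as an extension.
        String path = root;
        int schemeEnd = root.indexOf("://");
        if (schemeEnd >= 0) {
            int pathStart = root.indexOf('/', schemeEnd + 3);
            if (pathStart < 0) {
                return ""; // host only, no path
            }
            path = root.substring(pathStart);
        }

        // Step 3: take what follows the last '.', but only if that dot
        // sits in the last path segment and is not the final character.
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1);
    }

    public static void main(String[] args) {
        // The URL from the error message, plus a hypothetical one that has
        // '/' and '.' in its query string and anchor.
        System.out.println(getSuffix(
            "http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html"));
        System.out.println(getSuffix(
            "http://example.com/doc.pdf?file=a/b.c#sec.1"));
    }
}
```

With such a function, a "pdf" pathsuffix would only match URLs whose path really ends in .pdf, regardless of what the query string or anchor contains. Whether the comparison should also ignore case is left open here.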
