Sven: Yes, this is related. Bill Goffe seems to have had the same problem.
So here's the EASY fix. I tested it over the last few hours with 100k pages and it's working as it should. Simply add "pdf" and "doc" as the pathSuffix of parser-pdf and parser-doc respectively. In my opinion, no parser plugin should leave its pathSuffix blank unless it is meant to be the default handler, and HTML should be the only one.

I looked at where you mention that the content type lookup is case-sensitive. That is not correct. The HTTP protocol adds the Content-Type to a TreeMap that is initialized with the String.CASE_INSENSITIVE_ORDER comparator, so internally it does a case-insensitive match.

I think the problem is that no "content-type" was ever sent for the page. This leaves both the content type and the extension/suffix blank, and that causes the problem. Also, if a character set is not specified either, the fetcher fails as well (as it cannot write to disk). I think we need global defaults for when we encounter such a case: the content type should be set to text/html and the character set should be ISO-8859 or UTF-8.

Doug, since you initially wrote the HTTP protocol, what's the best way to proceed?

Thanks,
CC

Just as a side note, it would be AWESOME if we could specify the max fetch length based on the document type. 64k is way too small for a PDF (it causes PDFs to not be parsed), and 1MB, while okay for PDFs, is way too big for an HTML page. This could be easily implemented by adding a key to the plugin.xml for each parser.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sven Wende
Sent: Wednesday, December 29, 2004 5:38 AM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug

Hi,

just a short note. Some weeks ago I described a problem that is closely related to yours. Please take a look at http://sourceforge.net/mailarchive/message.php?msg_id=10249708 . Maybe my considerations can help to find a working solution.
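
For readers following the thread, here is a minimal sketch of the two points made in Chirag's reply above: header lookup through a TreeMap built with String.CASE_INSENSITIVE_ORDER, and falling back to global defaults when the server sends no content type or character set. It is not the Nutch code. The class name, method names, and the choice of UTF-8 as the fallback charset are assumptions made for illustration, and the syntax assumes a modern Java compiler.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Nutch. */
final class HeaderDefaultsSketch {
    // Assumed fallbacks, following the suggestion in the message above.
    static final String DEFAULT_CONTENT_TYPE = "text/html";
    static final String DEFAULT_CHARSET = "UTF-8";

    /** Creates a header map whose keys are compared without regard to case. */
    static Map<String, String> newHeaders() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    /** Returns the Content-Type header, or the default if it is missing or blank. */
    static String contentTypeOrDefault(Map<String, String> headers) {
        String value = headers.get("Content-Type");
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_CONTENT_TYPE;
        }
        return value.trim();
    }

    /** Returns the charset parameter of a Content-Type value, or the default if absent. */
    static String charsetOrDefault(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String charset = p.substring(8).trim();
                if (!charset.isEmpty()) {
                    return charset;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaders();
        headers.put("content-type", "application/pdf");
        // The lowercase key is still found under the mixed-case name.
        System.out.println(contentTypeOrDefault(headers));      // application/pdf
        // An empty header map falls back to the default.
        System.out.println(contentTypeOrDefault(newHeaders())); // text/html
        System.out.println(charsetOrDefault("text/html"));      // UTF-8
    }
}
```

With defaults like these, a page that sends no Content-Type header would be routed to the HTML parser instead of to whichever plugin happens to load first with an empty pathSuffix.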
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Chirag Chaman
> Sent: Tuesday, December 28, 2004 20:40
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] Fetch / Parse errors and a Bug
>
> So, after some research I think one of the 2 issues I reported earlier
> can be fixed.
>
> To refresh, the error in question is:
>
> fetch okay, but can't parse
> http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> reason: Content-Type not application/pdf:
>
> The problem is that this page did not specify its content type in the
> header, and the PDF plugin loads first and has "" for its path
> suffix. The same goes for the parse-HTML plugin.
> Therefore, when the fetcher cannot get the content type of a page (i.e.
> the page does not specify the content type), the PDF plugin gets
> called.
>
> Now, it's easy to fix this by putting "PDF" for the pathSuffix of the
> parser-pdf...until I read this in Matt Kangas'
> documentation of the HTML plugin (Wiki):
>
> "This entry looks a bit strange with the empty pathSuffix value. But
> that just means that this plugin doesn't match any pathSuffix value.
> So, parse-html is only used when we fetch remote URLs, not anything
> residing on the local filesystem."
>
> Focusing on the sentence "So,.....filesystem". Does this mean it's
> best to leave the pathSuffix blank if we want this invoked for remote
> URLs? This was a bit confusing.
>
> ***IS IT OKAY TO ADD PDF for the pathSuffix?
>
> And lastly, I think there may be a bug in getSuffix() in
> ParseFactory.java.
>
> We use full URLs including the query string, and at times these may
> contain "/" or ".". Also, an anchor ("#") can be followed by any
> characters in the URL.
>
> Thus, to account for this, the function should be changed as follows:
>
> - newurl = substring of url till the first "#"
> - newurl = substring of newurl till "?"
>   (this should give us a string that will be the "root" url)
> - now look for the last "."
>   and return till the end of the string.
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide Read honest & candid
> reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://productguide.itmanagersjournal.com/
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
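
The getSuffix() change proposed in the quoted message (cut at the first "#", then at "?", then take everything after the last ".") can be sketched roughly as follows. This is not the actual ParseFactory code. The class name is hypothetical, and the step that restricts the search to the last path segment is an extra assumption beyond the original proposal, added so that a dot in the host name or in a directory name is not mistaken for a file extension.

```java
/** Hypothetical sketch of the proposed suffix extraction; not the Nutch implementation. */
final class SuffixSketch {

    /** Returns the text after the last '.' in the final path segment, or "" if there is none. */
    static String getSuffix(String url) {
        String s = url;

        // Step 1: drop the anchor, since any characters may follow '#'.
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }

        // Step 2: drop the query string, which may itself contain '/' or '.'.
        int query = s.indexOf('?');
        if (query >= 0) {
            s = s.substring(0, query);
        }

        // Assumption beyond the proposal: skip the scheme and host so that
        // "http://example.com" does not yield "com".
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            int pathStart = s.indexOf('/', scheme + 3);
            if (pathStart < 0) {
                return ""; // URL has no path at all
            }
            s = s.substring(pathStart);
        }

        // Step 3: look for the last '.' in the final path segment.
        String lastSegment = s.substring(s.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return "";
        }
        return lastSegment.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(getSuffix(
            "http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html")); // html
        System.out.println(getSuffix(
            "http://example.com/report.pdf?file=a/b.c#sec.2"));                        // pdf
    }
}
```

A caller matching the result against a plugin's pathSuffix would probably also want to compare case-insensitively, since URLs ending in ".PDF" are common.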
