RE: [Nutch-dev] Fetch / Parse errors and a Bug

Chirag Chaman Wed, 29 Dec 2004 04:56:02 -0800

Swen:

Yes, this is related. Bill Goffe seems to have had the same problem.


So here's the EASY fix. I tested it over the last few hours with 100k pages
and it's working as it should. Simply set the pathSuffix of parser-pdf and
parser-doc to "pdf" and "doc" respectively. In my opinion no parser plugin
should leave its pathSuffix blank unless it wants to be the default handler,
and HTML should be the only one that does.

I looked at where you mention that the content type lookup is
case-sensitive. That is not correct. The HTTP protocol adds the
Content-Type to a TreeMap that is initialized with the
String.CASE_INSENSITIVE_ORDER comparator, so lookups on it are
case-insensitive.
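
To illustrate the point about the comparator, here is a minimal standalone
sketch (not the actual Nutch code; the class and method names are made up for
this example) showing that a TreeMap built with String.CASE_INSENSITIVE_ORDER
finds a header regardless of the case used in the lookup key:

```java
import java.util.Map;
import java.util.TreeMap;

public class HeaderLookup {
    // Hypothetical helper: builds a header map whose keys compare
    // case-insensitively, as described for the HTTP protocol plugin.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-type", "application/pdf");

        // Both lookups hit the same entry even though the case differs.
        System.out.println(headers.get("content-type"));
        System.out.println(headers.get("CONTENT-TYPE"));

        // A header that was never sent is simply absent (null).
        System.out.println(headers.get("Content-Length"));
    }
}
```

So a missing header shows up as null, not as a case mismatch.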

I think the problem is that the page never sent a "Content-Type" header.
That leaves both the content type and the extension/suffix blank, which
causes the problem. Also, if no character set is specified either, the
fetcher fails as well (as it cannot write to disk).

I think we need global defaults for this case: the content type should be
set to text/html and the character set to ISO-8859-1 or UTF-8.
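
A rough sketch of what such defaults could look like, assuming the raw
Content-Type header value is available as a String (or null when absent).
The class name, method names and the choice of UTF-8 over the ISO charset
are assumptions for illustration only, not existing Nutch code:

```java
public class ContentDefaults {
    // Assumed global defaults; the actual values would be a project decision.
    static final String DEFAULT_CONTENT_TYPE = "text/html";
    static final String DEFAULT_CHARSET = "UTF-8";

    // Returns the header value, or the default when the header is missing or blank.
    static String contentTypeOrDefault(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return DEFAULT_CONTENT_TYPE;
        }
        return headerValue.trim();
    }

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or returns the default if there is none.
    static String charsetOrDefault(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String cs = p.substring(8).trim();
                    // Drop optional surrounding double quotes.
                    if (cs.length() >= 2 && cs.startsWith("\"") && cs.endsWith("\"")) {
                        cs = cs.substring(1, cs.length() - 1);
                    }
                    if (!cs.isEmpty()) {
                        return cs;
                    }
                }
            }
        }
        return DEFAULT_CHARSET;
    }
}
```

With something like this the fetcher would always have a content type and a
character set to work with, even for pages that send neither.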

Doug, since you initially wrote the HTTP protocol, what's the best way to
proceed?

Thanks
CC

Just as a side note, it would be AWESOME if we could specify the max fetch
length based on the document type. 64k is way too small for a PDF (it causes
PDFs to not be parsed), and 1MB, while okay for PDFs, is way too big for an
HTML page. This could be easily implemented by adding a key to the
plugin.xml for each parser.
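
As a sketch of the idea only: the lookup below keeps a per-type limit in a
map and falls back to a default. In the proposal the values would come from a
key in each parser's plugin.xml; here they are hard-coded, and the class name,
method name and the limits themselves (64k default, 1MB for PDF) are
assumptions taken from the numbers above, not existing Nutch settings.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FetchLimits {
    // Assumed default limit for types without their own entry (64k).
    static final int DEFAULT_LIMIT = 64 * 1024;

    // Hypothetical per-type limits; in the proposal these would be read
    // from each parser plugin's plugin.xml instead of being hard-coded.
    private static final Map<String, Integer> LIMITS = new HashMap<>();
    static {
        LIMITS.put("application/pdf", 1024 * 1024);
    }

    // Returns the max fetch length in bytes for a Content-Type value,
    // ignoring parameters such as "; charset=..." and letter case.
    static int maxLengthFor(String contentType) {
        if (contentType == null) {
            return DEFAULT_LIMIT;
        }
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        Integer limit = LIMITS.get(base);
        return limit != null ? limit : DEFAULT_LIMIT;
    }
}
```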



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Sven
Wende
Sent: Wednesday, December 29, 2004 5:38 AM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug

Hi,

Just a short note. Some weeks ago I described a problem that
strongly correlates with yours:

Please take a look at
http://sourceforge.net/mailarchive/message.php?msg_id=10249708 !

Maybe my considerations can help to find a working solution.

 

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Chirag Chaman
> Sent: Dienstag, 28. Dezember 2004 20:40
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> So, after some research I think one of the 2 issues I reported earlier 
> can be fixed.
> 
> To refresh, the error in question is:
> > fetch okay, but can't parse
> http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> reason: Content-Type not application/pdf:
> 
> The problem is that this page did not specify its content type in the 
> header, and the PDF plugin loads first and has "" for its path 
> suffix. The same goes for the parse-html plugin.
> Therefore, when the fetcher cannot get the content type of a page (i.e. 
> the page does not specify the content type), the PDF plugin gets 
> called.
> 
> Now, it's easy to fix this by putting "PDF" for the pathSuffix of the 
> parser-pdf... until I read this in Matt Kangas' 
> documentation of the HTML plugin (Wiki):
> 
> "This entry looks a bit strange with the empty pathSuffix value. But 
> that just means that this plugin doesn't match any pathSuffix value. 
> So, parse-html is only used when we fetch remote URLs, not anything 
> residing on the local filesystem."
> 
> Focusing on the sentence "So, ... filesystem": does this mean it's 
> best to leave the pathSuffix blank if we want the plugin invoked for 
> remote URLs?  This was a bit confusing.
> 
> ***IS IT OKAY TO ADD PDF for the pathsuffix?  
> 
> 
> 
> And lastly, I think there may be a bug in getSuffix() in 
> ParseFactory.java.
> 
> We use full URLs including the query string, which at times may contain 
> "/" or ".". Also, an anchor "#" can be followed by any characters.
> 
> Thus, to account for this, the function should be changed as follows:
> 
> - newurl = substring of url up to the first "#" 
> - newurl = substring of newurl up to the first "?" 
>       (this should give us a string that will be the "root" url)
> - now look for the last "." and return from there to the end of the string.
> 
> 
> 
>  
> 
> 
> 
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide Read honest & candid 
> reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now. 
> http://productguide.itmanagersjournal.com/
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
> 
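
The getSuffix() steps proposed in the original message above could be
sketched roughly as follows. This is a standalone illustration, not the
actual ParseFactory code: the class name is made up, the suffix is returned
without the leading "." (so it can be compared against values like "pdf"),
and the check that the "." comes after the last "/" is an extra guard that
was not part of the proposal.

```java
public class UrlSuffix {
    // Returns the path suffix of a URL (e.g. "pdf"), or "" if there is none.
    static String getSuffix(String url) {
        String s = url;

        // Step 1: drop the anchor, i.e. everything from the first "#".
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }

        // Step 2: drop the query string, i.e. everything from the first "?".
        int query = s.indexOf('?');
        if (query >= 0) {
            s = s.substring(0, query);
        }

        // Step 3: take what follows the last ".".
        int dot = s.lastIndexOf('.');
        int slash = s.lastIndexOf('/');
        // Assumption (not in the proposal): ignore a "." that belongs to an
        // earlier path segment, and a trailing "." with nothing after it.
        if (dot < 0 || dot < slash || dot == s.length() - 1) {
            return "";
        }
        return s.substring(dot + 1);
    }

    public static void main(String[] args) {
        // The "." and "/" inside the query and anchor no longer matter.
        System.out.println(getSuffix("http://host/a/file.pdf?x=1.2#frag.html"));
    }
}
```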




