Hi Dogacan,

  We are pretty sure. We were having problems with 3 urls. We put some debug
statements in HttpResponse.java. This is what we got:

URL = http://perso0.free.fr/cgi-bin/guestbook.pl?login=kobudo.okinawa
cType = 'text/plain'
mimeType = 'text/plain'
allowed download limit for this mimetype is 0
download file is smalled then the max therefore setting actual filesize as
download limit
Download size 282
URL = http://www.prospect-magazine.co.uk/list.php?related_article=9635
cType = 'text/html; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647
cType = 'text/plain; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647
URL = http://www.muschihaus.de/vol4/templates/guestbook.php?name=Guestbook&image=guestbook
cType = 'text/html; charset=UTF-8'
setting filesize as Integer.Max
Download size 2147483647
cType = 'text/html; charset=ISO-8859-1'
setting filesize as Integer.Max
Download size 2147483647

From this, I inferred that the cType is not set correctly to "text/html"
here. Also, the content limit is set to Integer.Max, and the
http.content.limit (64K) is ignored for 2 of the urls.
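
A minimal sketch of the workaround proposed in the quoted message below
(split on ";" and keep the first part), assuming the lookup only needs the
bare media type and not its parameters. The names ContentTypeSketch and
baseType are hypothetical and not part of Nutch; the returned value is what
would be passed to the forName lookup instead of the raw header value.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: strips parameters such as
// "; charset=ISO-8859-1" from a Content-Type header value.
class ContentTypeSketch {

  /**
   * Returns the bare media type (e.g. "text/html"), or null if the
   * header value is null or contains no media type.
   */
  static String baseType(String contentType) {
    if (contentType == null) {
      return null;
    }
    // Everything after the first ';' is a parameter list (charset, boundary, ...).
    int semicolon = contentType.indexOf(';');
    String base = (semicolon >= 0) ? contentType.substring(0, semicolon) : contentType;
    // Media types are case-insensitive, so normalize before any lookup.
    base = base.trim().toLowerCase(Locale.ROOT);
    return base.isEmpty() ? null : base;
  }
}
```

With this, "text/html; charset=ISO-8859-1" and "text/html" would both be
looked up as "text/html".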

Regards,

-vishal.
-----Original Message-----
From: Dogacan Güney [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 21, 2007 4:44 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: http.content.limit not respected when the Content-Type header
has charset attributes

On 6/21/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
>  Hi,
>
>   Many of the urls we crawl have headers that look like this:
>
> Connection: close
> Date: Thu, 21 Jun 2007 09:28:42 GMT
> Accept-Ranges: bytes
> ETag: "2c0c3-650-cc1eb800"
> Server: Apache/2.0.40 (Red Hat Linux)
> Content-Length: 1616
> Content-Type: text/html; charset=ISO-8859-1
> Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
> Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
> Client-Peer: 202.141.129.22:80
> Client-Response-Num: 1
>
> In this case, the cType variable is set to "text/html; charset=ISO-8859-1"
> in HttpResponse.java (for both protocol-http and protocol-httpclient). In
> this case, the mimeType cannot be found correctly in HttpResponse.java. I am
> talking about this piece of code here:
>
>      /*
>       * Extract the content type from the response and then look for its
>       * mimetype preferences specified in mime-type.xml
>       */
>      String ctype = headers.get(Response.CONTENT_TYPE);
>      int downloadSize = 0;
>      if (ctype != null && (mimeType = http.getMimeTypes().forName(ctype)) != null) {
>
> In this case, the ctype should actually be set to just "text/html".
> Currently, since it's set to "text/html; charset=ISO-8859-1", the mimeType
> variable comes out as null. Thus neither the content limit specified
> in mimetypes.xml nor the http.content.limit setting is respected for these
> documents.
>
> One solution to the problem is to actually check the cType, split on ";" and
> take the first part to look up the mimeType. Anyone got any other ideas?

Are you sure about this? I haven't examined the code there carefully;
however, I tested a crawl with a sample url: http://www.metu.edu.tr/

Page returns these headers:

Date: Thu, 21 Jun 2007 10:59:35 GMT
Server: Apache
X-Powered-By: PHP/5.1.4
Connection: close
Content-Type: text/html; charset=ISO-8859-9

and this is the output of readseg -get:

Content::
Version: 2
url: http://www.metu.edu.tr/
base: http://www.metu.edu.tr/
contentType: text/html
metadata: X-Powered-By=PHP/5.1.4 Connection=close
nutch.segment.name=20070621125200 nutch.crawl.score=1.0 Date=Thu, 21
Jun 2007 10:52:13 GMT Server=Apache Content-Type=text/html;
charset=ISO-8859-9
Content:
...

Content-type seems to be picked up correctly.

btw, there is already a StringUtil.parseCharacterEncoding that is
designed to parse the encoding part of Content-Type header.
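
For illustration, a minimal sketch of what extracting the charset attribute
from a Content-Type value involves. This is not the actual
StringUtil.parseCharacterEncoding implementation; the names CharsetSketch and
charsetOf are hypothetical.

```java
// Hypothetical helper, not Nutch's StringUtil: pulls the charset parameter
// out of a Content-Type header value.
class CharsetSketch {

  /** Returns the charset value (e.g. "ISO-8859-9"), or null if none is present. */
  static String charsetOf(String contentType) {
    if (contentType == null) {
      return null;
    }
    // parts[0] is the media type; the remaining parts are name=value parameters.
    String[] parts = contentType.split(";");
    for (int i = 1; i < parts.length; i++) {
      String param = parts[i].trim();
      int eq = param.indexOf('=');
      if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
        String value = param.substring(eq + 1).trim();
        // Parameter values may be quoted; drop one surrounding pair of quotes.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
      }
    }
    return null;
  }
}
```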

(Also, I couldn't find the code you were mentioning. Where is it, exactly?)

>
> -vishal.
>


-- 
Dogacan Güney


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
