I checked the parser using this command:

bin/nutch parsechecker -dumpText http://dreamdj.altervista.org/
fetching: http://dreamdj.altervista.org/
http://dreamdj.altervista.org/ skipped. Content of size 76 was truncated to 69
Content is truncated, parse may fail!
parsing: http://dreamdj.altervista.org/
contentType: text/html
signature: 81826638e0e160ab22c74f9a4628221a
---------
Url
---------------

http://dreamdj.altervista.org/
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: http://dreamdj.altervista.org/a22.html anchor: a2
Content Metadata: ETag="115000d-45-4e94159b4a680" Vary=Accept-Encoding
Date=Tue, 22 Oct 2013 13:23:59 GMT Content-Length=76 Content-Encoding=gzip
Last-Modified=Mon, 21 Oct 2013 14:46:34 GMT Content-Type=text/html
Connection=close Accept-Ranges=bytes Server=Apache
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252
---------
ParseText
---------

and found this error line:

http://dreamdj.altervista.org/ skipped. Content of size 76 was truncated to 69

I tried both the protocol-http and protocol-httpclient plugins and got the
same result.

Then I tried the wget command:

wget --server-response http://dreamdj.altervista.org/
--2013-10-22 22:00:08--  http://dreamdj.altervista.org/
Resolving dreamdj.altervista.org (dreamdj.altervista.org)... 176.9.38.231
Connecting to dreamdj.altervista.org (dreamdj.altervista.org)|176.9.38.231|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 22 Oct 2013 14:01:24 GMT
  Server: Apache
  Last-Modified: Mon, 21 Oct 2013 14:46:34 GMT
  ETag: "115000d-45-4e94159b4a680"
  Accept-Ranges: bytes
  Vary: Accept-Encoding
  Content-Length: 69
  Keep-Alive: timeout=1, max=100
  Connection: Keep-Alive
  Content-Type: text/html
Length: 69 [text/html]
Saving to: `index.html.4'

This time the content length is correct (69), because wget does not request
gzip encoding by default.

The issue is that the protocol-http plugin adds an Accept-Encoding header to
the HTTP request, like this:

reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n");

so the server returns gzip-compressed content. For a page this small, the
compressed content is larger than the decompressed content: here the
compressed size is 76 and the decompressed size is 69. The Content-Length
header reports the compressed size, while Nutch counts the decompressed
bytes, so the parser thinks the content was truncated:

    int actualSize = contentBytes.length;
    if (inHeaderSize > actualSize) {
      LOG.info(url + " skipped. Content of size " + inHeaderSize
          + " was truncated to " + actualSize);
      return true;
    }

So this page is skipped.
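To see how a small body can grow on the wire, note that the gzip format adds a fixed 10-byte header and 8-byte trailer plus deflate block overhead, which dominates for tiny payloads. A minimal illustration (the 69-byte payload here is random filler standing in for the page body, not the actual page):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class GzipOverhead {
    // Compress input with gzip and return the compressed bytes.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 69 bytes of incompressible (random) filler.
        byte[] body = new byte[69];
        new Random(42).nextBytes(body);
        byte[] compressed = gzip(body);
        // gzip adds an 18-byte header/trailer, so a tiny body
        // routinely comes out larger than it went in.
        System.out.println("decompressed=" + body.length
            + " compressed=" + compressed.length);
    }
}
```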

Maybe this is a bug in the plugin.
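One possible fix (just a sketch, not actual Nutch code; the method name isTruncated and the contentEncoding parameter are my own) would be to skip the size check when the response was content-encoded, since Content-Length then counts compressed bytes while the buffer holds decoded bytes, and the two sizes are not comparable:

```java
public class TruncationCheck {
    /**
     * Hypothetical guard around the size check quoted above. When the
     * response carried Content-Encoding: gzip or deflate, Content-Length
     * refers to the compressed stream, so comparing it against the
     * decoded body length would give false "truncated" results.
     */
    static boolean isTruncated(String contentEncoding, int inHeaderSize,
                               byte[] contentBytes) {
        if (contentEncoding != null
                && contentEncoding.toLowerCase().matches(".*(gzip|deflate).*")) {
            return false; // sizes not comparable; assume not truncated
        }
        return inHeaderSize > contentBytes.length;
    }
}
```

With the numbers from this report, isTruncated("gzip", 76, new byte[69]) would return false instead of skipping the page.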


On Tue, Oct 22, 2013 at 7:04 PM, ozzy19 <[email protected]> wrote:

> Ok thanks for the help!
> I have another problem I'm trying to crawl this example of a site:
> http://dreamdj.altervista.org/
> with the following command:
>
> nutch crawl -dir crawl  -depth 5 -topN 3
>
> Why do I get only the first page? Other links do not appear in the results!
> This is the file nutch-site:
>  <property>
>   <name>http.agent.name</name>
>   <value>NLP</value>
>  </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>NLP,*</value>
> </property>
>
> <property>
>   <name>plugin.folders</name>
>   <value>/home/enzo/Scrivania/nutch/apache-nutch-1.7/src/plugin</value>
> </property>
>
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>regex-urlfilter.txt</value>
> </property>
>
>
> I left the "regex-urlfilter" file at its default.
> Why doesn't it capture the other links?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/help-me-with-nutch-tp4095914p4096998.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
