I checked the parser with this command:

bin/nutch parsechecker -dumpText http://dreamdj.altervista.org/

fetching: http://dreamdj.altervista.org/
http://dreamdj.altervista.org/ skipped. Content of size 76 was truncated to 69
Content is truncated, parse may fail!
parsing: http://dreamdj.altervista.org/
contentType: text/html
signature: 81826638e0e160ab22c74f9a4628221a
---------
Url
---------------
http://dreamdj.altervista.org/
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: http://dreamdj.altervista.org/a22.html anchor: a2
Content Metadata: ETag="115000d-45-4e94159b4a680" Vary=Accept-Encoding Date=Tue, 22 Oct 2013 13:23:59 GMT Content-Length=76 Content-Encoding=gzip Last-Modified=Mon, 21 Oct 2013 14:46:34 GMT Content-Type=text/html Connection=close Accept-Ranges=bytes Server=Apache
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
---------
ParseText
---------

and found this error line:

http://dreamdj.altervista.org/ skipped. Content of size 76 was truncated to 69

I tried both the protocol-http and protocol-httpclient plugins and got the same result. I also tried wget:

wget --server-response http://dreamdj.altervista.org/
--2013-10-22 22:00:08--  http://dreamdj.altervista.org/
Resolving dreamdj.altervista.org (dreamdj.altervista.org)... 176.9.38.231
Connecting to dreamdj.altervista.org (dreamdj.altervista.org)|176.9.38.231|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 22 Oct 2013 14:01:24 GMT
  Server: Apache
  Last-Modified: Mon, 21 Oct 2013 14:46:34 GMT
  ETag: "115000d-45-4e94159b4a680"
  Accept-Ranges: bytes
  Vary: Accept-Encoding
  Content-Length: 69
  Keep-Alive: timeout=1, max=100
  Connection: Keep-Alive
  Content-Type: text/html
Length: 69 [text/html]
Saving to: `index.html.4'

wget sees the correct content length (69). The issue is that the protocol-http plugin adds an Accept-Encoding header to the HTTP request:

reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n");

so the server returns gzip-compressed content. For a page this small, the compressed content is larger than the decompressed content: here the compressed size (reported in the Content-Length header) is 76 bytes, while the decompressed size is 69 bytes. The truncation check then compares the size from the header against the decompressed byte count:

int actualSize = contentBytes.length;
if (inHeaderSize > actualSize) {
  LOG.info(url + " skipped. Content of size " + inHeaderSize
      + " was truncated to " + actualSize);
  return true;
}

so the parser thinks the content was truncated and the page is skipped. This may be a bug.

On Tue, Oct 22, 2013 at 7:04 PM, ozzy19 <[email protected]> wrote:
> Ok, thanks for the help!
> I have another problem. I'm trying to crawl this example site:
> http://dreamdj.altervista.org/
> with the following command:
>
> nutch crawl -dir crawl -depth 5 -topN 3
>
> Why do I get only the first page? The other links do not appear in the
> results! This is the nutch-site file:
>
> <property>
>   <name>http.agent.name</name>
>   <value>NLP</value>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>NLP,*</value>
> </property>
>
> <property>
>   <name>plugin.folders</name>
>   <value>/home/enzo/Scrivania/nutch/apache-nutch-1.7/src/plugin</value>
> </property>
>
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>regex-urlfilter.txt</value>
> </property>
>
> I left the "regex-urlfilter" file at its default. Why doesn't it capture
> the other links?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/help-me-with-nutch-tp4095914p4096998.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
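P.S. The size inversion above (compressed 76 bytes > decompressed 69 bytes) is easy to reproduce standalone: gzip adds roughly 18 bytes of fixed header/trailer overhead, which dominates for tiny pages. This is just an illustration with a made-up page body, not Nutch code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipOverheadDemo {
    public static void main(String[] args) throws IOException {
        // A tiny HTML page, roughly the size of the one in question.
        byte[] page = "<html><body><a href=\"a22.html\">a2</a></body></html>"
                .getBytes("UTF-8");

        // Compress it the way an Apache server with mod_deflate would.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(page);
        }
        byte[] compressed = buf.toByteArray();

        // For content this small, the gzip header/trailer outweighs any
        // savings, so the compressed size exceeds the original size.
        System.out.println("original size:   " + page.length);
        System.out.println("compressed size: " + compressed.length);
    }
}
```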

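One possible guard would be to skip the size comparison when Content-Encoding indicates the body was compressed, since in that case the Content-Length header counts compressed bytes while the buffered content holds decompressed bytes. This is only a sketch with hypothetical names (isTruncated, contentEncoding), not a patch against the real plugin code:

```java
public class TruncationCheck {
    // Hypothetical helper, NOT actual Nutch code: report truncation only
    // when the response was not compressed, because for compressed
    // responses the header size and the decompressed size are not
    // comparable quantities.
    static boolean isTruncated(byte[] contentBytes, int inHeaderSize,
                               String contentEncoding) {
        boolean compressed = contentEncoding != null
                && (contentEncoding.contains("gzip")
                        || contentEncoding.contains("deflate"));
        return !compressed && inHeaderSize > contentBytes.length;
    }

    public static void main(String[] args) {
        byte[] decompressed = new byte[69];
        // gzip response: header says 76 (compressed), body decompresses to 69
        System.out.println(isTruncated(decompressed, 76, "gzip"));  // false
        // uncompressed response: header says 76 but only 69 bytes arrived
        System.out.println(isTruncated(decompressed, 76, null));    // true
    }
}
```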
