[ https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-990.
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 1.3)
                   1.4

A patch has been committed recently that fixes the issues with compressed short pages - checkout the code from SVN
See https://issues.apache.org/jira/browse/NUTCH-1089

Note that protocol-httpclient still needs replacing and is considered broken.
See https://issues.apache.org/jira/browse/NUTCH-1086

> protocol-httpclient fails with short pages
> ------------------------------------------
>
>                 Key: NUTCH-990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-990
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: hadoop.log
>
>
> Using protocol-http with a few words html pages works fine. But with
> protocol-httpclient the same pages disappear from the index, although they
> are still fetched.
> Those small files are useful for quick testing.
>
> Steps to reproduce:
> $ svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 nutch-1.3
> Checked out revision 1097214.
> $ cd nutch-1.3
> $ xmlstarlet edit -L -u "/configuration/property[name='http.agent.name']"/value -v 'test' conf/nutch-default.xml
> $ ant
>
> Download to runtime/local the following script and seeds list file. They
> assume a $HADOOP_HOME environment variable. It's a 1.3 adaptation of [1].
> http://dp4j.sf.net/debug/whole-web-crawling-incremental
> http://dp4j.sf.net/debug/urls
> $ cd runtime/local
>
> This will empty your Solr index (-f) and crawl:
> $ ./whole-web-crawling-incremental -f .
>
> Now Check Solr index searching for artificial and you will find the page
> pointed to in urls.
>
> Now change plugin-includes in conf/nutch-default to use protocol-httpclient
> instead of protocol-http and re-run the script. No more results in solr. Try
> again with http and the results return.
> [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira