[jira] [Commented] (NUTCH-990) protocol-httpclient fails with short pages

Julien Nioche (JIRA) Fri, 29 Apr 2011 14:01:43 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027187#comment-13027187
 ]


Julien Nioche commented on NUTCH-990:
-------------------------------------

@Markus : yes - won't fix 

Ideally we should implement that as part of crawler-commons and use it as 
a dependency in Nutch. Crawler-commons has not received much interest lately, 
but I am sure that Ken would be interested. 

@gabriele : I can't find the issue; it was probably raised within a different 
one. httpclient is indeed currently the only way to handle HTTPS. However, it 
has absolutely nothing to do with PDF, as it is about protocols, not content. 
Did you mean something else?




> protocol-httpclient fails with short pages
> ------------------------------------------
>
>                 Key: NUTCH-990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-990
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: hadoop.log
>
>
> HTML pages containing only a few words work fine with protocol-http. But with 
> protocol-httpclient the same pages disappear from the index, although they 
> are still fetched.
> Those small files are useful for quick testing. 
> Steps to reproduce:
> $ svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 nutch-1.3
> Checked out revision 1097214.
> $ cd nutch-1.3
> $ xmlstarlet edit -L -u \
>     "/configuration/property[name='http.agent.name']"/value -v 'test' \
>     conf/nutch-default.xml
> $ ant
> Download to runtime/local the following script and seeds list file. They 
> assume a $HADOOP_HOME environment variable. It's a 1.3 adaptation of [1].
> http://dp4j.sf.net/debug/whole-web-crawling-incremental
> http://dp4j.sf.net/debug/urls
> $ cd runtime/local
> This will empty your Solr index (-f) and crawl:
> $ ./whole-web-crawling-incremental -f .
> Now check the Solr index by searching for "artificial" and you will find the 
> page pointed to in urls.
> Now change plugin.includes in conf/nutch-default.xml to use 
> protocol-httpclient instead of protocol-http and re-run the script. No more 
> results in Solr. Try again with protocol-http and the results return.
> [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
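
The plugin switch described in the quoted steps to reproduce could be scripted 
roughly as follows. This is only a sketch: set_protocol_plugin is a 
hypothetical helper, not part of Nutch, and it assumes that the <value> of 
plugin.includes sits on a single line of the config file. It rewrites any 
<value> line that mentions protocol-http or protocol-httpclient and leaves a 
.bak copy of the original file next to it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): point plugin.includes at
# protocol-http or protocol-httpclient.
# Assumption: the <value> element is on one line of the config file.
set_protocol_plugin() {
  conf_file=$1   # e.g. conf/nutch-default.xml
  plugin=$2      # protocol-http or protocol-httpclient
  # Match protocol-http or protocol-httpclient as a whole plugin name on
  # <value> lines and replace it with the requested plugin. The trailing
  # group keeps the separator (such as "|") that follows the name.
  sed -i.bak -E \
    "/<value>/ s/protocol-http(client)?([^a-z]|\$)/${plugin}\2/" \
    "$conf_file"
}

# Example calls, run from runtime/local:
#   set_protocol_plugin conf/nutch-default.xml protocol-httpclient
#   set_protocol_plugin conf/nutch-default.xml protocol-http
```

Because the pattern matches both plugin names, calling the helper twice with 
the same argument leaves the file unchanged, which makes it easy to flip back 
and forth between the two runs being compared.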

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
