Re: [VOTE] Release Apache Nutch 1.8RC#2

2014-03-16 Thread Julien Nioche
+1 from me. Thanks everyone

On Sunday, 16 March 2014, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1 from me!

 SIGS pass, CHECKSUMS pass:

 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/stage_apache_rc
 apache-nutch 1.8-bin https://dist.apache.org/repos/dist/dev/nutch/
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100 79.7M  100 79.7M0 0   894k  0  0:01:31  0:01:31 --:--:--
 926k
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100   836  100   8360 0   2291  0 --:--:-- --:--:-- --:--:--
 2902
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 10078  100780 0214  0 --:--:-- --:--:-- --:--:--
 268
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100 81.0M  100 81.0M0 0   828k  0  0:01:40  0:01:40 --:--:--
 809k
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100   836  100   8360 0   2399  0 --:--:-- --:--:-- --:--:--
 3051
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 10075  100750 0201  0 --:--:-- --:--:-- --:--:--
 255
 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/stage_apache_rc
 apache-nutch 1.8-src https://dist.apache.org/repos/dist/dev/nutch/
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100 2692k  100 2692k0 0   602k  0  0:00:04  0:00:04 --:--:--
 646k
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100   836  100   8360 0   2306  0 --:--:-- --:--:-- --:--:--
 2912
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 10078  100780 0204  0 --:--:-- --:--:-- --:--:--
 255
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100 4547k  100 4547k0 0   564k  0  0:00:08  0:00:08 --:--:--
 671k
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 100   836  100   8360 0   2182  0 --:--:-- --:--:-- --:--:--
 2814
   % Total% Received % Xferd  Average Speed   TimeTime Time
 Current
  Dload  Upload   Total   SpentLeft
 Speed
 10075  100750 0203  0 --:--:-- --:--:-- --:--:--
 268
 [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/verify_gpg_sigs
 Verifying Signature for file apache-nutch-1.8-bin.tar.gz.asc
 gpg: Signature made Tue Mar 11 14:23:44 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-bin.zip.asc
 gpg: Signature made Tue Mar 11 14:25:56 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-src.tar.gz.asc
 gpg: Signature made Tue Mar 11 14:26:28 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:  There is no indication that the signature belongs to the
 owner.
 Primary key fingerprint: DB7B 5199 121C 08A5 C8F4  052B 3A47 17F0 48BA EBF6
 Verifying Signature for file apache-nutch-1.8-src.zip.asc
 gpg: Signature made Tue Mar 11 14:26:44 2014 PDT using RSA key ID 48BAEBF6
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY)
 lewi...@apache.org 

[jira] [Commented] (NUTCH-1737) Upgrade to recent JUnit 4.x

2014-03-16 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937115#comment-13937115
 ] 

Lewis John McGibbney commented on NUTCH-1737:
-

I'll be happy to take this on later Seb. It is a PITA but it is time well 
invested.

 Upgrade to recent JUnit 4.x
 ---

 Key: NUTCH-1737
 URL: https://issues.apache.org/jira/browse/NUTCH-1737
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.8
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1737-trivial.patch


 While trunk still remains on JUnit 3.8.1, 2.x uses JUnit 4.11 and has already 
 upgraded tests (NUTCH-1573). This makes it difficult to port tests which use 
 JUnit 4 features from 2.x to trunk. There are two solutions:
 # (lightweight, trivial patch attached) upgrade only ivy dependency, upgrade 
 tests later
 # upgrade also all tests to use JUnit 4 annotations and setup(), cf. 
 NUTCH-1573



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1734) Make SolrIndexWriter more intelligent

2014-03-16 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937137#comment-13937137
 ] 

Lewis John McGibbney commented on NUTCH-1734:
-

Excellent to see you log this issue [~la...@protulae.com]. We can keep 
discussion of the issue here.

 Make SolrIndexWriter more intelligent
 -

 Key: NUTCH-1734
 URL: https://issues.apache.org/jira/browse/NUTCH-1734
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.7, 2.2.1
Reporter: Lajos Moczar
Priority: Minor

 The current mapping of the NutchDocument to SolrDocument is based on the 
 fields in the former which potentially can cause problems when you are using 
 an existing Solr schema:
 1) the existing logic requires Solr to support all Nutch fields, which might 
 not be the case (like segment).
 2) you can map a Nutch field to at most 2 Solr fields (i.e. one via a field 
 and one via a copy tag), because the source attribute is the Map key and 
 therefore you can only have one.
 Additionally, it would be nice to support some level of transformations, 
 literals, etc, like used in Solr DIH.
 I propose to make the code more intelligent so that, while supporting the 
 existing strict mapping that people are used to, it allows more flexible and 
 intelligent mapping. It will also include a transformation architecture that 
 can be expanded over time.
 The general approach is to reverse the building of the SolrDocument, and 
 populate the doc based on the Solr destination fields as defined in 
 solrindex-mapping.xml, i.e., it populates the doc based on what the target 
 Solr wants to receive, not just what Nutch wants to send. The Map of fields 
 in solrindex-mapping.xml will be keyed by dest, i.e. the Solr field name, not 
 source. That way one can map a source to multiple destinations. A mapping 
 type attribute (defaults to just a simple copy from Nutch to Solr) will 
 support literals and transformations.
 Note that a default strict mapping (i.e. the Solr schema by default MUST 
 support all NutchDocument fields) will be supported for backwards 
 compatibility. I assume this will be what people want.
 I will submit patches in due course.
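
The destination-keyed approach described above can be sketched roughly as follows. This is an illustration only, not the proposed patch: the names DestKeyedMapping, FieldRule and buildSolrFields, and the "literal" rule type, are hypothetical, and plain Maps stand in for NutchDocument and SolrDocument.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DestKeyedMapping {

  /** One rule per Solr destination field (hypothetical structure). */
  static final class FieldRule {
    final String type;   // "copy" (the default behaviour) or "literal"
    final String value;  // source field name for "copy", constant for "literal"

    FieldRule(String type, String value) {
      this.type = type;
      this.value = value;
    }
  }

  /**
   * Builds the outgoing document by walking the destination fields,
   * so only what the target schema asks for is populated.
   */
  static Map<String, Object> buildSolrFields(Map<String, FieldRule> rulesByDest,
                                             Map<String, Object> nutchDoc) {
    Map<String, Object> solrDoc = new LinkedHashMap<>();
    for (Map.Entry<String, FieldRule> entry : rulesByDest.entrySet()) {
      FieldRule rule = entry.getValue();
      if ("literal".equals(rule.type)) {
        solrDoc.put(entry.getKey(), rule.value);
      } else {
        Object sourceValue = nutchDoc.get(rule.value);
        if (sourceValue != null) {
          solrDoc.put(entry.getKey(), sourceValue);
        }
      }
    }
    return solrDoc;
  }

  public static void main(String[] args) {
    Map<String, FieldRule> rules = new LinkedHashMap<>();
    rules.put("title", new FieldRule("copy", "title"));
    rules.put("title_sort", new FieldRule("copy", "title")); // same source, second destination
    rules.put("title_text", new FieldRule("copy", "title")); // and a third
    rules.put("origin", new FieldRule("literal", "nutch"));

    Map<String, Object> nutchDoc = new LinkedHashMap<>();
    nutchDoc.put("title", "Apache Nutch");
    nutchDoc.put("segment", "20140316"); // no rule names it, so it is never sent

    System.out.println(buildSolrFields(rules, nutchDoc));
  }
}
```

Because the Map is keyed by destination, one source field can feed any number of Solr fields, and a Nutch field such as segment that the Solr schema does not define is simply left out.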





[jira] [Updated] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1736:


Fix Version/s: 1.9
   2.3

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2014-03-16 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937257#comment-13937257
 ] 

Sebastian Nagel commented on NUTCH-1647:


Seems to be a duplicate of NUTCH-1736: {{nutch1.7.patch}} fixes fetch of 
{{http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/}}


 protocol-http throws unzipBestEffort returned null for some pages
 -

 Key: NUTCH-1647
 URL: https://issues.apache.org/jira/browse/NUTCH-1647
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.7
Reporter: Markus Jelsma
 Fix For: 1.9


 bin/nutch indexchecker 
 http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale  
 Fetch failed with protocol status: exception(16), lastModified=0: 
 java.io.IOException: unzipBestEffort returned null
 {code}
 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.host = null
 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.port = 8080
 2013-10-01 13:44:55,612 INFO  http.Http - http.timeout = 12000
 2013-10-01 13:44:55,612 INFO  http.Http - http.content.limit = 5242880
 2013-10-01 13:44:55,612 INFO  http.Http - http.agent = Mozilla/5.0 
 (compatible; OpenindexSpider; 
 +http://www.openindex.io/en/webmasters/spider.html)
 2013-10-01 13:44:55,612 INFO  http.Http - http.accept.language = 
 en-us,en-gb,en;q=0.7,*;q=0.3
 2013-10-01 13:44:55,613 INFO  http.Http - http.accept = 
 text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output
 java.io.IOException: unzipBestEffort returned null
 at 
 org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
 at 
 org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:164)
 at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
 at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:86)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:150)
 {code}
 Haven't got a clue yet as to what the exact issue is.





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937269#comment-13937269
 ] 

Sebastian Nagel commented on NUTCH-1736:


Thanks, [~yangshangchuan] for taking the time to analyze the problem. The patch 
also fixes NUTCH-1647.

Any ideas, why {{http.content.limit}} must not be -1?

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[Nutch Wiki] Trivial Update of Release_HOWTO by LewisJohnMcgibbney

2014-03-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Release_HOWTO page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=29&rev2=30

  -rw-rw-r--  1 mary mary  222 Mar  1 14:23 apache-nutch-1.8-src.zip.sha
  }}}
  1. Make sure that '''all artifacts are editable by fellow 
committers''' e.g. chmod 775
- 1. Check out the release management area at 
https://dist.apache.org/repos/dist/dev/nutch/ and copy all artifacts to here 
then commit this.
+ 1. Check out the release management area at 
https://dist.apache.org/repos/dist/dev/nutch/{release.version} and copy all 
artifacts to here then commit this.
  1. Make sure your pgp key is listed in the Nutch KEYS file located at 
http://www.apache.org/dist/nutch/KEYS
- 1. Create and open a VOTE thread on user@ and dev@nutch.apache.org. 
The VOTE must pass with 3 +1 binding VOTE's before any release can take place.
+ 1. Create and open a VOTE thread on user@ and dev@nutch.apache.org. 
The VOTE must pass with 3 +1 binding VOTE's before any release can take place. 
A VOTE thread usually takes the form
+ {{{
+ Hi user@ & dev@,
+ 
+ This thread is a VOTE for releasing Apache Nutch 1.8 RC#2. The release 
candidate comprises the following components.
+ 
+ * A staging repository [0] containing various Maven artifacts
+ * A branch-1.8 of the trunk code [1]
+ * The tagged source upon which we are VOTE'ing [2]
+ * Finally, the release artifacts [3] which i would encourage you to verify 
for signatures and test.
+ 
+ You should use the following KEYS [4] file to verify the signatures of all 
release artifacts.
+ 
+ Please VOTE as follows
+ 
+ [ ] +1 Push the release, I am happy :)
+ [ ] +0 I am not bothered either way
+ [ ] -1 I am not happy with this release candidate (please state why)
+ 
+ Firstly thank you to everyone that contributed to Nutch. Secondly, thank you 
to everyone that VOTE's. It is appreciated.
+ 
+ Thanks
+ Lewis
+ (on behalf of Nutch PMC)
+ 
+ p.s. Here's my +1
+  
+ [0] https://repository.apache.org/content/repositories/orgapachenutch-1001/
+ [1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.8
+ [2] https://svn.apache.org/repos/asf/nutch/tags/release-1.8RC%232/
+ [3] https://dist.apache.org/repos/dist/dev/nutch/
+ [4] https://dist.apache.org/repos/dist/dev/nutch/KEYS
+ }}}
+ 1. Once the 72 hour period expires it is time to close the VOTE 
thread with a RESULT thread. This should simply state the outcome of VOTE'ing 
(including how many binding VOTE's were received). Finally it should include 
whether the VOTE passed and if the release can be made.
+ 1. In the instance where the VOTE does not pass, the release manager 
should roll back all of the work above as well as '''DROP''' the staging 
artifacts.  
  
  == Making the Release ==
+ 1. head back over to the [[https://repository.apache.org/|staging 
repos]] and '''RELEASE''' them into the wild.
+ 1. Move the artifacts from the release management area at 
https://dist.apache.org/repos/dist/dev/nutch/{release.version} to 
https://dist.apache.org/repos/dist/release/nutch/{release.version} as follows 
{{{svn mv https://dist.apache.org/repos/dist/dev/nutch/{release.version} 
https://dist.apache.org/repos/dist/release/nutch/{release.version} --message 
"Release Apache Nutch 1.X" }}}
1. Wait 24 hours for release to propagate to mirrors.
  1. Add the new release info to the 
[[https://svn.apache.org/repos/asf/nutch/site/publish/doap.rdf|doap.rdf]] file, 
and double check for any other updates that should be made to the doap file as 
well if it hasn't been updated in a while. If this is the case please see 
[[http://projects.apache.org/doap.html|here]]
1. Deploy new Nutch site (according to [[Website_Update_HOWTO]]).
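
If the move step is ever scripted, one way to keep the commit message intact is to pass it as a single argument rather than relying on shell quoting. The sketch below only assembles and prints the argument list. SvnMoveCommand and build are hypothetical names, and the version string passed in is an example value.

```java
import java.util.Arrays;
import java.util.List;

final class SvnMoveCommand {

  static final String DIST = "https://dist.apache.org/repos/dist";

  /** Assembles the svn mv invocation from the HOWTO as an argument list. */
  static List<String> build(String releaseVersion) {
    return Arrays.asList(
        "svn", "mv",
        DIST + "/dev/nutch/" + releaseVersion,
        DIST + "/release/nutch/" + releaseVersion,
        "--message", "Release Apache Nutch " + releaseVersion);
  }

  public static void main(String[] args) {
    // Printing only. To execute, the list could be handed to
    // new ProcessBuilder(command), which passes each element as one argument.
    System.out.println(String.join(" ", build("1.8")));
  }
}
```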


[Nutch Wiki] Trivial Update of Release_HOWTO by LewisJohnMcgibbney

2014-03-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Release_HOWTO page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=31&rev2=32

  
  == Making the Release ==
  1. head back over to the [[https://repository.apache.org/|staging 
repos]] and '''RELEASE''' them into the wild.
- 1. Move the artifacts from the release management area at 
https://dist.apache.org/repos/dist/dev/nutch/{release.version} to 
https://dist.apache.org/repos/dist/release/nutch/{release.version} as follows 
{{{svn mv https://dist.apache.org/repos/dist/dev/nutch/$release.version 
https://dist.apache.org/repos/dist/release/nutch/$release.version --message 
"Release Apache Nutch $release.version" }}}
+ 1. Move the artifacts from the release management area at 
https://dist.apache.org/repos/dist/dev/nutch/$release.version to 
https://dist.apache.org/repos/dist/release/nutch/$release.version as follows 
{{{svn mv https://dist.apache.org/repos/dist/dev/nutch/$release.version 
https://dist.apache.org/repos/dist/release/nutch/$release.version --message 
"Release Apache Nutch $release.version" }}}
1. Wait 24 hours for release to propagate to mirrors.
  1. Add the new release info to the 
[[https://svn.apache.org/repos/asf/nutch/site/publish/doap.rdf|doap.rdf]] file, 
and double check for any other updates that should be made to the doap file as 
well if it hasn't been updated in a while. If this is the case please see 
[[http://projects.apache.org/doap.html|here]]
1. Deploy new Nutch site (according to [[Website_Update_HOWTO]]).


[jira] [Comment Edited] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936103#comment-13936103
 ] 

ysc edited comment on NUTCH-1736 at 3/17/14 3:05 AM:
-

problem:
 
fetching: 
http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
unzipBestEffort returned null

detail:

2014-03-12 16:48:38,031 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
at 
org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:164)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - fetch of 
http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html 
failed with: java.io.IOException: unzipBestEffort returned null
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=0

solution:

this patch deals with the http response header Transfer-Encoding:chunked

important tip: 

the property http.content.limit in nutch-site.xml must be greater than 0

why must it be greater than 0?

if the property http.content.limit in nutch-site.xml is negative or 0, chunkLen 
becomes negative or 0 too. See the code below, which you can find at line 277 of 
the Java source file 
http://svn.apache.org/repos/asf/nutch/tags/release-1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

  if ( (contentBytesRead + chunkLen) > http.getMaxContent() )
    chunkLen= http.getMaxContent() - contentBytesRead;

reading one chunk has a condition: 

while (chunkBytesRead < chunkLen)

so the property http.content.limit in nutch-site.xml must be greater than 0
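
The effect can be reproduced in isolation. The following sketch is not the Nutch source: ChunkClampDemo, clampChunkLen and bytesLoopWouldRead are hypothetical names, and only the two quoted statements are modeled, with plain ints in place of http.getMaxContent().

```java
final class ChunkClampDemo {

  /** Mirrors the quoted clamp: shorten the chunk so the total stays within maxContent. */
  static int clampChunkLen(int contentBytesRead, int chunkLen, int maxContent) {
    if ((contentBytesRead + chunkLen) > maxContent) {
      chunkLen = maxContent - contentBytesRead;
    }
    return chunkLen;
  }

  /** Counts how many bytes the quoted loop condition would allow to be read. */
  static int bytesLoopWouldRead(int chunkLen) {
    int chunkBytesRead = 0;
    while (chunkBytesRead < chunkLen) {
      chunkBytesRead++; // stands in for reading one byte of the chunk
    }
    return chunkBytesRead;
  }

  public static void main(String[] args) {
    // Positive limit with room left: the chunk length is unchanged.
    System.out.println(clampChunkLen(0, 4096, 65536));

    // Limit of -1: the clamp yields -1 - 0 = -1, so the read loop never runs.
    int clamped = clampChunkLen(0, 4096, -1);
    System.out.println(clamped + " -> " + bytesLoopWouldRead(clamped));
  }
}
```

With a limit of 0 or below, every chunk is clamped to a non-positive length and no chunk data is read at all.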


was (Author: yangshangchuan):
problem:
 
fetching: 
http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
unzipBestEffort returned null

detail:

2014-03-12 16:48:38,031 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
at 
org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:164)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - fetch of 
http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html 
failed with: java.io.IOException: unzipBestEffort returned null
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=0

solution:

this patch deals with the http response header Transfer-Encoding:chunked

important tip: 

the property http.content.limit in nutch-site.xml must be greater than 0

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937425#comment-13937425
 ] 

ysc commented on NUTCH-1736:


Hi Sebastian, I have modified the previous comment and added some explanation.

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937426#comment-13937426
 ] 

lufeng commented on NUTCH-1736:
---

Hi ysc

you can check the content size to fix this issue like this. 

{code:java}
if (http.getMaxContent() >= 0 && (contentBytesRead + chunkLen) > http.getMaxContent())
  chunkLen = http.getMaxContent() - contentBytesRead;
{code}
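
Wrapped in a method, the guarded check can be tried on its own. This is a sketch only: GuardedChunkClamp and clamp are hypothetical names, and reading a negative http.content.limit as "no limit" is an assumption based on this thread.

```java
final class GuardedChunkClamp {

  /**
   * Shortens the chunk only when a non-negative limit is configured
   * and reading the whole chunk would exceed it.
   */
  static int clamp(int contentBytesRead, int chunkLen, int maxContent) {
    if (maxContent >= 0 && (contentBytesRead + chunkLen) > maxContent) {
      chunkLen = maxContent - contentBytesRead;
    }
    return chunkLen;
  }

  public static void main(String[] args) {
    System.out.println(clamp(0, 4096, -1));        // negative limit: chunk left as is
    System.out.println(clamp(60000, 8000, 65536)); // limit reached mid-chunk: shortened
  }
}
```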

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch, nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Updated] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1736:
---

Attachment: (was: nutch1.7.patch)

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch-2.2.1.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Updated] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1736:
---

Attachment: nutch1.7.patch

Nutch 1.x can be fixed using nutch1.7.patch

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch1.7.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Updated] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1736:
---

Attachment: nutch2.2.1.patch

Nutch 2.x can be fixed using nutch2.2.1.patch

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch1.7.patch, nutch2.2.1.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null





[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked

2014-03-16 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937438#comment-13937438
 ] 

ysc commented on NUTCH-1736:


Hi lufeng, thanks, this is a good idea. I have modified the patch files.

 Can't fetch page if http response header contains Transfer-Encoding:chunked
 ---

 Key: NUTCH-1736
 URL: https://issues.apache.org/jira/browse/NUTCH-1736
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
 Fix For: 2.3, 1.9

 Attachments: nutch1.7.patch, nutch2.2.1.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 fetching: 
 http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
 Fetch failed with protocol status: EXCEPTION: java.io.IOException: 
 unzipBestEffort returned null


