[ https://issues.apache.org/jira/browse/NUTCH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578975#action_12578975 ]
Andrzej Bialecki commented on NUTCH-168:
-----------------------------------------

This branch has End Of Life status. I believe this issue is fixed in recent branches.

> setting http.content.limit to -1 seems to break text parsing on some files
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-168
>                 URL: https://issues.apache.org/jira/browse/NUTCH-168
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.7
>         Environment: Windows 2000
>                      java version "1.4.2_05"
>                      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
>                      Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
>            Reporter: Jerry Russell
>
> Setting http.content.limit to -1 (which is supposed to mean "no limit")
> causes some pages not to be indexed. I have seen this with some PDFs and
> with the URL below in particular.
>
> Steps to reproduce:
> 1) Install a fresh nutch-0.7.
> 2) Configure the URL filters to allow any URL.
> 3) Create a urllist containing only the following URL:
>    http://www.circuitsonline.net/circuits/view/71
> 4) Perform a crawl with a depth of 1.
> 5) Run segread and see that the content is there.
> 6) Change http.content.limit to -1 in nutch-default.xml (see the sketch
>    below this message).
> 7) Repeat the crawl into a new directory.
> 8) Run segread and see that the content is not there.
>
> Contact [EMAIL PROTECTED] for more information.
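For reference, step 6 above toggles the following entry in the Nutch
configuration. This is a minimal sketch of the property as it appears in
nutch-default.xml; the description text is paraphrased rather than copied
from the shipped file, whose default value is a positive byte count, not -1:

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>
        The length limit for downloaded content, in bytes. Content longer
        than this limit is truncated; a negative value is intended to
        disable the limit entirely, which is the setting that triggers
        this bug.
      </description>
    </property>

After changing the value, repeating steps 4-5 against a fresh directory
(something like "bin/nutch crawl urls -dir crawl2 -depth 1", followed by
segread on the new segment) should show the content missing.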