setting http.content.limit to -1 seems to break text parsing on some files
---------------------------------------------------------------------------
         Key: NUTCH-168
         URL: http://issues.apache.org/jira/browse/NUTCH-168
     Project: Nutch
        Type: Bug
  Components: fetcher
    Versions: 0.7
 Environment: Windows 2000
              java version "1.4.2_05"
              Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
              Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
    Reporter: Jerry Russell

Setting http.content.limit to -1 (which is supposed to mean no limit) causes some pages not to be indexed. I have seen this with some PDFs and with this one URL in particular. The steps to reproduce are below:

Reproduce:
1) install a fresh nutch-0.7
2) configure the URL filters to allow any URL
3) create a urllist containing only the following URL: http://www.circuitsonline.net/circuits/view/71
4) perform a crawl with a depth of 1
5) run segread and see that the content is there
6) change http.content.limit to -1 in nutch-default.xml (see the property sketch after this message)
7) repeat the crawl into a new directory
8) run segread and see that the content is not there

Contact [EMAIL PROTECTED] for more information.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
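
For context on step 6: http.content.limit is defined in conf/nutch-default.xml. A minimal sketch of the edited property, assuming the stock 0.7 entry (which ships with a default value of 65536, i.e. a 64 kB truncation limit):

    <!-- conf/nutch-default.xml: the stock entry uses <value>65536</value>.
         Setting it to -1 is documented as "no truncation at all", but is the
         value that triggers the behaviour reported above. -->
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be
      truncated; otherwise, no truncation at all.</description>
    </property>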