Thamme Gowda N created NUTCH-2250:
-------------------------------------

Summary: CommonCrawlDumper : Invalid format + skipped parts
Key: NUTCH-2250
URL: https://issues.apache.org/jira/browse/NUTCH-2250
Project: Nutch
Issue Type: Sub-task
Components: commoncrawl
Affects Versions: 1.12
Environment: Linux x64 Java 7 Nutch 1.12
Reporter: Thamme Gowda N

The following issues were found with CommonCrawlDumper:

1. Documents get duplicated in dump files

How to reproduce:
{code}
bin/nutch commoncrawldump -segment .../segments -outputDir testdump -SimpleDateFormat -epochFilename -jsonArray -reverseKey
{code}
The first file written contains one document, the second file contains two documents, the third file contains the first three documents, and so on: the number of documents per file grows linearly.

2. If a segment has many parts (part-00000, part-00001, ...), only the first part (part-00000) is dumped

How to reproduce:
Create a segment with two parts (part-00000 and part-00001)
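
The two symptoms above can be checked with a small helper script. The following is only a sketch in Python, not part of Nutch: the function names `list_segment_parts` and `looks_cumulative` are hypothetical, and the layout `<segment>/content/part-NNNNN` is an assumption about how the segment is stored on the local filesystem. The per-file document counts have to be obtained separately, since the dump file format is not described here.

```python
from pathlib import Path


def list_segment_parts(segment_dir, subdir="content"):
    """Return the sorted names of all part-* entries under <segment_dir>/<subdir>.

    Assumption: the segment keeps its parts as part-NNNNN entries inside
    a sub-directory such as 'content'. Returns [] if that directory is missing.
    """
    base = Path(segment_dir) / subdir
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if p.name.startswith("part-"))


def looks_cumulative(doc_counts):
    """Return True if the i-th dump file (1-based) holds exactly i documents.

    This is the growth pattern described in item 1: 1, 2, 3, ...
    doc_counts is a list of document counts, one per dump file, in write order.
    """
    return len(doc_counts) > 1 and all(
        count == i for i, count in enumerate(doc_counts, start=1)
    )
```

For item 2, every name returned by `list_segment_parts` would be expected to contribute documents to the dump; if only documents from part-00000 appear, the remaining parts were skipped.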