[ https://issues.apache.org/jira/browse/NUTCH-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on NUTCH-2250 started by Chris A. Mattmann.
------------------------------------------------
> CommonCrawlDumper : Invalid format + skipped parts
> ---------------------------------------------------
>
>                 Key: NUTCH-2250
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2250
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: commoncrawl
>    Affects Versions: 1.12
>         Environment: Linux x64
>                      Java 7
>                      Nutch 1.12
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>             Fix For: 1.10
>
>
> The following issues were found with CommonCrawlDumper:
>
> 1. Documents get duplicated in dump files
>
> How to reproduce:
> {code}
> bin/nutch commoncrawldump -segment .../segments -outputDir testdump -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> {code}
> The first file written contains one document.
> The second file includes two documents.
> The third file includes the first three documents, and this grows linearly.
>
> 2. If a segment has many parts (part-00000, part-00001, ...), only the first part (part-00000) is dumped
>
> How to reproduce:
> Create a segment with two parts (part-00000 and part-00001)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)