[ https://issues.apache.org/jira/browse/NUTCH-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann resolved NUTCH-2250.
--------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 1.10)
                   1.12

- merged this into master thanks [~thammegowda] and [~lewismc]!

{noformat}
LMC-053601:nutch1.12 mattmann$ git push -u origin master
Counting objects: 13, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (13/13), 1.84 KiB | 0 bytes/s, done.
Total 13 (delta 8), reused 0 (delta 0)
remote: nutch git commit: Record changes for NUTCH-2250.
remote: nutch git commit: NUTCH-2250 : CommonCrawlDumper : Invalid format and skipped parts
To https://git-wip-us.apache.org/repos/asf/nutch.git
   b62f43f..d6bcefd  master -> master
Branch master set up to track remote branch master from origin.
LMC-053601:nutch1.12 mattmann$
{noformat}

> CommonCrawlDumper : Invalid format + skipped parts
> ---------------------------------------------------
>
>                 Key: NUTCH-2250
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2250
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: commoncrawl
>    Affects Versions: 1.10
>         Environment: Linux x64
>                      Java 7
>                      Nutch 1.12
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>             Fix For: 1.12
>
>
> The following issues are found with CommonCrawlDumper;
> 1. Documents get duplicated in dump files
> How to reproduce
> {code}
> bin/nutch commoncrawldump -segment .../segments -outputDir testdump -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> {code}
> The first ever written will contain 1 document.
> second file includes two documents
> third file includes first three documents and this grows linearly.
> 2. If a segment has many parts (part-00000, part-00001,...) only the first part (part-00000) is being dumped
> How to reproduce ?
> Create segment with two parts (part-00000 and part-00001)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)