[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362208#comment-14362208 ]
Hudson commented on NUTCH-1957:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3017 (See [https://builds.apache.org/job/Nutch-trunk/3017/])
Fix for NUTCH-1957 FileDumper output file name collisions contributed by Renxia Wang this closes #12 (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1666777)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java
* /nutch/trunk/src/java/org/apache/nutch/util/DumpFileUtil.java
* /nutch/trunk/src/test/org/apache/nutch/util/DumpFileUtilTest.java

> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: dumper, filename, tools
>             Fix For: 1.10
>
>         Attachments: NUTCH-1957.patch
>
>
> The FileDumper extracts the base name and extension from each URL and uses
> <basename>.<extension> as the name of the dumped file. For example, given
> the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html,
> the file name will be project.html.
> Code from FileDumper.java:
>     String url = key.toString();
>     String baseName = FilenameUtils.getBaseName(url);
>     String extension = FilenameUtils.getExtension(url);
>     ...
>     String filename = baseName + "." + extension;
> Because different URLs can share the same base name and extension, this
> introduces file name collisions. Files whose names already exist are
> skipped, which leads to data loss when using bin/nutch dump.
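
The collision described above can be reproduced with a small JDK-only sketch. It is not the code from FileDumper.java or from the committed patch: the class DumpNameSketch and the methods naiveName and uniqueName are hypothetical names, naiveName only imitates the basename/extension scheme with plain string operations instead of FilenameUtils, and prefixing an MD5 digest of the full URL is an assumed way to disambiguate names, shown only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DumpNameSketch {

  // Imitates the reported scheme: only the last path segment of the URL
  // contributes to the file name, so the directory part is discarded.
  public static String naiveName(String url) {
    String last = url.substring(url.lastIndexOf('/') + 1);
    int dot = last.lastIndexOf('.');
    String baseName = dot < 0 ? last : last.substring(0, dot);
    String extension = dot < 0 ? "" : last.substring(dot + 1);
    return baseName + "." + extension;
  }

  // Hypothetical disambiguation (an assumption, not the committed fix):
  // prefix the naive name with a hex MD5 digest of the complete URL.
  public static String uniqueName(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex + "_" + naiveName(url);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String a = "https://www.aoncadis.org/contact/Mary%20Albert/project.html";
    String b = "https://www.aoncadis.org/contact/Knut%20Aagaard/project.html";

    // Both URLs map to the same naive name, so the second dump is skipped.
    System.out.println(naiveName(a)); // prints "project.html"
    System.out.println(naiveName(b)); // prints "project.html"

    // With the digest prefix the two names differ.
    System.out.println(uniqueName(a));
    System.out.println(uniqueName(b));
  }
}
```

Hashing the whole URL keeps the original base name readable at the end of the file name while making the result depend on the full path, which is the part the naive scheme throws away.
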
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)