[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358439#comment-14358439 ]
Renxia Wang commented on NUTCH-1957:
------------------------------------

Hi Sebastian,

Thank you for your suggestions. Based on your comment, I resolved this issue and sent a pull request here:
https://github.com/apache/nutch/pull/12

> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Priority: Minor
>              Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension from the URL and uses
> <basename>.<extension> as the name of the dumped file (e.g. given the URL
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the
> <basename>.<extension> will be project.html).
> Code from FileDumper.java:
>     String url = key.toString();
>     String baseName = FilenameUtils.getBaseName(url);
>     String extension = FilenameUtils.getExtension(url);
>     ...
>     String filename = baseName + "." + extension;
> Because different URLs can share the same <basename>.<extension>, this
> introduces file name collisions and leads to loss of data when using
> bin/nutch dump.
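
To make the collision described above concrete, here is a minimal sketch. It is not the code from the pull request. The class DumpNameSketch and the methods naiveName and hashedName are hypothetical names, plain string handling stands in for FilenameUtils so the example has no dependencies, and the SHA-256 prefix is only one possible way to keep names distinct.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DumpNameSketch {

    // Approximates the naming quoted above: only the last path segment of
    // the URL is kept. (Assumption: simple string handling replaces FilenameUtils.)
    static String naiveName(String url) {
        String last = url.substring(url.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        String baseName = dot < 0 ? last : last.substring(0, dot);
        String extension = dot < 0 ? "" : last.substring(dot + 1);
        return baseName + "." + extension;
    }

    // Hypothetical alternative: prefix the name with the first 8 bytes of a
    // SHA-256 digest of the full URL, so different URLs get different names.
    static String hashedName(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex + "_" + naiveName(url);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.aoncadis.org/contact/Mary%20Albert/project.html",
            "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html"
        };
        for (String url : urls) {
            // The naive name is identical for both URLs; the hashed name differs.
            System.out.println(naiveName(url) + "  vs  " + hashedName(url));
        }
    }
}
```

Both URLs map to project.html under the naive scheme, which is why the second write is skipped. Prefixing a digest of the full URL keeps the readable name while making collisions very unlikely.
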
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)