[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357147#comment-14357147 ]
Renxia Wang commented on NUTCH-1957:
------------------------------------

Hi Giuseppe,

About the latter approach: could the URL contain special characters that cannot be used in a path or file name? If not, this approach should work, but it may complicate downstream processing (e.g. posting the dumped data to Solr), because the user would have to traverse all the paths to get the files.

I am thinking of computing the MD5 of the file content and appending it to the end of the file base name, before the extension, like <basename>-<MD5>.<extension>. Currently, the FileDumper uses the full path of the output file to calculate the MD5, but since all files are stored in the same directory, the MD5 may be the same, which still causes file name collisions. We may need to use the MD5 of the file content instead.

Since the FileDumper and the CommonCrawlDataDumper store files the same way, we can make this a util.

Thanks,
Renxia

> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Priority: Minor
>              Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses
> <basename>.<extension> as the name of the dumped file (e.g. given the url
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the
> <basename>.<extension> will be project.html).
> Code from FileDumper.java:
>           String url = key.toString();
>           String baseName = FilenameUtils.getBaseName(url);
>           String extension = FilenameUtils.getExtension(url);
>           ...
>           String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using
> bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
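
For illustration, the <basename>-<MD5>.<extension> naming proposed in the comment could be sketched roughly as follows. This is a minimal sketch using only the JDK, not the actual Nutch code: the class name DumpFileNames and the methods md5Hex and uniqueName are hypothetical, and the base name and extension are assumed to have been extracted already (e.g. with FilenameUtils, as in the FileDumper snippet quoted above).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: builds <basename>-<MD5>.<extension>
// from the MD5 of the file content rather than of the output path.
public class DumpFileNames {

    // Returns the MD5 digest of the content as a 32-character lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Appends the content hash to the base name, before the extension.
    static String uniqueName(String baseName, String extension, byte[] content) {
        String name = baseName + "-" + md5Hex(content);
        return extension.isEmpty() ? name : name + "." + extension;
    }

    public static void main(String[] args) {
        // Two made-up page bodies that would both have been dumped as project.html.
        byte[] first = "contact page A".getBytes(StandardCharsets.UTF_8);
        byte[] second = "contact page B".getBytes(StandardCharsets.UTF_8);
        System.out.println(uniqueName("project", "html", first));
        System.out.println(uniqueName("project", "html", second));
    }
}
```

With this scheme, two URLs that share a base name but serve different content get different file names, while byte-identical content fetched from different URLs still maps to the same name. Whether that de-duplication is acceptable for the dumper is a separate design question.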