[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357147#comment-14357147
 ] 

Renxia Wang commented on NUTCH-1957:
------------------------------------

Hi Giuseppe,

About the latter way, is it possible that the URL contains special characters 
that cannot be used as part of a path/filename? If not, this way should work. 
However, it may complicate downstream processing, as the user would have to 
traverse all the paths to get the files, e.g. when posting the dump data to Solr. 

I am thinking of computing the MD5 of the file content and appending it to the 
file basename, before the extension, like <basename>-<MD5>.<extension>. 
Currently, the FileDumper uses the full path to the output file to calculate 
the MD5, but as the files are all stored in the same dir, the full path (and 
hence the MD5) can be the same for different URLs, which still causes file name 
collisions. We may need to use the MD5 of the file content instead. 
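
To make the naming idea concrete, here is a rough sketch of how 
<basename>-<MD5>.<extension> could be built from the content bytes. This is 
not the FileDumper code: the class name DumpFileNames and both method names 
are made up for illustration, and it uses only the JDK (java.security) rather 
than whatever hashing helper FileDumper currently relies on.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: builds <basename>-<MD5>.<extension>
// where the MD5 is taken over the file content instead of the output path.
public class DumpFileNames {

  /** Returns the lowercase hex MD5 digest of the given content bytes. */
  public static String md5Hex(byte[] content) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder sb = new StringBuilder(32);
      for (byte b : md.digest(content)) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide MD5.
      throw new IllegalStateException(e);
    }
  }

  /** Joins base name, content digest and (possibly empty) extension. */
  public static String contentBasedName(String baseName, String extension,
      byte[] content) {
    String name = baseName + "-" + md5Hex(content);
    return extension.isEmpty() ? name : name + "." + extension;
  }

  public static void main(String[] args) {
    // Two pages that share the base name "project" but differ in content.
    byte[] first = "<html>page one</html>".getBytes(StandardCharsets.UTF_8);
    byte[] second = "<html>page two</html>".getBytes(StandardCharsets.UTF_8);
    System.out.println(contentBasedName("project", "html", first));
    System.out.println(contentBasedName("project", "html", second));
  }
}
```

With this scheme two URLs only map to the same file name if they share the 
base name, the extension and the exact content, in which case skipping the 
second write loses nothing. 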

As the FileDumper and the CommonCrawlDataDumper store files the same way, we 
can factor this out into a util. 

Thanks,

Renxia

> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Priority: Minor
>              Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses 
> <basename>.<extension> (e.g. given the url 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
> <basename>.<extension> will be project.html) as the file name to dump the 
> file. 
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using 
> bin/nutch dump. 
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
