[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362208#comment-14362208
 ] 

Hudson commented on NUTCH-1957:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3017 (See 
[https://builds.apache.org/job/Nutch-trunk/3017/])
Fix for NUTCH-1957 FileDumper output file name collisions contributed by Renxia 
Wang this closes #12 (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1666777)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java
* /nutch/trunk/src/java/org/apache/nutch/util/DumpFileUtil.java
* /nutch/trunk/src/test/org/apache/nutch/util/DumpFileUtilTest.java


> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: dumper, filename, tools
>             Fix For: 1.10
>
>         Attachments: NUTCH-1957.patch
>
>
> The FileDumper extracts the file base name and extension from each URL and 
> uses <basename>.<extension> as the name of the dumped file. For example, 
> given the URL 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
> <basename>.<extension> is project.html. 
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
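> This causes file name collisions: distinct URLs that end in the same path 
> segment map to the same file name, and every file after the first is 
> skipped. The result is data loss when using bin/nutch dump. 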
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html
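
The collision described in the issue can be reproduced with a small, 
self-contained sketch. Everything below is illustrative: the class 
DumpNameSketch and the methods naiveName and hashedName are hypothetical 
names, naiveName is a plain-JDK stand-in for the FilenameUtils calls quoted 
above (it does not reproduce the elided "..." step), and the hash prefix is 
only one possible way to disambiguate names. It is not the committed patch, 
whose contents are not shown in this message.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DumpNameSketch {

    // Builds <basename>.<extension> from the last path segment only,
    // which is the naming scheme the issue describes.
    static String naiveName(String url) {
        String last = url.substring(url.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        String baseName = dot < 0 ? last : last.substring(0, dot);
        String extension = dot < 0 ? "" : last.substring(dot + 1);
        return baseName + "." + extension;
    }

    // Assumption: prefix the naive name with a hex SHA-256 digest of the
    // full URL, so that different URLs yield different file names.
    static String hashedName(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex + "_" + naiveName(url);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "https://www.aoncadis.org/contact/Mary%20Albert/project.html";
        String b = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
        System.out.println(naiveName(a).equals(naiveName(b)));   // true: both are project.html
        System.out.println(hashedName(a).equals(hashedName(b))); // false: digests differ
    }
}
```

Hashing the full URL keeps the original base name readable as a suffix while 
making the overall name depend on the whole URL rather than on its last 
segment alone.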



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
