[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346320#comment-14346320 ]

Hudson commented on NUTCH-1950:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2999 (See [https://builds.apache.org/job/Nutch-trunk/2999/])
Fix for NUTCH-1950 File name too long contributed by xzjh and Chong Li. This closes #9. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1663847)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java

> File name too long when bin/nutch dump
> --------------------------------------
>
>                 Key: NUTCH-1950
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1950
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.10
>            Reporter: Chong Li
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When bin/dump in version 1.10-trunk, there will be an exception saying "File
> name too long". When crawling, the length of the url may be longer than 255
> bytes and nutch save the file using the url as file name. It can be saved in
> segments but when dumping the files to local file system, the length of the
> filename can not be longer than 255 bytes.
> The FileDumper.java need to be changed to handle such exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
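
The limit described in the issue is counted in bytes, not characters, so a URL with non-ASCII characters can exceed it while looking shorter than 255 characters. A minimal Java sketch of that check follows. The class and method names are hypothetical and are not taken from FileDumper.java, the 255-byte figure is the one quoted in the issue (the real limit depends on the target file system), and UTF-8 is assumed as the file name encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: checks a candidate file name
// against the 255-byte limit mentioned in the issue description.
final class FileNameLength {
    // Assumed limit, taken from the issue text; real file systems may differ.
    static final int MAX_NAME_BYTES = 255;

    // Number of bytes the name occupies when encoded as UTF-8 (an assumption).
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the name fits within the assumed byte limit.
    static boolean fitsInFileName(String name) {
        return utf8Length(name) <= MAX_NAME_BYTES;
    }
}
```

A dumper could call `FileNameLength.fitsInFileName(name)` before writing and switch to a shortened name when it returns false.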
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346254#comment-14346254 ]

ASF GitHub Bot commented on NUTCH-1950:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/9
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338684#comment-14338684 ]

Sebastian Nagel commented on NUTCH-1950:
----------------------------------------

Great! For an MD5 calculation, see o.a.hadoop.io.MD5Hash (example usage in src/java/org/apache/nutch/crawl/TextMD5Signature.java).

Since an MD5 sum should guarantee a unique name: why not remove or replace the ugly characters in the prefix altogether? They may also cause errors if the file system does not allow them. E.g.,
{noformat}
http://en.wikipedia.org/wiki/$100 -> http_en_wikipedia_org_wiki_100_d7a09ded039d2833ff602ac9d4cd5a8d
http://en.wikipedia.org/wiki/100  -> http_en_wikipedia_org_wiki_100_483a8ae86d3af6b656cdb3ec67753c24
{noformat}
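
The naming scheme suggested in this comment (a short, cleaned-up prefix plus the MD5 of the full URL) can be sketched in Java as follows. This is an illustration under stated assumptions, not the code that was committed for NUTCH-1950: the class and method names are hypothetical, the JDK's `java.security.MessageDigest` is used in place of `o.a.hadoop.io.MD5Hash` so the snippet is self-contained, the digest is taken over the UTF-8 bytes of the URL, and every run of characters other than ASCII letters and digits is replaced by a single underscore. Whether it reproduces the exact hash values shown in the example above has not been checked.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: builds a short, stable file name
// from a URL as "<sanitized prefix>_<md5 of full url>".
final class DumpFileNames {
    // Prefix length kept for human readability, as proposed in the thread.
    static final int PREFIX_LENGTH = 32;

    // Lower-case hex MD5 of the UTF-8 bytes of the input.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Replace each run of non-alphanumeric characters with one underscore,
    // keep at most PREFIX_LENGTH characters, then append the hash.
    static String toFileName(String url) {
        String cleaned = url.replaceAll("[^A-Za-z0-9]+", "_");
        String prefix = cleaned.length() > PREFIX_LENGTH
                ? cleaned.substring(0, PREFIX_LENGTH)
                : cleaned;
        return prefix + "_" + md5Hex(url);
    }
}
```

Because the hash depends only on the URL, the same URL always maps to the same file name, unlike a system-time fall-back. The result is at most 65 ASCII characters (32 for the prefix, one underscore, 32 hex digits), well under the 255-byte limit from the issue.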
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338211#comment-14338211 ]

Chong Li commented on NUTCH-1950:
---------------------------------

I have thought about that. At first we just wanted every new filename to be unique. I tried keeping exactly 255 characters, and then 128 characters, as the filename, but the result was still not human readable because the URL contains a lot of random characters, which is also why those filenames are so long.

I think it is a good idea to keep the first 32 characters, or just the domain name, and then append a unique key. Thanks for the advice! I will change my solution!
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338198#comment-14338198 ]

Sebastian Nagel commented on NUTCH-1950:
----------------------------------------

Is it really a good idea to take the system time as the fall-back file name? We could take, e.g., the first 32 characters (for human readability) plus the MD5 of the filename/URL: this would make the filename predictable and constant over time.
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338066#comment-14338066 ]

Chong Li commented on NUTCH-1950:
---------------------------------

Hello Professor,

My teammate has submitted a pull request to fix the bug.

Also, for the tika-img-similarity project, the Python parser cannot calculate scores for files other than jpgs, so I wrote a Java class that uses Tika to calculate the scores for all the other file formats. I will open a GitHub repo and hope that will help.

Best,
Chong

On Wed, Feb 25, 2015 at 10:35 PM, Chris A. Mattmann (JIRA)
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338029#comment-14338029 ]

ASF GitHub Bot commented on NUTCH-1950:
---------------------------------------

GitHub user xzjh opened a pull request:

    https://github.com/apache/nutch/pull/9

    fix for NUTCH-1950 contributed by xzjh

    It is the fix for this issue: https://issues.apache.org/jira/browse/NUTCH-1950

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xzjh/nutch NUTCH-1950

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/9.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9

commit f77873d621a1c8bae364b08695c6cf8aa25be0e8
Author: xzjh
Date:   2015-02-26T06:36:01Z

    fix for NUTCH-1950 contributed by xzjh
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337980#comment-14337980 ]

Chris A. Mattmann commented on NUTCH-1950:
------------------------------------------

thanks Chong. Please attach a patch if you have one. Thank you!