[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-03-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346320#comment-14346320
 ] 

Hudson commented on NUTCH-1950:
---

SUCCESS: Integrated in Nutch-trunk #2999 (See 
[https://builds.apache.org/job/Nutch-trunk/2999/])
Fix for NUTCH-1950 File name too long contributed by xzjh
and Chong Li. This closes #9. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1663847)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java


> File name too long when bin/nutch dump
> --------------------------------------
>
>                 Key: NUTCH-1950
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1950
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.10
>            Reporter: Chong Li
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Running bin/nutch dump on version 1.10-trunk fails with an exception saying
> "File name too long". During a crawl, a URL may be longer than 255 bytes, and
> Nutch uses the URL as the file name when saving the file. Such a record can be
> stored in segments, but when the files are dumped to the local file system the
> file name cannot be longer than 255 bytes.
> FileDumper.java needs to be changed to handle this case.
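
The limit described above is counted in bytes, not characters, so a name can be under 255 characters and still be too long. A minimal sketch of such a check follows. The class FileNameLength, its methods, and the hard-coded constant are hypothetical and are not taken from FileDumper.java; the 255-byte figure is the one quoted in the report and is assumed here rather than queried from the file system.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Nutch: checks a candidate file name against a byte limit. */
final class FileNameLength {

  /** Per-name limit quoted in the report; assumed here, not read from the file system. */
  static final int MAX_NAME_BYTES = 255;

  /** Number of bytes the name occupies when encoded as UTF-8. */
  static int utf8Length(String name) {
    return name.getBytes(StandardCharsets.UTF_8).length;
  }

  /** True if the UTF-8 encoding of the name is no longer than MAX_NAME_BYTES. */
  static boolean fitsInFileName(String name) {
    return utf8Length(name) <= MAX_NAME_BYTES;
  }

  public static void main(String[] args) {
    // Candidate names are passed on the command line; each is reported with its byte length.
    for (String name : args) {
      System.out.println(utf8Length(name) + " bytes, fits=" + fitsInFileName(name));
    }
  }
}
```

A caller would fall back to a shortened name only when fitsInFileName returns false.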



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346254#comment-14346254
 ] 

ASF GitHub Bot commented on NUTCH-1950:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/9




[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338684#comment-14338684
 ] 

Sebastian Nagel commented on NUTCH-1950:


Great! For an MD5 calculation, see o.a.hadoop.io.MD5Hash (example usage in 
src/java/org/apache/nutch/crawl/TextMD5Signature.java). Since an MD5 sum should 
guarantee a unique name, why not remove or replace ugly characters in the prefix 
altogether? They may also cause errors if the file system does not allow them. E.g.,
{noformat}
 http://en.wikipedia.org/wiki/$100   ->  http_en_wikipedia_org_wiki_100_d7a09ded039d2833ff602ac9d4cd5a8d
 http://en.wikipedia.org/wiki/100    ->  http_en_wikipedia_org_wiki_100_483a8ae86d3af6b656cdb3ec67753c24
{noformat}
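
A minimal sketch of the naming scheme suggested above: a cleaned, truncated prefix followed by the MD5 of the full URL. It assumes plain java.security.MessageDigest instead of Hadoop's MD5Hash so that it runs without Nutch on the classpath. The class DumpFileNames, its methods, and the 32-character cap are hypothetical names chosen for illustration, and this is not the code that was committed to FileDumper.java.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of Nutch: builds a short, stable dump file name from a URL. */
final class DumpFileNames {

  /** Longest readable prefix kept before the hash (32 is the figure floated in the thread). */
  static final int MAX_PREFIX = 32;

  /** Returns the MD5 digest of the URL's UTF-8 bytes as 32 lowercase hex characters. */
  static String md5Hex(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      return String.format("%032x", new BigInteger(1, digest));
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide MD5, so this is not expected.
      throw new IllegalStateException(e);
    }
  }

  /** Replaces each run of characters outside [A-Za-z0-9] with one underscore, then truncates. */
  static String readablePrefix(String url) {
    String cleaned = url.replaceAll("[^A-Za-z0-9]+", "_");
    return cleaned.length() <= MAX_PREFIX ? cleaned : cleaned.substring(0, MAX_PREFIX);
  }

  /** Readable prefix, an underscore, then the hash of the complete, unmodified URL. */
  static String toFileName(String url) {
    return readablePrefix(url) + "_" + md5Hex(url);
  }

  public static void main(String[] args) {
    System.out.println(toFileName("http://en.wikipedia.org/wiki/$100"));
  }
}
```

Because the hash is taken over the original URL rather than the cleaned prefix, two URLs that clean to the same prefix still get different names, and the same URL gets the same name on every run. The result is at most 65 ASCII characters long, well under the 255-byte limit.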




[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Chong Li (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338211#comment-14338211
 ] 

Chong Li commented on NUTCH-1950:
-

I have thought about that; at first we just wanted every new filename to be 
unique. 

I tried keeping exactly the first 255 characters, and then the first 128 
characters, as the filename, but the result was still not human-readable 
because the URLs contain a lot of random characters, which is why those 
filenames are so long in the first place.

I think it is a good idea to keep the first 32 characters, or just the 
domain name, and then append a unique key. 
Thanks for the advice! I will change my solution!



[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338198#comment-14338198
 ] 

Sebastian Nagel commented on NUTCH-1950:


Is it really a good idea to take the system time as the fall-back file name? One 
could take, e.g., the first 32 characters (for human readability) plus the MD5 of the 
filename/URL: this would make the filename predictable and constant over time.



[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-25 Thread Chong Li (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338066#comment-14338066
 ] 

Chong Li commented on NUTCH-1950:
-

Hello Professor,
My teammate has submitted a pull request to fix the bug.
Also, for the tika-img-similarity project, the Python parser cannot
compute scores for files other than JPEGs, so I wrote a Java class that uses
Tika to calculate the scores for all the other file formats. I will open a
GitHub repo and hope that will help.

Best,
Chong






[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338029#comment-14338029
 ] 

ASF GitHub Bot commented on NUTCH-1950:
---

GitHub user xzjh opened a pull request:

https://github.com/apache/nutch/pull/9

fix for NUTCH-1950 contributed by xzjh

It is the fix for this issue: 
https://issues.apache.org/jira/browse/NUTCH-1950

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xzjh/nutch NUTCH-1950

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/9.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9


commit f77873d621a1c8bae364b08695c6cf8aa25be0e8
Author: xzjh 
Date:   2015-02-26T06:36:01Z

fix for NUTCH-1950 contributed by xzjh






[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337980#comment-14337980
 ] 

Chris A. Mattmann commented on NUTCH-1950:
--

Thanks, Chong. Please attach a patch if you have one. Thank you!
