[jira] [Resolved] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

Lewis John McGibbney (JIRA) Thu, 23 Apr 2015 16:38:19 -0700

     [ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Lewis John McGibbney resolved NUTCH-1963.
-----------------------------------------
    Resolution: Fixed
      Assignee: Giuseppe Totaro

Addressed within NUTCH-1959
Thank you [~gostep]

> CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1963
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1963
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Giuseppe Totaro
>             Fix For: 1.10
>
>
> When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype 
> application/pdf*, I get the following stack trace, which causes the task to 
> fail:
> {code}
> java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
>       at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
>       at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
>       at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
>       at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
> {code}
> The workaround is to omit the *-gzip* option and defer compression to a 
> later task. That only sidesteps the problem; it does not solve it.
> We need to fix this in order for the tool to work as designed and required.
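
The exception above is raised because the entry name does not fit in the 100-byte name field of a classic tar header. As a minimal sketch of that check, assuming the name is encoded as UTF-8: the class and method names below are hypothetical and are not part of Nutch or Commons Compress, and the library's exact boundary condition may differ from the wording of its error message.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Nutch or Commons Compress code.
class TarNameCheck {
    // Size of the name field in a classic tar header, as cited in the error message.
    static final int TAR_NAME_FIELD_BYTES = 100;

    // Length of the entry name in bytes, assuming UTF-8 encoding.
    static int nameLengthInBytes(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the name is longer than the limit quoted in the error message.
    static boolean exceedsTarNameLimit(String entryName) {
        return nameLengthInBytes(entryName) > TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        // File name taken from the stack trace in this issue.
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(nameLengthInBytes(name) + " bytes, too long: "
                + exceedsTarNameLimit(name));
    }
}
```

Commons Compress lets callers opt in to long-name support via `TarArchiveOutputStream.setLongFileMode(...)` with `LONGFILE_GNU` or `LONGFILE_POSIX`. This message does not say whether the change made under NUTCH-1959 took that route.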



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
