[ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney resolved NUTCH-1963.
-----------------------------------------
    Resolution: Fixed
      Assignee: Giuseppe Totaro

Addressed within NUTCH-1959
Thank you [~gostep]

> CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1963
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1963
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Giuseppe Totaro
>             Fix For: 1.10
>
>
> When invoking the commoncrawldump tool with the *-gzip* option and *-mimtype
> application/pdf* I get the following stack trace which results in a failure
> of the task
> {code}
> java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
>         at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
>         at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
>         at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
>         at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
> {code}
> The workaround consists of not using the *-gzip* option, instead delaying
> this until a later task, however this is a workaround and not a solution.
> We need to fix this in order for the tool to work as designed and required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
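
For readers hitting the same exception: the name field in a classic tar header is 100 bytes wide, and the URL-encoded file name in the stack trace above does not fit in it. The sketch below only illustrates that length check using the JDK; the class and method names (TarNameCheck, encodedLength, needsLongNameHandling) are hypothetical and are not taken from Nutch or commons-compress, and the UTF-8 encoding and the exact boundary are assumptions. This message does not say how NUTCH-1959 addressed the problem. One option commons-compress itself offers is calling setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU) or setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX) on the TarArchiveOutputStream before adding entries.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Nutch or commons-compress.
class TarNameCheck {
    // Width in bytes of the name field in a classic tar header.
    static final int TAR_NAME_FIELD_BYTES = 100;

    // Length of the entry name once encoded. UTF-8 is an assumption here.
    static int encodedLength(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the name does not fit the fixed-size field. Treating a name of
    // exactly 100 bytes as too long is a conservative assumption of this sketch.
    static boolean needsLongNameHandling(String entryName) {
        return encodedLength(entryName) >= TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        // File name copied from the stack trace in the issue.
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(encodedLength(name) + " bytes, long-name handling needed: "
                + needsLongNameHandling(name));
    }
}
```

A check like this could run before each entry is written, so that over-long names are logged, shortened, or routed through a long-name mode instead of failing the whole task.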