[jira] [Assigned] (TIKA-4204) ChmExtractor unable to decompress file

Tim Allison (Jira) Tue, 27 Feb 2024 12:05:18 -0800


     [ https://issues.apache.org/jira/browse/TIKA-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison reassigned TIKA-4204:
---------------------------------

    Assignee: Tim Allison

> ChmExtractor unable to decompress file
> --------------------------------------
>
>                 Key: TIKA-4204
>                 URL: https://issues.apache.org/jira/browse/TIKA-4204
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 3.0.0-BETA
>         Environment: The file I am trying to parse is attached, the file 
> being found as the content file is "/CSS/ABBContent.css"
>            Reporter: Robert Fromholz
>            Assignee: Tim Allison
>            Priority: Blocker
>         Attachments: 3HAC050917_TRM_RAPID_RW_6-en.chm
>
>
> ChmExtractor fails with error: "TikaException: can't copy beyond array 
> length" when calling extractChmEntry on any non-empty entry. 
> Upon inspection this turns out to be caused by lzxBlockOffset being 
> incorrectly set.
> This is caused by the method ChmExtractor#getIndexOfContent returning the 
> wrong entry.
> This is because ChmCommons#indexOf(List, String) returns the first entry with 
> a name containing the string "Content". The file I am trying to parse 
> contains a file with the name Content.css, which is the entry returned by 
> #indexOf(...), instead of the actual content entry.
> To fix the issue, ChmCommons#indexOf(...) should be more strict in how it 
> detects the content entry.
> According to http://www.russotto.net/chm/chmformat.html, the name of the 
> content entry will always start with "::DataSpace/Storage/", which could be 
> used to restrict the lookup to the correct entry.
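
The lookup problem described in the report can be illustrated with a small standalone sketch. This is not Tika's actual code: the class ContentEntryLookup, both method names, and the operation on a plain list of entry names are hypothetical, and the sample name "::DataSpace/Storage/MSCompressed/Content" is only an illustrative content entry. The extra "ends with /Content" check is an assumption added here, not something the report proposes.

```java
import java.util.List;

// Hypothetical illustration only; not the Tika ChmCommons implementation.
public class ContentEntryLookup {

    // Prefix quoted in the report as the start of every content entry name.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Loose lookup as the report describes it: the first name that merely
    // contains the pattern wins, so "/CSS/ABBContent.css" can match.
    static int looseIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter lookup: the name must start with the storage prefix.
    // Requiring the name to end with "/" + pattern is an extra assumption.
    static int strictIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.endsWith("/" + pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/Content");
        System.out.println(looseIndexOf(names, "Content"));  // prints 0 (the stylesheet)
        System.out.println(strictIndexOf(names, "Content")); // prints 1 (the content entry)
    }
}
```

With the two names above, the loose lookup picks the stylesheet while the stricter one picks the storage entry, which matches the behaviour the report attributes to the wrong lzxBlockOffset.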



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
