[ https://issues.apache.org/jira/browse/TIKA-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821697#comment-17821697 ]
Tim Allison commented on TIKA-4204:
-----------------------------------

Ugh. I accidentally pushed to main instead of a dev branch. Sorry.

> ChmExtractor unable to decompress file
> --------------------------------------
>
>                 Key: TIKA-4204
>                 URL: https://issues.apache.org/jira/browse/TIKA-4204
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 3.0.0-BETA
>         Environment: The file I am trying to parse is attached, the file
> being found as the content file is "/CSS/ABBContent.css"
>            Reporter: Robert Fromholz
>            Assignee: Tim Allison
>            Priority: Blocker
>         Attachments: 3HAC050917_TRM_RAPID_RW_6-en.chm
>
>
> ChmExtractor fails with error: "TikaException: can't copy beyond array
> length" when calling extractChmEntry on any non-empty entry.
> Upon inspection this turns out to be caused by lzxBlockOffset being
> incorrectly set.
> This is caused by the method ChmExtractor#getIndexOfContent returning the
> wrong entry.
> This is because ChmCommons#indexOf(List, String) returns the first entry with
> a name containing the string "Content". The file I am trying to parse
> contains a file with the name Content.css, which is the entry returned by
> #indexOf(...), instead of the actual content entry.
> To fix the issue, ChmCommons#indexOf(...) should be more strict in how it
> detects the content entry.
> According to: [http://www.russotto.net/chm/chmformat.html], the name of the
> content entry will always start with "::DataSpace/Storage/", which could be
> used to restrict it to find the correct entry.
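
For illustration only, here is a minimal standalone sketch of the difference between the loose lookup described in the report and the stricter one it proposes. It is not the Tika code and not the committed fix: the class ContentEntryLookup, both method names, and the plain list of entry names are hypothetical stand-ins, and the second sample name is an assumed example of an entry carrying the "::DataSpace/Storage/" prefix quoted above.

```java
import java.util.List;

// Hypothetical stand-in for the lookup discussed in the report; not Tika's API.
class ContentEntryLookup {

    // Prefix the report says the content entry name always starts with.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Loose lookup as described in the report: the first name that merely
    // contains the pattern wins, so an ordinary file can shadow the real entry.
    static int looseIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter lookup suggested in the report: the name must also start
    // with the storage prefix before it is accepted.
    static int strictIndexOf(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample directory listing (assumed names, ordinary file listed first).
        List<String> names = List.of(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/Content");

        System.out.println(looseIndexOf(names, "Content"));  // prints 0
        System.out.println(strictIndexOf(names, "Content")); // prints 1
    }
}
```

With the sample listing, the loose version picks the stylesheet at index 0, which matches the symptom in the report, while the prefix check skips it and returns the storage entry at index 1.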