[ https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hong-Thai Nguyen resolved TIKA-1430.
------------------------------------
    Resolution: Fixed

> CHM parser gets faulty text (fix found)
> ---------------------------------------
>
>                 Key: TIKA-1430
>                 URL: https://issues.apache.org/jira/browse/TIKA-1430
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6
>         Environment: Windows 7; JDK 7 or 8
>            Reporter: Bin Hawking
>            Priority: Critical
>             Fix For: 1.7
>
>
> Get partially wrong text out of a CHM file, including the chm files in
> tika-parsers/src/test/resources/test-documents/testChm*.chm
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
> I checked the source code. The cause is obvious:
> When tika decompresses the LZX, the first block is done well, but as to the
> 2nd block and later on, Tika uses previous content as the compressed data.
> see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
> """
>         if (prevBlock != null
>                 && prevBlock.getState().getBlockLength() > prevBlock
>                         .getState().getBlockRemaining())
>             setChmSection(new ChmSection(prevBlock.getContent()));
>             // NOTE: the dataSegment to be decompressed is not kept
>         else
>             setChmSection(new ChmSection(dataSegment));
> """
> My fix:
> 1. Add a prevcontent member variable in ChmSection class, so that
> dataSegment and prevBlock.getContent() are both kept in it.
> 2. In ChmLzxBlock.extractContent() when invoking decompressXXXXBlock(),
> pass ChmSection.prevcontent if exists, instead of ChmSection.data.
> Now, I tried some chm files, and got the correct looking texts.
> BTW. The unit test should be tougher, as in this case some small text (the
> first block) is decompressed correctly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
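
The idea behind the reported fix (keep the compressed dataSegment and the previous block's output side by side, rather than letting one replace the other) can be illustrated with a small self-contained sketch. This is not Tika's code and not the committed patch: PrevContentSketch, Section and decodeBlock are hypothetical names, and the toy format below (a 0 byte followed by an index copies one byte from the previous output) merely stands in for LZX, whose real decoder is far more involved.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; these are not Tika classes.
final class PrevContentSketch {

    /** Stand-in for a section that keeps both inputs, as the report proposes. */
    static final class Section {
        final byte[] data;        // compressed bytes still to be decoded
        final byte[] prevContent; // previous block's decoded output, or null

        Section(byte[] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /**
     * Toy decoder (NOT LZX): a 0 byte followed by an index byte copies one
     * byte from the previous block's output; any other byte is a literal.
     */
    static byte[] decodeBlock(Section section) {
        byte[] history = section.prevContent != null ? section.prevContent : new byte[0];
        byte[] in = section.data;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == 0 && i + 1 < in.length) {
                int index = in[++i] & 0xFF;
                if (index >= history.length) {
                    throw new IllegalStateException("back-reference outside history");
                }
                out.write(history[index]);
            } else {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] block1 = {'A', 'B'};
        byte[] block2 = {0, 0, 'C', 0, 1};

        byte[] out1 = decodeBlock(new Section(block1, null));
        // Fixed behaviour: block 2's own data is decoded, with block 1's
        // output available as history.
        byte[] out2 = decodeBlock(new Section(block2, out1));
        // Reported bug: block 1's output is used *as* the compressed data,
        // so block 2's real data is never read.
        byte[] wrong = decodeBlock(new Section(out1, null));

        // Prints: AB ACB AB
        System.out.println(new String(out1) + " " + new String(out2) + " " + new String(wrong));
    }
}
```

In this toy the buggy path simply repeats the first block, which also shows why a test covering only a small first block would not catch the problem: a stronger test would need to compare text from a second or later block against known-good output.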