Awesome. No one complained because CHM is not as popular as PDF, for instance. In any case, thanks for the fix.
On Sun, Sep 28, 2014 at 11:35 AM, Bin Hawking (JIRA) <j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Bin Hawking updated TIKA-1430:
> ------------------------------
> Description:
> Get partially wrong text out of a CHM file, including the chm files in
> tika-parsers/src/test/resources/test-documents/testChm*.chm
>
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
>
> I checked the source code. The cause is obvious:
>
> When tika decompresses the LZX, the first block is done well, but as to
> the 2nd block and later on, Tika uses previous content as the compressed
> data. see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
>
> """
>     if (prevBlock != null
>             && prevBlock.getState().getBlockLength() >
>             prevBlock.getState().getBlockRemaining())
>         setChmSection(new ChmSection(prevBlock.getContent()));
>         // NOTE: the dataSegment to be decompressed is not kept
>     else
>         setChmSection(new ChmSection(dataSegment));
> """
>
> My fix:
> 1. Add a prevcontent member variable in ChmSection class, so that
>    dataSegment and prevBlock.getContent() are both kept in it.
> 2. In ChmLzxBlock.extractContent() when invoking
>    decompressXXXXBlock(), pass ChmSection.prevcontent if exists, instead of
>    ChmSection.data.
>
> Now, I tried some chm files, and got the correct looking texts.
>
> BTW. The unit test should be tougher, as in this case some small text (the
> first block) is decompressed correctly.
>
>
> was:
> Get partially wrong text out of a CHM file, including the chm files in
> tika-parsers/src/test/resources/test-documents/testChm*.chm
>
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
>
> I checked the source code. The cause is obvious:
>
> When tika decompresses the LZX, the first block is done well, but as to
> the 2nd block and later on, Tika uses previous content as the compressed
> data. see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
>
> """
>     if (prevBlock != null
>             && prevBlock.getState().getBlockLength() >
>             prevBlock.getState().getBlockRemaining())
>         setChmSection(new ChmSection(prevBlock.getContent()));
>         // NOTE: the dataSegment to be decompressed is not kept
>     else
>         setChmSection(new ChmSection(dataSegment));
> """
>
> My fix:
> 1. Add a prevcontent member variable in ChmSection class, so that
>    dataSegment and prevBlock.getContent() are both kept in it.
> 2. In ChmLzxBlock.extractContent() when invoking
>    decompressXXXXBlock(), pass ChmSection.prevcontent if exists, instead of
>    ChmSection.data.
>
> Now, I try some chm files, and got the correct texts.
>
> BTW. The unit test should be tougher, as in this case some small text (the
> first block) is decompressed correctly.
>
>
> > CHM parser gets faulty text (fix found)
> > ---------------------------------------
> >
> >                 Key: TIKA-1430
> >                 URL: https://issues.apache.org/jira/browse/TIKA-1430
> >             Project: Tika
> >          Issue Type: Bug
> >          Components: parser
> >    Affects Versions: 1.5, 1.6
> >         Environment: Windows 7; JDK 7 or 8
> >            Reporter: Bin Hawking
> >            Priority: Critical
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
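
For anyone following the thread, here is a minimal sketch of the idea behind the reported fix, as I understand it from the description above. It is not Tika code: the `Section` class is a hypothetical stand-in for ChmSection, the codec is a toy (not LZX), and the branch condition is simplified to "a previous block exists". The point is only that the decoder needs both the compressed segment and the previous block's output, and that replacing one with the other garbles every block after the first.

```java
import java.nio.charset.StandardCharsets;

final class BlockHistorySketch {

    /** Hypothetical stand-in for ChmSection; keeps both inputs the decoder needs. */
    static final class Section {
        final byte[] data;        // compressed bytes of the current block
        final byte[] prevContent; // decompressed output of the previous block, or null

        Section(byte[] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /**
     * Toy codec, NOT LZX: a 0 byte means "copy the byte at the same index
     * from the previous block's output"; any other byte is a literal.
     */
    static byte[] decode(Section s) {
        byte[] out = new byte[s.data.length];
        for (int i = 0; i < s.data.length; i++) {
            byte b = s.data[i];
            boolean canCopy = s.prevContent != null && i < s.prevContent.length;
            out[i] = (b == 0 && canCopy) ? s.prevContent[i] : b;
        }
        return out;
    }

    /** Mirrors the reported branch: previous content replaces the segment, which is lost. */
    static String buggy(byte[] segment, byte[] prev) {
        Section s = (prev != null) ? new Section(prev, null) : new Section(segment, null);
        return text(decode(s));
    }

    /** Mirrors the proposed fix: the segment and the previous content are both kept. */
    static String fixed(byte[] segment, byte[] prev) {
        return text(decode(new Section(segment, prev)));
    }

    static String text(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] first = "abcd".getBytes(StandardCharsets.US_ASCII);
        byte[] second = {0, 0, 'X', 'Y'};

        System.out.println(buggy(first, null));   // abcd (first block is fine either way)
        System.out.println(buggy(second, first)); // abcd (previous text decoded again)
        System.out.println(fixed(second, first)); // abXY
    }
}
```

This also shows why a test that only checks the first block's text would pass against the buggy path, which matches the remark about the unit test needing to be tougher.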