[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)

Hong-Thai Nguyen (JIRA) Mon, 24 Nov 2014 01:10:39 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hong-Thai Nguyen resolved TIKA-1430.
------------------------------------
    Resolution: Fixed

> CHM parser gets faulty text (fix found)
> ---------------------------------------
>
>                 Key: TIKA-1430
>                 URL: https://issues.apache.org/jira/browse/TIKA-1430
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6
>         Environment: Windows 7; JDK 7 or 8
>            Reporter: Bin Hawking
>            Priority: Critical
>             Fix For: 1.7
>
>
> Tika extracts partially wrong text from CHM files, including the test files in 
> tika-parsers/src/test/resources/test-documents/testChm*.chm
> I tried 1.6 and 1.5; both show the same problem. I wonder why no one has 
> complained before. 
> I checked the source code, and the cause is clear: 
> when Tika decompresses the LZX data, the first block is decoded correctly, but 
> for the second and later blocks Tika uses the previous block's content as the 
> compressed data. 
> See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
> """
>                 if (prevBlock != null
>                         && prevBlock.getState().getBlockLength() > prevBlock
>                                 .getState().getBlockRemaining())
>                     setChmSection(new ChmSection(prevBlock.getContent()));
> //                   NOTE: the dataSegment to be decompressed is not kept
>                 else
>                     setChmSection(new ChmSection(dataSegment));
> """
> My fix:
> 1.    Add a prevcontent member variable to the ChmSection class, so that 
> dataSegment and prevBlock.getContent() are both kept in it.
> 2.    In ChmLzxBlock.extractContent(), when invoking decompressXXXXBlock(), 
> pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
> With this change, the CHM files I tried produced correct-looking text. 
> BTW, the unit test should be tougher: in the current test only a small text 
> (the first block) is checked, and that block is decompressed correctly, so the 
> bug goes unnoticed.
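
The idea behind the two-step fix can be illustrated with a small, self-contained
sketch. This is not Tika's code: the class LzxSectionSketch, the nested Section
class, the methods buggySection/fixedSection, and the toy token format are all
hypothetical stand-ins, and the decoder is a simplified back-reference scheme
rather than real LZX. It only shows why a section needs to carry both the
compressed data and the previous block's output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LzxSectionSketch {

    /** Hypothetical stand-in for ChmSection: compressed input plus previous output. */
    static final class Section {
        final byte[] data;        // compressed bytes still to be decoded
        final byte[] prevContent; // decompressed output of the previous block, or null

        Section(byte[] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /** Mirrors the quoted selection logic: once a previous block exists, dataSegment is dropped. */
    static Section buggySection(byte[] dataSegment, byte[] prevContent) {
        return new Section(prevContent != null ? prevContent : dataSegment, null);
    }

    /** Mirrors step 1 of the proposed fix: keep both inputs in the section. */
    static Section fixedSection(byte[] dataSegment, byte[] prevContent) {
        return new Section(dataSegment, prevContent);
    }

    /**
     * Toy decoder (an assumption for illustration, not LZX). Input is a list of
     * two-byte tokens: (0, b) emits the literal byte b; (n, d) with n > 0 copies
     * n bytes starting d bytes back in the history, where the history is the
     * previous block's output followed by the output produced so far.
     */
    static byte[] decode(Section s) {
        int prevLen = s.prevContent == null ? 0 : s.prevContent.length;
        ByteArrayOutputStream window = new ByteArrayOutputStream();
        if (prevLen > 0) {
            window.write(s.prevContent, 0, prevLen); // seed history with previous output
        }
        for (int i = 0; i + 1 < s.data.length; i += 2) {
            int n = s.data[i] & 0xFF;
            int v = s.data[i + 1] & 0xFF;
            if (n == 0) {
                window.write(v); // literal byte
            } else {
                for (int k = 0; k < n; k++) {
                    byte[] w = window.toByteArray();
                    window.write(w[w.length - v]); // back-reference into history
                }
            }
        }
        byte[] all = window.toByteArray();
        return Arrays.copyOfRange(all, prevLen, all.length); // drop the seeded history
    }

    public static void main(String[] args) {
        byte[] block1 = {0, 'a', 0, 'b', 0, 'c'}; // three literals
        byte[] block2 = {3, 3, 0, 'd'};           // copy 3 bytes from 3 back, then 'd'

        byte[] out1 = decode(fixedSection(block1, null));
        byte[] out2 = decode(fixedSection(block2, out1));
        System.out.println(new String(out1, StandardCharsets.US_ASCII));
        System.out.println(new String(out2, StandardCharsets.US_ASCII));

        // The buggy variant would hand out1 to the decoder as if it were compressed input.
        System.out.println(buggySection(block2, out1).data == out1);
    }
}
```

In this sketch the second block can only be decoded when the section carries
block2 as input and the first block's output as history; the buggy selection
keeps only the latter, which corresponds to the behaviour described in the
report.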



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
