Re: [jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)

Oleg Tikhonov Sun, 28 Sep 2014 01:55:54 -0700

Awesome. No one complained because CHM is not as popular as PDF, for
instance.
In any case, thanks for the fix.


On Sun, Sep 28, 2014 at 11:35 AM, Bin Hawking (JIRA) <[email protected]>
wrote:

>
>      [
> https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Bin Hawking updated TIKA-1430:
> ------------------------------
>     Description:
> Get partially wrong text out of a CHM file, including the chm files in
> tika-parsers/src/test/resources/test-documents/testChm*.chm
>
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
>
> I checked the source code. The cause is obvious:
>
> When tika decompresses the LZX, the first block is done well, but as to
> the 2nd block and later on, Tika uses previous content as the compressed
> data. see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
>
> """
>                 if (prevBlock != null
>                         && prevBlock.getState().getBlockLength() > prevBlock
>                                 .getState().getBlockRemaining())
>                     setChmSection(new ChmSection(prevBlock.getContent()));
> //                   NOTE: the dataSegment to be decompressed is not kept
>                 else
>                     setChmSection(new ChmSection(dataSegment));
> """
>
> My fix:
> 1.      Add a prevcontent member variable in ChmSection class, so that
> dataSegment and prevBlock.getContent() are both kept in it.
> 2.      In ChmLzxBlock.extractContent() when invoking
> decompressXXXXBlock(), pass ChmSection.prevcontent if exists, instead of
> ChmSection.data.
>
> Now, I tried some chm files, and got the correct looking texts.
>
> BTW. The unit test should be tougher, as in this case some small text (the
> first block) is decompressed correctly.
>
>
>   was:
> Get partially wrong text out of a CHM file, including the chm files in
> tika-parsers/src/test/resources/test-documents/testChm*.chm
>
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
>
> I checked the source code. The cause is obvious:
>
> When tika decompresses the LZX, the first block is done well, but as to
> the 2nd block and later on, Tika uses previous content as the compressed
> data. see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
>
> """
>                 if (prevBlock != null
>                         && prevBlock.getState().getBlockLength() > prevBlock
>                                 .getState().getBlockRemaining())
>                     setChmSection(new ChmSection(prevBlock.getContent()));
> //                   NOTE: the dataSegment to be decompressed is not kept
>                 else
>                     setChmSection(new ChmSection(dataSegment));
> """
>
> My fix:
> 1.      Add a prevcontent member variable in ChmSection class, so that
> dataSegment and prevBlock.getContent() are both kept in it.
> 2.      In ChmLzxBlock.extractContent() when invoking
> decompressXXXXBlock(), pass ChmSection.prevcontent if exists, instead of
> ChmSection.data.
>
> Now, I try some chm files, and got the correct texts.
>
> BTW. The unit test should be tougher, as in this case some small text (the
> first block) is decompressed correctly.
>
>
>
> > CHM parser gets faulty text (fix found)
> > ---------------------------------------
> >
> >                 Key: TIKA-1430
> >                 URL: https://issues.apache.org/jira/browse/TIKA-1430
> >             Project: Tika
> >          Issue Type: Bug
> >          Components: parser
> >    Affects Versions: 1.5, 1.6
> >         Environment: Windows 7; JDK 7 or 8
> >            Reporter: Bin Hawking
> >            Priority: Critical
> >
> > Get partially wrong text out of a CHM file, including the chm files in
> > tika-parsers/src/test/resources/test-documents/testChm*.chm
> > I tried 1.6 and 1.5. Same bad. I wonder why no one complained before?
> > I checked the source code. The cause is obvious:
> > When tika decompresses the LZX, the first block is done well, but as to
> > the 2nd block and later on, Tika uses previous content as the compressed
> > data. see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
> > """
> >                 if (prevBlock != null
> >                         && prevBlock.getState().getBlockLength() > prevBlock
> >                                 .getState().getBlockRemaining())
> >                     setChmSection(new ChmSection(prevBlock.getContent()));
> > //                   NOTE: the dataSegment to be decompressed is not kept
> >                 else
> >                     setChmSection(new ChmSection(dataSegment));
> > """
> > My fix:
> > 1.    Add a prevcontent member variable in ChmSection class, so that
> > dataSegment and prevBlock.getContent() are both kept in it.
> > 2.    In ChmLzxBlock.extractContent() when invoking
> > decompressXXXXBlock(), pass ChmSection.prevcontent if exists, instead of
> > ChmSection.data.
> > Now, I tried some chm files, and got the correct looking texts.
> > BTW. The unit test should be tougher, as in this case some small text
> > (the first block) is decompressed correctly.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
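
For readers following the quoted fix, here is a toy sketch of the idea: the
section handed to the decoder keeps both the compressed bytes of the current
block and the decompressed output of the previous block, so back-references
can reach into earlier content. This is not Tika's code and not real LZX.
SectionSketch, PrevContentDemo, decode() and the two-token format are all
hypothetical names and simplifications made up for illustration.

```java
/** Hypothetical stand-in for ChmSection; it keeps both inputs side by side. */
final class SectionSketch {
    final byte[] data;        // compressed bytes of the current block
    final String prevContent; // decompressed text of the previous block, null for the first

    SectionSketch(byte[] data, String prevContent) {
        this.data = data;
        this.prevContent = prevContent;
    }
}

final class PrevContentDemo {
    /**
     * Toy decoder (an assumption, not LZX). Two token kinds:
     *   0, c                -> emit the literal character c
     *   1, distance, length -> copy length chars starting distance back in the window
     */
    static String decode(SectionSketch s) {
        StringBuilder window = new StringBuilder();
        // Seed the window with the previous block's output so that
        // back-references may point into it.
        if (s.prevContent != null) {
            window.append(s.prevContent);
        }
        int start = window.length(); // this block's output begins here
        byte[] in = s.data;          // always read tokens from the compressed data
        int i = 0;
        while (i < in.length) {
            if (in[i] == 0) {
                window.append((char) in[i + 1]);
                i += 2;
            } else {
                int distance = in[i + 1];
                int length = in[i + 2];
                for (int k = 0; k < length; k++) {
                    window.append(window.charAt(window.length() - distance));
                }
                i += 3;
            }
        }
        return window.substring(start);
    }

    public static void main(String[] args) {
        // First block: three literals, no previous content.
        String first = decode(new SectionSketch(new byte[] {0, 'a', 0, 'b', 0, 'c'}, null));
        // Second block: copy 3 chars from 3 back (out of the first block), then a literal.
        String second = decode(new SectionSketch(new byte[] {1, 3, 3, 0, 'd'}, first));
        System.out.println(first + " " + second);
    }
}
```

In this toy, constructing the second section from only the previous block's
content (as in the quoted if-branch) would drop the compressed bytes that
still need decoding, which is the kind of loss the quoted NOTE comment points
at. Keeping both fields avoids having to choose between them.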
