[jira] [Comment Edited] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
 ] 

Hong-Thai Nguyen edited comment on TIKA-1446 at 11/12/14 2:38 PM:
--

Hi [~binhawking], I've merged your contribution and compared the extracted 
titles before and after the merge on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. 
I've pushed the failing CHM files under _test-documents/chm_ and a test case 
that checks them to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?


was (Author: thaichat04):
Hi [~binhawking], I've merge your pull request and make title comparison 
before/after on a local corpus of CHM files.
Before merge, we have only one failed file, after merge we have 10 failed 
files. I've pushed failed CHM files under _test-documents/chm_  a checking 
test case into: https://github.com/thaichat04/tika
I made also some clean-up.

Any chance you have a look again ?

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled text 
 or empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, the 
 align tree and the length tree. Some of the trees are also built incorrectly.
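
 For readers unfamiliar with the format, here is a minimal sketch of how the LZX 
 format used by CHM splits the extra match-offset bits in an aligned block: the 
 high bits are read raw from the stream and the low 3 bits come from the align 
 tree. This is not the Tika code and not the patch discussed here; the class 
 name, the method and both functional parameters are hypothetical stand-ins for 
 a bit reader and a Huffman decoder.

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

/** Hypothetical illustration only; not Tika's decompressAlignedBlock(). */
class AlignedOffsetSketch {

    /**
     * Returns the value added to the position base for one match
     * inside an aligned-offset block.
     *
     * @param extraBits     number of extra bits for the decoded position slot
     * @param readBits      assumed helper: reads n raw bits from the stream
     * @param alignedSymbol assumed helper: decodes one symbol (0..7) with the align tree
     */
    static int extraOffset(int extraBits, IntUnaryOperator readBits, IntSupplier alignedSymbol) {
        if (extraBits > 3) {
            // High bits are raw, low 3 bits are Huffman-coded with the align tree.
            int high = readBits.applyAsInt(extraBits - 3) << 3;
            return high + alignedSymbol.getAsInt();
        } else if (extraBits == 3) {
            // All 3 extra bits come from the align tree.
            return alignedSymbol.getAsInt();
        } else if (extraBits > 0) {
            // Too few bits to align: read them raw, as in a verbatim block.
            return readBits.applyAsInt(extraBits);
        }
        return 0; // no extra bits for this slot
    }
}
```

 In this scheme the main tree decodes the literal or match header and the length 
 tree decodes long match lengths, so using one tree where another is expected 
 yields the kind of garbled output described above.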



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-25 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184256#comment-14184256
 ] 

Bin Hawking edited comment on TIKA-1446 at 10/25/14 8:27 PM:
-

Hong-Thai Nguyen and all, 

I have 
1. added a test case to validate the HTML files (including hhk and hhc) 
extracted from testChm*; 
2. removed a try..catch {} that hid some defects; 
3. made the code throw exceptions when something overflows or does not work, 
instead of returning from the method silently.
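
To illustrate the difference behind points 2 and 3, here is a minimal sketch that 
contrasts a silent return with an explicit exception. It is not taken from the 
Tika sources; the class and method names are hypothetical, and IOException 
stands in for whatever exception type the parser actually uses.

```java
import java.io.IOException;

/** Hypothetical illustration only; not code from the CHM parser. */
class BoundsCheckSketch {

    /** Silent variant: an out-of-range write is skipped and the caller never finds out. */
    static void copySilently(byte[] src, byte[] dst, int dstPos) {
        if (dstPos < 0 || dstPos > dst.length - src.length) {
            return; // defect hidden: the output stays incomplete
        }
        System.arraycopy(src, 0, dst, dstPos, src.length);
    }

    /** Strict variant: the same condition is reported to the caller. */
    static void copyOrThrow(byte[] src, byte[] dst, int dstPos) throws IOException {
        if (dstPos < 0 || dstPos > dst.length - src.length) {
            throw new IOException("write of " + src.length + " bytes at " + dstPos
                    + " overflows buffer of " + dst.length + " bytes");
        }
        System.arraycopy(src, 0, dst, dstPos, src.length);
    }
}
```

With the strict variant a unit test fails at the point of the overflow, rather 
than much later when the extracted text turns out to be empty or wrong.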

After that I found more bugs and fixed them; all unit tests now pass.

I will push my revision to GitHub later and drop a message here once it is done. 


was (Author: binhawking):
Hong-Thai Nguyen and all, 

I have 
1. added a test case to validate HTMLs (hhk and hhc also) extracted from 
testChm*; 
2. removed a try..catch {} which hides some defects; 
3. throw an exception when something overflow or does not work, instead of 
quitting the method silently.

Then I found more bugs and fixed them, all unit tests passed.

I will push my revision to github later. Once done I will drop a message here. 
