[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217742#comment-14217742 ]

Hudson commented on TIKA-1446:

SUCCESS: Integrated in tika-trunk-jdk1.6 #302 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/302/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java

CHM parser : wrong decompression of aligned blocks

Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
Attachments: chm.zip

If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. I have fixed this myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, aligned tree, and length tree; some of the trees are also built incorrectly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
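For readers unfamiliar with the aligned blocks mentioned in the issue description: in the LZX format used by CHM files, an aligned-offset block decodes the low three bits of a match offset with the aligned tree, while any higher extra bits are read verbatim from the bit stream. The following is a minimal sketch of that one step, based on the publicly documented LZX format and not on Tika's actual decompressAlignedBlock(); the class name, method name, and the two callbacks standing in for the bit reader and the aligned-tree decoder are all hypothetical.

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch; not taken from the Tika source tree.
final class AlignedOffsetSketch {

    /**
     * Returns the value to add to the position base of a match offset
     * in an LZX aligned-offset block.
     *
     * @param extraBits         number of extra bits for this position slot
     * @param readBits          assumed bit reader: reads n raw bits from the stream
     * @param readAlignedSymbol assumed decoder: reads one symbol (0..7) via the aligned tree
     */
    static int readOffsetFooter(int extraBits,
                                IntUnaryOperator readBits,
                                IntSupplier readAlignedSymbol) {
        if (extraBits > 3) {
            // High bits come verbatim first, then the low 3 bits from the aligned tree.
            int verbatim = readBits.applyAsInt(extraBits - 3);
            int aligned = readAlignedSymbol.getAsInt();
            return (verbatim << 3) + aligned;
        } else if (extraBits == 3) {
            // Exactly 3 extra bits: the aligned tree supplies all of them.
            return readAlignedSymbol.getAsInt();
        } else if (extraBits > 0) {
            // 1 or 2 extra bits: the aligned tree is not used at all.
            return readBits.applyAsInt(extraBits);
        }
        // No extra bits for this slot.
        return 0;
    }
}
```

Mixing up which tree or which bit count applies in each of these branches is the kind of mistake that yields garbled output only for files containing aligned blocks, which matches the symptom reported above.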
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217786#comment-14217786 ]

Hudson commented on TIKA-1446:

SUCCESS: Integrated in tika-trunk-jdk1.7 #322 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/322/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214560#comment-14214560 ]

Hudson commented on TIKA-1446:

SUCCESS: Integrated in tika-trunk-jdk1.7 #318 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/318/])
TIKA-1446 - Revert CRLF on profile language files (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640139)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/be.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/ca.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/eo.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/gl.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/ro.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/sk.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/sl.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/uk.ngp
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212609#comment-14212609 ]

Bin Hawking commented on TIKA-1446:

Hong-Thai Nguyen,

I updated the code; see https://github.com/binhawking/tika/compare/thaichat04:trunk...trunk

After this update, your test case passes. I also added your CHM files to the test file list, but I found that 3 files (IM*.CHM) contain malformed HTML pages without a closing </html> tag, so their tests will fail. This is not a problem.
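The failure mode described in this comment (pages extracted from a CHM file that never close the html element) can be illustrated with a small check. This is only a sketch of what such a validation might look like, assuming the extracted page is available as a String; the class and method names are hypothetical and are not taken from TestChmExtraction.

```java
import java.util.Locale;

// Hypothetical helper; the real Tika test may validate pages differently.
final class HtmlClosingTagCheck {

    /** Returns true if the page contains a closing html tag, ignoring case. */
    static boolean hasClosingHtmlTag(String html) {
        return html.toLowerCase(Locale.ROOT).contains("</html>");
    }
}
```

A check like this cannot distinguish a decompression bug from a source page that was malformed to begin with, which is why the three IM*.CHM files fail it even though the parser output is correct.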
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207922#comment-14207922 ]

Chris A. Mattmann commented on TIKA-1446:

Hi guys, what is the status on this? Is this ready to be merged?
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079 ]

Hong-Thai Nguyen commented on TIKA-1446:

Hi [~binhawking],

I've merged your pull request and compared the extracted titles before and after the merge on a local corpus of CHM files. Before the merge, only one file failed; after the merge, 10 files fail.

I've pushed the failing CHM files under _test-documents/chm_, along with a test case that checks them, to: https://github.com/thaichat04/tika

I also did some clean-up. Any chance you could have another look?
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184464#comment-14184464 ]

Bin Hawking commented on TIKA-1446:

I created a pull request:
https://github.com/thaichat04/tika/pull/1
https://github.com/binhawking/tika
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184256#comment-14184256 ]

Bin Hawking commented on TIKA-1446:

Hong-Thai Nguyen and all,

I have:
1. added a test case to validate the HTML files (and also the hhk and hhc files) extracted from testChm*;
2. removed an empty try..catch {} that was hiding some defects;
3. changed the code to throw an exception when something overflows or does not work, instead of quitting the method silently.

I then found more bugs and fixed them; all unit tests now pass. I will push my revision to GitHub later and drop a message here once that is done.
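Points 2 and 3 above describe a general pattern: validate before writing and raise an exception, so that corrupt input surfaces in tests instead of producing silently truncated output. The following is a minimal sketch of that pattern applied to an LZ-style match copy. It is not Tika's code: the class, the exception type, and the simplified non-circular window are assumptions made for illustration.

```java
// Hypothetical sketch; Tika has its own exception types and a different window layout.
final class FailLoudlySketch {

    /** Hypothetical unchecked exception signalling corrupt compressed data. */
    static final class DecompressionException extends RuntimeException {
        DecompressionException(String message) {
            super(message);
        }
    }

    /**
     * Copies length bytes from (pos - offset) to pos inside window,
     * throwing instead of returning silently when the request is out of range.
     */
    static void copyMatch(byte[] window, int pos, int offset, int length) {
        if (offset <= 0 || offset > pos) {
            throw new DecompressionException(
                    "match offset " + offset + " out of range at position " + pos);
        }
        if (length < 0 || pos + length > window.length) {
            throw new DecompressionException(
                    "match of length " + length + " overflows window at position " + pos);
        }
        // Byte-by-byte copy so that overlapping matches repeat correctly.
        for (int i = 0; i < length; i++) {
            window[pos + i] = window[pos + i - offset];
        }
    }
}
```

With an empty catch block or an early return in place of the throw statements, a bad offset would leave the output buffer partly filled and the caller would have no way to tell.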
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181483#comment-14181483 ]

ASF GitHub Bot commented on TIKA-1446:

GitHub user thaichat04 opened a pull request:

    https://github.com/apache/tika/pull/20

    TIKA-1446

    TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20

commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann mattm...@apache.org
Date: 2014-07-28T00:45:03Z

    [maven-release-plugin] copy for tag 1.6

    git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle dmei...@apache.org
Date: 2014-07-31T18:29:32Z

    TIKA-1381 - Added Lingo24Translator implementation

    git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch n...@apache.org
Date: 2014-08-04T15:41:54Z

    Create a branch for 1.6, to backport the POI upgrade to

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch n...@apache.org
Date: 2014-08-04T15:54:55Z

    Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), ready for inclusion in rc2

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison talli...@apache.org
Date: 2014-08-04T16:51:40Z

    TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) files

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch n...@apache.org
Date: 2014-08-04T21:27:41Z

    Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to match that in Apache POI, upgraded in TIKA-1380

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison talli...@apache.org
Date: 2014-08-05T13:03:05Z

    TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison talli...@apache.org
Date: 2014-08-05T18:17:39Z

    TIKA-1380; fix for null ole.getLabel()

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch n...@apache.org
Date: 2014-08-20T14:16:06Z

    Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no longer required by anything now we are on Java 1.6 TIKA-1380

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann mattm...@apache.org
Date: 2014-08-31T19:36:36Z

    Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann mattm...@apache.org
Date: 2014-08-31T19:44:11Z

    [maven-release-plugin] prepare release 1.6-rc2

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann mattm...@apache.org
Date: 2014-08-31T19:44:17Z

    [maven-release-plugin] prepare for next development iteration

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 13f79535-47bb-0310-9956-ffa450edef68
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518 ]

ASF GitHub Bot commented on TIKA-1446:

Github user thaichat04 closed the pull request at:

    https://github.com/apache/tika/pull/20
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530 ]

Hong-Thai Nguyen commented on TIKA-1446:

Thanks a lot [~binhawking],

I've had a quick look at your fix. Indeed, there are quite a lot of changes. After cleaning up and fixing some minor issues, I broke the CHM tests. We really appreciate your contribution and should continue working to finalize it.

I've created a new pull request based on a branch with your fix plus my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178889#comment-14178889 ]

Bin Hawking commented on TIKA-1446:

The attachment above is my fix; it contains the old and new code from tika-1.6\tika-parsers\src\main\java\org\apache\tika\parser\chm\

Please use diff to see my changes. This fix addresses TIKA-1430, 1446, 1447, and 1448.

NOTE: My fix is not well tested and may be incomplete. Also, because I am adding new features to the CHM parser for my own application, including parsing HHK and HHC files for more metadata, my revisions contain some changes that are not applicable to the original Tika project. Sorry for the inconvenience.