[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217742#comment-14217742
 ] 

Hudson commented on TIKA-1446:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #302 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/302/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java


 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, 
 aligned tree, and length tree. Some of the trees were also built incorrectly.
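
 For readers unfamiliar with aligned blocks, the sketch below illustrates how 
 the extra bits of a match offset are typically split between raw bits and 
 the aligned tree in an LZX aligned-offset block. It is a minimal 
 illustration based on the general LZX scheme, not the code from the 
 attached fix. The class name, method name, and functional parameters are 
 hypothetical and do not correspond to Tika's CHM parser API.

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch; names are illustrative and not taken from Tika. */
public final class AlignedOffsetSketch {

    private AlignedOffsetSketch() {}

    /**
     * Decodes the extra-bits part of a match offset in an aligned-offset block.
     *
     * @param extraBits   number of extra bits for the match's position slot
     * @param readBits    assumed helper: reads n raw bits (n > 0) from the stream
     * @param alignedTree assumed helper: decodes one 3-bit symbol (0..7) with the aligned tree
     * @return the value to add to the position slot's base offset
     */
    public static int decodeOffsetFooter(int extraBits,
                                         IntUnaryOperator readBits,
                                         IntSupplier alignedTree) {
        if (extraBits > 3) {
            // High bits are stored raw; the low 3 bits come from the aligned tree.
            int high = readBits.applyAsInt(extraBits - 3) << 3;
            return high + alignedTree.getAsInt();
        } else if (extraBits == 3) {
            // All 3 bits come from the aligned tree.
            return alignedTree.getAsInt();
        } else if (extraBits > 0) {
            // Too few bits to involve the aligned tree: read them raw.
            return readBits.applyAsInt(extraBits);
        }
        return 0;
    }
}
```

 Using the main tree or length tree where the aligned tree is expected (or 
 the reverse) yields wrong offsets, which is consistent with the garbled 
 output described above.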



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217786#comment-14217786
 ] 

Hudson commented on TIKA-1446:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #322 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/322/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java




[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214560#comment-14214560
 ] 

Hudson commented on TIKA-1446:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #318 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/318/])
TIKA-1446 - Revert CRLF on profile language files (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640139)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/be.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/ca.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/eo.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/gl.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/ro.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/sk.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/sl.ngp
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/language/uk.ngp




[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-14 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212609#comment-14212609
 ] 

Bin Hawking commented on TIKA-1446:
---

Hong-Thai Nguyen,

I updated the code, see 
https://github.com/binhawking/tika/compare/thaichat04:trunk...trunk

After this update, your test case passes. I also added your CHM files to the 
test file list, but I found that 3 files (IM*.CHM) contain bad HTML pages 
without </html>, so their tests will fail. That is not a problem. 



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207922#comment-14207922
 ] 

Chris A. Mattmann commented on TIKA-1446:
-

Hi guys, what is the status on this? Is this ready to be merged?



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Hi [~binhawking], I've merged your pull request and run a before/after title 
comparison on a local corpus of CHM files.
Before the merge, only one file failed. After the merge, 10 files fail. I've 
pushed the failing CHM files under _test-documents/chm_ and a checking test 
case to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-26 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184464#comment-14184464
 ] 

Bin Hawking commented on TIKA-1446:
---

I created a pull request:
https://github.com/thaichat04/tika/pull/1
https://github.com/binhawking/tika



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-25 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184256#comment-14184256
 ] 

Bin Hawking commented on TIKA-1446:
---

Hong-Thai Nguyen and all, 

I have 
1. added a test case to validate the HTML files (and also the hhk and hhc 
files) extracted from testChm*; 
2. removed a try..catch {} that was hiding some defects; 
3. made the code throw an exception when something overflows or does not 
work, instead of quitting the method silently.

After that I found more bugs and fixed them. All unit tests pass.

I will push my revision to github later. Once done I will drop a message here. 
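
The difference between quitting silently and throwing, as in points 2 and 3 
above, can be sketched as follows. This is a minimal illustration under 
assumed names, not the code from the revision: FailFastSketch, copyMatch and 
DecompressionException are hypothetical, and the real parser may use a 
different exception type and different checks.

```java
/** Hypothetical sketch of fail-fast checks during LZ-style match copying. */
public final class FailFastSketch {

    private FailFastSketch() {}

    /** Illustrative exception type; an assumption, not Tika's own class. */
    public static class DecompressionException extends RuntimeException {
        public DecompressionException(String message) {
            super(message);
        }
    }

    /**
     * Copies a match of the given length from (dst - offset) to dst inside the window.
     * Invalid input raises an exception instead of returning quietly, so a
     * corrupted stream or a decoder bug surfaces in tests.
     */
    public static void copyMatch(byte[] window, int dst, int offset, int length) {
        if (dst < 0 || offset <= 0 || offset > dst) {
            throw new DecompressionException(
                    "match offset " + offset + " is invalid at position " + dst);
        }
        if (length < 0 || length > window.length - dst) {
            throw new DecompressionException(
                    "match of length " + length + " overflows the window at position " + dst);
        }
        // Byte-by-byte copy so that overlapping matches repeat earlier output.
        for (int i = 0; i < length; i++) {
            window[dst + i] = window[dst + i - offset];
        }
    }
}
```

With an empty catch block around such a call, a bad offset would leave the 
output buffer partly filled and the failure would go unnoticed.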



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181483#comment-14181483
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20


commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann mattm...@apache.org
Date:   2014-07-28T00:45:03Z

[maven-release-plugin]  copy for tag 1.6

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 
13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle dmei...@apache.org
Date:   2014-07-31T18:29:32Z

TIKA-1381 - Added Lingo24Translator implementation

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 
13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:41:54Z

Create a branch for 1.6, to backport the POI upgrade to

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 
13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:54:55Z

Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), 
ready for inclusion in rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 
13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison talli...@apache.org
Date:   2014-08-04T16:51:40Z

TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) 
files

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 
13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T21:27:41Z

Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to 
match that in Apache POI, upgraded in TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 
13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T13:03:05Z

TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 
13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T18:17:39Z

TIKA-1380; fix for null ole.getLabel()

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 
13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch n...@apache.org
Date:   2014-08-20T14:16:06Z

Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the 
POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no 
longer required by anything now we are on Java 1.6 TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 
13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:36:36Z

Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 
13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:11Z

[maven-release-plugin] prepare release 1.6-rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 
13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:17Z

[maven-release-plugin] prepare for next development iteration

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 
13f79535-47bb-0310-9956-ffa450edef68





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

Github user thaichat04 closed the pull request at:

https://github.com/apache/tika/pull/20




[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks a lot [~binhawking], I've had a quick look at your fix. Indeed, there 
are quite a lot of changes. After cleaning up and fixing some minor issues, I 
broke the CHM tests.

We really appreciate your contribution, and we should continue and finalize 
it. I've created a new pull request based on a branch with your fix + my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446



[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-21 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178889#comment-14178889
 ] 

Bin Hawking commented on TIKA-1446:
---

The attachment above is my fix. It contains the old and new code from 
tika-1.6\tika-parsers\src\main\java\org\apache\tika\parser\chm\

Please use diff to see my changes. 

This fix addresses TIKA-1430, TIKA-1446, TIKA-1447, and TIKA-1448.

NOTE: My fix is not well tested and may be incomplete. Also, because I am 
adding new features to the CHM parser for my own application, including 
parsing HHK and HHC files for more metadata, my revisions contain some 
changes that are not applicable to the original Tika project. Sorry for the 
inconvenience.

