[jira] Updated: (TIKA-423) Parse docx and output to text file missing words

2010-05-16 Thread David Tran (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Tran updated TIKA-423: Attachment: tika_test.docx output.txt Test docx file, and the output produced by Tika 0.7

[jira] Created: (TIKA-423) Parse docx and output to text file missing words

2010-05-16 Thread David Tran (JIRA)
Parse docx and output to text file missing words Key: TIKA-423 URL: https://issues.apache.org/jira/browse/TIKA-423 Project: Tika Issue Type: Bug Components: parser Affects Versio

[jira] Updated: (TIKA-423) Parse docx and output to text file missing words

2010-05-16 Thread David Tran (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Tran updated TIKA-423: Description: I created a word document using Word 2007 on a Windows Server 2003 machine (using Remote deskto

[jira] Commented: (TIKA-420) [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages

2010-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867989#action_12867989 ] Ken Krugler commented on TIKA-420: -- Also, do you have a small set of HTML documents that cou

[jira] Commented: (TIKA-420) [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages

2010-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867988#action_12867988 ] Ken Krugler commented on TIKA-420: -- Hi Christian, I took a look at the patch just now. I'd