[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120648#comment-13120648 ]

Mark Kerzner commented on TIKA-623:
---

Hi, everybody,

I have forked Richard Johnson's java-libpst project here on GitHub: https://github.com/markkerzner/JavaLibpst. My reasons for doing this are as follows:

1. I need java-libpst parsing capabilities for my FreeEed project https://github.com/markkerzner/FreeEed
2. I want it in Maven, for FreeEed's purposes, and later on I would be happy to see it included in Tika, which also needs it in Maven.
3. I want it in active development, and Richard told me that he has less time for it than before.
4. By no means do I want to take the glory or the project away from Richard, but it is one of the keys for FreeEed's adoption in Windows.

I am in touch with Richard on all that, but I want the community's feedback. Should I continue? Should I bring it into some Maven repository? I have been working with Carl Byington and know his libpst somewhat, so that additional qualification should help.

Therefore, please, how am I to proceed?

Thank you.

> Add support for Outlook PST
> ---
>
>     Key: TIKA-623
>     URL: https://issues.apache.org/jira/browse/TIKA-623
>     Project: Tika
>     Issue Type: New Feature
>     Components: parser
>     Reporter: Tran Nam Quang
>     Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST
> file. There's a relatively new Java library called java-libpst for reading
> Outlook PST files. It is licensed under the LGPL and available over here:
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1311#comment-1311 ]

Michael McCandless commented on TIKA-733:
---

Thank you Jeremy! Keep the patches coming ;)

> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> ---
>
>     Key: TIKA-733
>     URL: https://issues.apache.org/jira/browse/TIKA-733
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.0
>     Reporter: Jeremy Anderson
>     Assignee: Michael McCandless
>     Labels: patch
>     Fix For: 1.0
>     Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempts to perform a removeLast() on the
> groupStates() list when the list is empty. Added a check that skips the
> logic when the list is empty, so the group state is not restored in that
> case. Text extraction now completes without further downstream errors.
> Unable to include a sample file due to the sensitive nature of the file contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
>     at java.util.LinkedList.remove(LinkedList.java:788)
>     at java.util.LinkedList.removeLast(LinkedList.java:144)
>     at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
>     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 45 more
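The guard described in TIKA-733 can be illustrated with a small, self-contained sketch. This is not the attached patch or Tika's actual TextExtractor code: the class name GroupStateGuard and the method restoreOrKeep are hypothetical, and the group state is modelled as a plain String. It only shows the general idea of checking for an empty list before calling removeLast().

```java
import java.util.Deque;
import java.util.LinkedList;

// Hypothetical stand-in for the RTF extractor's group handling; not Tika's code.
class GroupStateGuard {

    // Returns the saved state popped from the stack when one exists.
    // When the stack is empty (an unbalanced group end in the input),
    // the pop is skipped and the current state is kept, which avoids
    // the NoSuchElementException thrown by LinkedList.removeLast().
    static <T> T restoreOrKeep(Deque<T> groupStates, T current) {
        if (groupStates.isEmpty()) {
            return current;
        }
        return groupStates.removeLast();
    }

    public static void main(String[] args) {
        Deque<String> states = new LinkedList<>();
        states.addLast("outer");
        // First group end: a saved state exists, so it is restored.
        System.out.println(restoreOrKeep(states, "inner"));
        // Second group end: the stack is empty, so the current state is kept.
        System.out.println(restoreOrKeep(states, "inner"));
    }
}
```

One consequence of this design, as the issue description notes, is that the state restore is silently skipped for the unbalanced group end, so extraction continues with whatever state was active.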
[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker
[ https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-742:
---

    Attachment: TIKA-742.patch

Patch.

> PDF2XHTML fails to insert <p> nor space around page marker
> ---
>
>     Key: TIKA-742
>     URL: https://issues.apache.org/jira/browse/TIKA-742
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Reporter: Michael McCandless
>     Assignee: Michael McCandless
>     Fix For: 1.0
>     Attachments: 86.pdf, TIKA-742.patch
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (<p> nor space) before the next
> word. So I have words like:
> * 1Massachusetts
> * 2Course
> * 3also
> * 4The
> But then when I ran the ExtractText -html command-line from PDFBox, I
> can see that <p> is inserted after these page numbers (spookily, not
> closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML, to have it override the
> writeStart/EndParagraph, and call handler.start/EndElement("p"), i.e. to
> preserve the paragraph structure that PDFBOX detects out to the
> resulting XHTML handler, and this fixes the issue (I now see the page
> number as a separate paragraph, rendered w/ newline in "text" mode
> from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).
[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker
[ https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-742:
---

    Attachment: 86.pdf

PDF doc showing the issue (unfortunately not committable).

> PDF2XHTML fails to insert <p> nor space around page marker
> ---
>
>     Key: TIKA-742
>     URL: https://issues.apache.org/jira/browse/TIKA-742
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Reporter: Michael McCandless
>     Assignee: Michael McCandless
>     Fix For: 1.0
>     Attachments: 86.pdf
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (<p> nor space) before the next
> word. So I have words like:
> * 1Massachusetts
> * 2Course
> * 3also
> * 4The
> But then when I ran the ExtractText -html command-line from PDFBox, I
> can see that <p> is inserted after these page numbers (spookily, not
> closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML, to have it override the
> writeStart/EndParagraph, and call handler.start/EndElement("p"), i.e. to
> preserve the paragraph structure that PDFBOX detects out to the
> resulting XHTML handler, and this fixes the issue (I now see the page
> number as a separate paragraph, rendered w/ newline in "text" mode
> from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).
[jira] [Created] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker
PDF2XHTML fails to insert <p> nor space around page marker
---

    Key: TIKA-742
    URL: https://issues.apache.org/jira/browse/TIKA-742
    Project: Tika
    Issue Type: Bug
    Components: parser
    Reporter: Michael McCandless
    Assignee: Michael McCandless
    Fix For: 1.0
    Attachments: 86.pdf

I have a test document (unfortunately not committable) whose page
numbers are rendered with no separator (<p> nor space) before the next
word. So I have words like:

* 1Massachusetts
* 2Course
* 3also
* 4The

But then when I ran the ExtractText -html command-line from PDFBox, I
can see that <p> is inserted after these page numbers (spookily, not
closing the previous <p>; I opened PDFBOX-1130 for that).

So I made a simple change to Tika's PDF2XHTML, to have it override the
writeStart/EndParagraph, and call handler.start/EndElement("p"), i.e. to
preserve the paragraph structure that PDFBOX detects out to the
resulting XHTML handler, and this fixes the issue (I now see the page
number as a separate paragraph, rendered w/ newline in "text" mode
from TikaCLI).

Note that this test document is the same document from PDFBOX-1129
(there are some quote characters that are not extracted correctly).
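The idea of forwarding paragraph boundaries to the XHTML handler can be sketched without PDFBox or Tika on the classpath. The following is only an illustration under stated assumptions, not the TIKA-742 patch: ParagraphForwarder, paragraphStart, paragraphEnd, text, and render are hypothetical names standing in for the paragraph hooks the description mentions, and the recording handler is a plain SAX DefaultHandler from the JDK.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: turns paragraph boundary callbacks into <p> SAX events.
class ParagraphForwarder {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final ContentHandler handler;

    ParagraphForwarder(ContentHandler handler) {
        this.handler = handler;
    }

    // Stand-in for the "paragraph start" hook: opens a <p> element.
    void paragraphStart() throws SAXException {
        handler.startElement(XHTML, "p", "p", new AttributesImpl());
    }

    // Stand-in for the "paragraph end" hook: closes the <p> element.
    void paragraphEnd() throws SAXException {
        handler.endElement(XHTML, "p", "p");
    }

    // Passes extracted text through unchanged.
    void text(String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    // Feeds each string through as its own paragraph and records the events as markup.
    static String render(String... paragraphs) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        ParagraphForwarder forwarder = new ParagraphForwarder(recorder);
        try {
            for (String p : paragraphs) {
                forwarder.paragraphStart();
                forwarder.text(p);
                forwarder.paragraphEnd();
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The page number and the following word end up in separate paragraphs.
        System.out.println(render("1", "Massachusetts"));
    }
}
```

With the boundaries forwarded, a page number such as "1" and the following word "Massachusetts" arrive at the handler as separate paragraphs instead of the fused "1Massachusetts" reported in the issue.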