[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-10-04 Thread Mark Kerzner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120648#comment-13120648
 ] 

Mark Kerzner commented on TIKA-623:
---

Hi, everybody,

I have forked Richard Johnson's java-libpst project on GitHub: 
https://github.com/markkerzner/JavaLibpst. My reasons for doing this are as 
follows:

1. I need java-libpst's parsing capabilities for my FreeEed project, 
https://github.com/markkerzner/FreeEed.
2. I want it in Maven for FreeEed's purposes, and later on I would be happy to 
see it included in Tika, which also needs it in Maven.
3. I want it in active development, and Richard told me that he has less time 
for it than before.
4. By no means do I want to take the glory or the project away from Richard, 
but it is one of the keys to FreeEed's adoption on Windows.

I am in touch with Richard about all of this, but I would like the community's 
feedback. Should I continue? Should I publish it to a Maven repository? I have 
been working with Carl Byington and know his libpst somewhat, so that 
additional background should help. How should I proceed?

Thank you.



> Add support for Outlook PST
> ---
>
>                 Key: TIKA-623
>                 URL: https://issues.apache.org/jira/browse/TIKA-623
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Tran Nam Quang
>         Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-04 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1311#comment-1311
 ] 

Michael McCandless commented on TIKA-733:
-

Thank you Jeremy!  Keep the patches coming ;)

> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> --
>
>                 Key: TIKA-733
>                 URL: https://issues.apache.org/jira/browse/TIKA-733
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Michael McCandless
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempts a removeLast() on the groupStates() 
> list when the list is empty.  The patch adds a check that skips this logic 
> when the list is empty, so the group state is not restored in that case. 
> Text extraction now completes without further downstream errors.
> I am unable to include a sample file due to the sensitive nature of its contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
>   at java.util.LinkedList.remove(LinkedList.java:788)
>   at java.util.LinkedList.removeLast(LinkedList.java:144)
>   at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
>   at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
>   at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 45 more
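
The guard described in the report can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: the class GroupStateGuard, the GroupState type, and the depth field are hypothetical stand-ins for the parser's real state. The only point shown is that checking isEmpty() before removeLast() avoids the NoSuchElementException on an unmatched group end.

```java
import java.util.LinkedList;

public class GroupStateGuard {
    /** Hypothetical stand-in for the parser's per-group state. */
    static final class GroupState {
        final int depth;
        GroupState(int depth) { this.depth = depth; }
    }

    private final LinkedList<GroupState> groupStates = new LinkedList<>();
    private GroupState current = new GroupState(0);

    /** Called on '{': save the current state and enter a nested group. */
    public void processGroupStart() {
        groupStates.add(current);
        current = new GroupState(current.depth + 1);
    }

    /**
     * Called on '}': restore the saved state if there is one.
     * Returns false (and leaves the state untouched) for an unmatched '}',
     * instead of letting removeLast() throw NoSuchElementException.
     */
    public boolean processGroupEnd() {
        if (groupStates.isEmpty()) {
            return false;
        }
        current = groupStates.removeLast();
        return true;
    }

    public int depth() { return current.depth; }

    public static void main(String[] args) {
        GroupStateGuard g = new GroupStateGuard();
        g.processGroupStart();
        System.out.println(g.processGroupEnd()); // matched '}'
        System.out.println(g.processGroupEnd()); // unmatched '}', skipped
    }
}
```

The trade-off is the one the report mentions: for malformed input the group state is simply not restored, which keeps extraction going at the cost of possibly stale formatting state.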





[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker

2011-10-04 Thread Michael McCandless (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-742:


Attachment: TIKA-742.patch

Patch.

> PDF2XHTML fails to insert <p> nor space around page marker
> --
>
>                 Key: TIKA-742
>                 URL: https://issues.apache.org/jira/browse/TIKA-742
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: 86.pdf, TIKA-742.patch
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (<p> nor space) before the next
> word.  So I have words like:
>   * 1Massachusetts
>   * 2Course
>   * 3also
>   * 4The
> But then when I ran the ExtractText -html command-line from PDFBox, I
> can see that <p> is inserted after these page numbers (spookily, not
> closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML, to have it override the
> writeStart/EndParagraph, and call handler.start/EndElement("p"), ie to
> preserve the paragraph structure that PDFBOX detects out to the
> resulting XHTML handler, and this fixes the issue (I now see the page
> number as a separate paragraph, rendered w/ newline in "text" mode
> from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).





[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker

2011-10-04 Thread Michael McCandless (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-742:


Attachment: 86.pdf

PDF doc showing the issue (unfortunately not committable).

> PDF2XHTML fails to insert <p> nor space around page marker
> --
>
>                 Key: TIKA-742
>                 URL: https://issues.apache.org/jira/browse/TIKA-742
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: 86.pdf
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (<p> nor space) before the next
> word.  So I have words like:
>   * 1Massachusetts
>   * 2Course
>   * 3also
>   * 4The
> But then when I ran the ExtractText -html command-line from PDFBox, I
> can see that <p> is inserted after these page numbers (spookily, not
> closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML, to have it override the
> writeStart/EndParagraph, and call handler.start/EndElement("p"), ie to
> preserve the paragraph structure that PDFBOX detects out to the
> resulting XHTML handler, and this fixes the issue (I now see the page
> number as a separate paragraph, rendered w/ newline in "text" mode
> from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).





[jira] [Created] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker

2011-10-04 Thread Michael McCandless (Created) (JIRA)
PDF2XHTML fails to insert <p> nor space around page marker
--

                 Key: TIKA-742
                 URL: https://issues.apache.org/jira/browse/TIKA-742
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 1.0
         Attachments: 86.pdf

I have a test document (unfortunately not committable) whose page
numbers are rendered with no separator (<p> nor space) before the next
word.  So I have words like:
  * 1Massachusetts
  * 2Course
  * 3also
  * 4The

But then when I ran the ExtractText -html command-line from PDFBox, I
can see that <p> is inserted after these page numbers (spookily, not
closing the previous <p>; I opened PDFBOX-1130 for that).

So I made a simple change to Tika's PDF2XHTML, to have it override the
writeStart/EndParagraph, and call handler.start/EndElement("p"), ie to
preserve the paragraph structure that PDFBOX detects out to the
resulting XHTML handler, and this fixes the issue (I now see the page
number as a separate paragraph, rendered w/ newline in "text" mode
from TikaCLI).

Note that this test document is the same document from PDFBOX-1129
(there are some quote characters that are not extracted correctly).
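
The idea of forwarding paragraph boundaries to the XHTML handler might look roughly like the sketch below. It is not the attached patch and does not use PDFBox or Tika classes: SketchStripper, XhtmlStripper, TextHandler, and the hook names writeParagraphStart/writeParagraphEnd are hypothetical stand-ins chosen for illustration, and only the JDK's org.xml.sax types are real. Each detected paragraph is emitted as a "p" element, so a text-mode consumer can put the page number on its own line.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ParagraphSketch {
    /** Hypothetical stand-in for a text stripper that exposes paragraph hooks. */
    abstract static class SketchStripper {
        protected abstract void writeParagraphStart() throws SAXException;
        protected abstract void writeParagraphEnd() throws SAXException;
        protected abstract void writeString(String text) throws SAXException;

        /** Treats each argument as one detected paragraph. */
        void process(String... paragraphs) throws SAXException {
            for (String p : paragraphs) {
                writeParagraphStart();
                writeString(p);
                writeParagraphEnd();
            }
        }
    }

    /** Overrides the hooks to emit start/endElement("p") on the SAX handler. */
    static final class XhtmlStripper extends SketchStripper {
        private final ContentHandler handler;
        XhtmlStripper(ContentHandler handler) { this.handler = handler; }

        @Override protected void writeParagraphStart() throws SAXException {
            handler.startElement("", "p", "p", new AttributesImpl());
        }
        @Override protected void writeParagraphEnd() throws SAXException {
            handler.endElement("", "p", "p");
        }
        @Override protected void writeString(String text) throws SAXException {
            handler.characters(text.toCharArray(), 0, text.length());
        }
    }

    /** Plain-text consumer: appends a newline after every closed paragraph. */
    static final class TextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        @Override public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
        @Override public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                out.append('\n');
            }
        }
    }

    /** Runs the sketch end to end and returns the text-mode rendering. */
    public static String render(String... paragraphs) {
        TextHandler handler = new TextHandler();
        try {
            new XhtmlStripper(handler).process(paragraphs);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        // The page number and the following word land on separate lines.
        System.out.print(render("1", "Massachusetts"));
    }
}
```

Without the paragraph events, the same two strings would reach the text handler back to back and come out as "1Massachusetts", which is the symptom described above.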

