[ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-197.
--------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Thanks for reporting this!
This issue was caused by the OfficeParser class using a special pattern for
detecting Outlook-specific entries inside Microsoft's OLE2 container format.
Outlook-specific parsing was triggered every time an internal entry matching
the pattern was detected, so a file with several such entries was parsed
several times. Our previous test .msg file contained only one such entry, so we
never saw this issue, but apparently it's possible and even likely for Outlook
files to contain multiple such entries.
I fixed the issue in revision 742187 simply by introducing a special marker
flag that prevents the Outlook extractor from being fired more than once per
document being parsed. It's a bit ugly, but it works. :-)
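
For illustration, here is a minimal sketch of the marker-flag idea. It is not
the code from revision 742187: the class name, the parse method, and the
"__substg1.0_" prefix used to recognise Outlook entries are all assumptions
made for this example, and the real extractor call is replaced by a counter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the entry loop; not the actual OfficeParser code.
final class OutlookOnceSketch {

    // Assumed prefix for Outlook-specific entries. The real pattern used by
    // OfficeParser is not quoted in this message.
    static final String OUTLOOK_PREFIX = "__substg1.0_";

    // Walks the entry names and returns how many times the (simulated)
    // Outlook extractor was invoked.
    static int parse(List<String> entryNames) {
        boolean outlookExtracted = false; // marker flag, reset per document
        int extractorRuns = 0;
        for (String name : entryNames) {
            if (name.startsWith(OUTLOOK_PREFIX) && !outlookExtracted) {
                outlookExtracted = true; // block any later matching entry
                extractorRuns++;         // stand-in for the extractor call
            }
        }
        return extractorRuns;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "__substg1.0_0037001E",
                "__substg1.0_1000001E",
                "__properties_version1.0");
        System.out.println("Outlook extractor runs: " + parse(entries));
    }
}
```

Without the flag, the counter would go up once per matching entry, which is
the behaviour described in the report.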
> Microsoft Outlook (msg) files get parsed multiple times
> -------------------------------------------------------
>
> Key: TIKA-197
> URL: https://issues.apache.org/jira/browse/TIKA-197
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.3
> Reporter: kumar raja jana
> Assignee: Jukka Zitting
> Fix For: 0.3
>
> Attachments: MIME.msg
>
>
> Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI