[jira] [Commented] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170399#comment-15170399 ] Hudson commented on TIKA-1860: -- UNSTABLE: Integrated in tika-2.x #34 (See [https://builds.apache.org/job/tika-2.x/34/]) TIKA-1860 - Added Bundle config to advanced, cad, code, crypto (bob: rev 8a5923dd6a42f4b4c09ec186ef357f446e9ae599) * tika-parser-modules/tika-parser-crypto-module/src/test/java/org/apache/tika/module/BundleIT.java * tika-parser-modules/tika-parser-advanced-module/src/test/java/org/apache/tika/module/BundleIT.java * tika-parser-modules/tika-parser-cad-module/src/test/java/org/apache/tika/module/BundleIT.java * tika-parser-modules/tika-parser-crypto-module/pom.xml * tika-parser-modules/tika-parser-multimedia-module/pom.xml * tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/module/advanced/internal/Activator.java * tika-parser-modules/tika-parser-code-module/pom.xml * tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/module/cad/internal/Activator.java * tika-parser-modules/tika-parser-advanced-module/pom.xml * tika-parser-modules/tika-parser-code-module/src/test/java/org/apache/tika/module/BundleIT.java * tika-parser-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/module/crypto/internal/Activator.java * tika-parser-modules/tika-parser-cad-module/pom.xml * tika-parser-modules/pom.xml * tika-parser-modules/tika-parser-code-module/src/main/java/org/apache/tika/module/code/internal/Activator.java > Tika 2.0 - Create Module OSGi implementations to replace tika-bundle > > > Key: TIKA-1860 > URL: https://issues.apache.org/jira/browse/TIKA-1860 > Project: Tika > Issue Type: Sub-task >Reporter: Bob Paulin >Assignee: Bob Paulin > > Create a replacement for the OSGi tika-bundle project out of the new > tika-parser-* modules -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170086#comment-15170086 ] Luis Filipe Nassif edited comment on TIKA-1865 at 2/26/16 11:31 PM: I do not know if including the email into MESSAGE_TO will break backwards compatibility, because currently when there is no nickname, the email already goes there. The docs say nothing about the expected value and at least the RFC822Parser and MboxParser already put both name and email into that key. So, I think putting the email info into MESSAGE_(TO/CC/BCC) of MSG files will make things more consistent across parsers, that is why I suggested putting both values into those keys. was (Author: lfcnassif): I do not know if including the email into MESSAGE_TO will break backwards compatibility, because currently when there is no nickname, the email already goes there. The docs say nothing about the expected value and at least the RFC822Parser and MboxParser already put both name and email into that key. So, I think putting the email info into MESSAGE_(TO/CC/BCC) of MSG files will make things more consistent across parsers. > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170086#comment-15170086 ] Luis Filipe Nassif commented on TIKA-1865: -- I do not know if including the email into MESSAGE_TO will break backwards compatibility, because currently when there is no nickname, the email already goes there. The docs say nothing about the expected value and at least the RFC822Parser and MboxParser already put both name and email into that key. So, I think putting the email info into MESSAGE_(TO/CC/BCC) of MSG files will make things more consistent across parsers.
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170035#comment-15170035 ] Luis Filipe Nassif commented on TIKA-1865: -- Does Outlook display the sender's name or email for testMSG_chinese.msg? I think all msg files should keep the sender's email somewhere, not necessarily in header_from. It looks like POI must be patched for a complete solution, as Nick said. And I do not know anything about POI source code, unfortunately...
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169968#comment-15169968 ] Jeremy B. Merrill commented on TIKA-1865: - My heart wants to say yes, but my calendar says no. :) Or at least not with any time super soon. You're right that this is a ticket that's interesting to me, though. I did just get my own dump of real-life .msg files (not shareable, unfortunately) and I've noticed how senders' email addresses seem to get lost, which is a pain... Is this just a feature that is not yet implemented? Or is there an underlying reason why? (Funnily enough, it matches the behavior of Outlook printouts, which gives you only the sender's alias, not their address -- including, most annoyingly for me, in the dumps of Hillary Clinton's emails that the State Dept. has been releasing.) Do we know if all the various email formats include the sender's email address, so it'd be theoretically accessible to Tika somehow? What even are all the formats for emails that Tika handles? Outlook (PST/MSG), .eml/rfc822, mbox, anything else?
[jira] [Updated] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasad Nagaraj Subramanya updated TIKA-1877: Attachment: 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984 tika-mimetypes.xml Attached the changed tika-mimetypes.xml and .fts file > On updating the tika-mimetypes.xml to detect .fts file format, tika detector > does not return anything > - > > Key: TIKA-1877 > URL: https://issues.apache.org/jira/browse/TIKA-1877 > Project: Tika > Issue Type: Bug > Components: mime >Reporter: Prasad Nagaraj Subramanya >Priority: Minor > Attachments: > 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, > tika-mimetypes.xml > > > The match value for .fts file format in tika-mimetypes.xml is "SIMPLE = > T". > Tika detected a .fts file as application/octet-stream. On verifying the > header I found the value to be "SIMPLE =T"(just 16 spaces > before = and T) > I tried the following changes- > Change 1) Updated the existing match value. But the build failed > Change 2) Added a new match value type="string" offset="0"/> after the existing one. > But now, tika returns empty value. It neither identifies the file as .fts nor > as application/octet-stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything
Prasad Nagaraj Subramanya created TIKA-1877: --- Summary: On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything Key: TIKA-1877 URL: https://issues.apache.org/jira/browse/TIKA-1877 Project: Tika Issue Type: Bug Components: mime Reporter: Prasad Nagaraj Subramanya Priority: Minor The match value for .fts file format in tika-mimetypes.xml is "SIMPLE = T". Tika detected a .fts file as application/octet-stream. On verifying the header I found the value to be "SIMPLE =T"(just 16 spaces before = and T) I tried the following changes- Change 1) Updated the existing match value. But the build failed Change 2) Added a new match value after the existing one. But now, tika returns empty value. It neither identifies the file as .fts nor as application/octet-stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
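Editorial sketch for the TIKA-1877 report above: the problem described is that an exact string match on the header fails when the number of spaces around "=" differs. The Java below only illustrates that whitespace sensitivity with a space-tolerant check. The class and method names are hypothetical, and this is not how Tika's magic matching is implemented.

```java
import java.util.regex.Pattern;

public class FitsHeaderSketch {

    // Hypothetical helper, not Tika code: accept "SIMPLE", then any run of
    // spaces, "=", any run of spaces, and "T" at the very start of the header.
    private static final Pattern SIMPLE_TRUE = Pattern.compile("^SIMPLE *= *T");

    static boolean looksLikeFitsHeader(String headerStart) {
        return headerStart != null && SIMPLE_TRUE.matcher(headerStart).find();
    }

    public static void main(String[] args) {
        // The same keyword/value pair with different amounts of padding.
        System.out.println(looksLikeFitsHeader("SIMPLE  =                    T"));
        System.out.println(looksLikeFitsHeader("SIMPLE =T"));
        // A header that does not start with the SIMPLE keyword.
        System.out.println(looksLikeFitsHeader("COMPLEX = T"));
    }
}
```

An exact-string comparison would accept only one of the first two inputs, which is the mismatch the reporter ran into.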
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169395#comment-15169395 ] Tim Allison commented on TIKA-1857: --- I implemented a first-attempt XFA scraper with StAX; this pulls the content from the fields that Pascal identified into the ContentHandler, and it merges the "values" from the data section with the fields section. Currently, if XFA exists, I process that and skip the AcroForm data. I'm not certain what the best path is for ignoring/processing content extracted from the "regular" PDF if there is XFA data. For now, I'm also processing the contents of the rest of the PDF. I'm more averse to losing data than to duplication because my main use case is search...but I realize this will be really frustrating to users who want "just one copy" of the content. In looking at the pdfs with xfa data in govdocs1, it looks like there would be lost content in _some_ files if we processed only the XFA and did not do the regular text extraction. On the other hand, for most of the files I examined, it looked like the content is entirely duplicative -- [~pascal.essiembre]'s point above. I propose adding a parameter to the PDFParserConfig along the lines of {{ifXFAExistsProcessItAlone}}...this would allow the behavior of Pascal's patch. I propose that the default be set to "false", erring on the side of extracting more content at the cost of duplication. Is this ok? Or, is there an easy way to determine if regular content is entirely duplicative of XFA content? > Enhance PDFParser to extract text from XFA forms > > > Key: TIKA-1857 > URL: https://issues.apache.org/jira/browse/TIKA-1857 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Pascal Essiembre >Priority: Trivial > Labels: patch > Fix For: 1.13 > > Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, > xfa_in_govdocs1.txt > > > Extract text from PDF Forms (XFA). 
Information about XFA: > https://en.wikipedia.org/wiki/XFA -- This message was sent by Atlassian JIRA (v6.3.4#6332)
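Editorial sketch for the TIKA-1857 comment above: a minimal illustration of scraping field text with StAX and of the proposed ifXFAExistsProcessItAlone switch. The XML element layout (field/value/text), the class name, and both method names are simplifying assumptions made for this example. Real XFA is more involved, and this is not the code from the patch under discussion.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XfaFieldSketch {

    // Models the proposed flag: regular PDF text is skipped only when XFA
    // is present AND ifXFAExistsProcessItAlone is true (proposed default: false).
    static boolean extractRegularText(boolean hasXfa, boolean ifXFAExistsProcessItAlone) {
        return !(hasXfa && ifXFAExistsProcessItAlone);
    }

    // Collects "name: text" for every <text> nested in a named <field>.
    // Fields without a name attribute are skipped in this sketch.
    static List<String> fieldValues(String xml) {
        List<String> values = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs for untrusted input
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            String currentField = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("field".equals(name)) {
                        currentField = reader.getAttributeValue(null, "name");
                    } else if ("text".equals(name) && currentField != null) {
                        // getElementText() consumes up to the matching end tag.
                        values.add(currentField + ": " + reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "field".equals(reader.getLocalName())) {
                    currentField = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML in sketch input", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<form><field name=\"first\"><value><text>Ada</text></value></field></form>";
        System.out.println(fieldValues(xml));
        System.out.println(extractRegularText(true, false));
    }
}
```

With the flag left at the proposed default of false, both the XFA text and the regular text are extracted, which favours completeness over avoiding duplicates.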
Re: [jira] [Commented] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files
Hi Nick, I have opened a pull request for the issue - https://github.com/apache/tika/pull/78 Thanks, Prasad On Fri, Feb 26, 2016 at 2:47 AM, Nick Burch (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/TIKA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168810#comment-15168810 > ] > > Nick Burch commented on TIKA-1875: > -- > > No patch or pull request was attached. You can attach a patch using "More" > then "Attach Files", otherwise if you use github you can share the patch > through opening a pull request > > > Updating tika-mimetypes.xml to detect .NC files > > > > > > Key: TIKA-1875 > > URL: https://issues.apache.org/jira/browse/TIKA-1875 > > Project: Tika > > Issue Type: Improvement > > Components: mime > >Affects Versions: 1.12 > >Reporter: Prasad Nagaraj Subramanya > >Priority: Minor > > Labels: patch > > Fix For: 1.11 > > > > > > Adding magic number to detect .NC files > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169009#comment-15169009 ] Tim Allison commented on TIKA-1865: --- Completely agree on all counts. Did not mean to suggest breaking backwards compat! And, y, this will require mods to mbox, etc. Thank you! bq. find a suitable metadata scheme Any recommendations? bq. add additional keys that hold the email addresses and the names in a way that they can be helpfully associated together? Until TIKA-1607 is solved, perhaps parallel arrays for something like these metadata keys: "MESSAGE_TO_EMAIL", "MESSAGE_TO_NAME"?
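Editorial sketch of the "parallel arrays" idea in the comment above: names and addresses go into two multi-valued keys and are associated by position. A plain Map stands in for the metadata store here. The class and method names are hypothetical, and only the two key names come from the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParallelRecipientKeys {

    // Key names suggested in the comment; everything else is illustrative.
    static final String TO_NAME = "MESSAGE_TO_NAME";
    static final String TO_EMAIL = "MESSAGE_TO_EMAIL";

    // Always adds to BOTH lists so index i in one list lines up with
    // index i in the other; a missing part is stored as an empty string.
    static void addRecipient(Map<String, List<String>> metadata, String name, String email) {
        metadata.computeIfAbsent(TO_NAME, k -> new ArrayList<>()).add(name == null ? "" : name);
        metadata.computeIfAbsent(TO_EMAIL, k -> new ArrayList<>()).add(email == null ? "" : email);
    }

    // Re-associates the i-th name with the i-th address.
    static String recipientAt(Map<String, List<String>> metadata, int i) {
        return metadata.get(TO_NAME).get(i) + " <" + metadata.get(TO_EMAIL).get(i) + ">";
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        addRecipient(metadata, "Some User", "user@example.org");
        addRecipient(metadata, null, "other@example.org");
        System.out.println(recipientAt(metadata, 0));
        System.out.println(metadata);
    }
}
```

The empty-string placeholder is the design point: if a recipient with no name were simply skipped in one list, the positions would drift and the association would be lost.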
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169005#comment-15169005 ] Nick Burch commented on TIKA-1865: -- Whatever we do, matching changes should be made to the other Email file format parsers to keep things consistent I'm not sure we should be changing the existing keys to suddenly hold different values, that'll break backwards compatibility and likely confuse existing users Maybe we should find a suitable metadata scheme for this, and add additional keys that hold the email addresses and the names in a way that they can be helpfully associated together?
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168996#comment-15168996 ] Tim Allison commented on TIKA-1865: --- [~jeremybmerrill], any interest in this? Want to contribute?
[jira] [Comment Edited] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168945#comment-15168945 ] Tim Allison edited comment on TIKA-1865 at 2/26/16 1:17 PM: With the handful of MSG files in our "test-documents", I get this: {noformat} test-outlook2003.msg emailFromChunk:olt...@microsoft.com header_from:null testMSG.msg emailFromChunk:jukka.zitt...@gmail.com header_from:From: Jukka Zitting testMSG_att_doc.msg emailFromChunk:nicolas1.23...@free.fr header_from:null testMSG_att_msg.msg emailFromChunk:/O=PHILLIPS ORMONDE AND FITZPATRICK/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=NICK.BOOTH header_from:From: Nick Booth testMSG_chinese.msg emailFromChunk:/O=FT GROUP/OU=FT/CN=RECIPIENTS/CN=LYDIACHANG header_from:null testMSG_forwarded.msg emailFromChunk:/O=OEXCH018/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=PAUL_METAJURE header_from:From: Paul Allan Hill {noformat} Perhaps a strategy of try emailFromChunk and then back off to a regex on the header {{From}} if that's there? That would get a "regular" email address from the above except for {{testMSG_chinese.msg}}. Or, is the exchange info useful to you if that's all we can get, as well? 
was (Author: talli...@mitre.org): With the handful of MSG files in our "test-documents", I get this: {noformat} test-outlook2003.msg : olt...@microsoft.com testMSG.msg : jukka.zitt...@gmail.com testMSG_att_doc.msg : nicolas1.23...@free.fr testMSG_att_msg.msg : /O=PHILLIPS ORMONDE AND FITZPATRICK/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=NICK.BOOTH testMSG_chinese.msg : /O=FT GROUP/OU=FT/CN=RECIPIENTS/CN=LYDIACHANG testMSG_forwarded.msg : /O=OEXCH018/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=PAUL_METAJURE {noformat}
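Editorial sketch of the fallback strategy proposed in the edited comment above: use the value from the sender chunk when it already looks like an SMTP address, otherwise search the From header with a regex, otherwise report nothing. The class name, method name, and the deliberately loose address pattern are assumptions for illustration. This is not the Tika or POI implementation.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SenderEmailSketch {

    // Loose, illustrative pattern; it is not a full RFC 5322 address grammar.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static Optional<String> resolveSender(String emailFromChunk, String headerFrom) {
        // Step 1: the chunk value wins if it is a plain address.
        if (emailFromChunk != null && EMAIL.matcher(emailFromChunk).matches()) {
            return Optional.of(emailFromChunk);
        }
        // Step 2: back off to the first address found in the From header.
        if (headerFrom != null) {
            Matcher m = EMAIL.matcher(headerFrom);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        // Step 3: only Exchange-style info (or nothing) is available.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolveSender("someone@example.com", null));
        System.out.println(resolveSender("/O=EXAMPLE/OU=GROUP/CN=RECIPIENTS/CN=USER",
                "From: Some User <user@example.org>"));
        System.out.println(resolveSender("/O=EXAMPLE/OU=GROUP/CN=RECIPIENTS/CN=USER", null));
    }
}
```

The third call corresponds to the {{testMSG_chinese.msg}} situation described above, where the chunk holds Exchange info and no From header is present. Whether to surface the Exchange string in that case is the open question in the comment.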
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168951#comment-15168951 ] Tim Allison commented on TIKA-1865: --- And if you are interested in working on a patch for this, we now have ~3800 msg files that I pulled with [~centic]'s CommonCrawlDocumentDownload tool...in addition to what we had in our slice of CommonCrawl and govdocs1.
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168946#comment-15168946 ] Tim Allison commented on TIKA-1865: --- Yes and yes...any interest in submitting a patch? If you're interested in this info, you might also be interested in TIKA-1759, a low priority for me at the time, but that could change if there was interest from the community.
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168949#comment-15168949 ] Tim Allison commented on TIKA-1865: --- Thank you, Nick.