[jira] [Created] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition
Manali Shah created TIKA-1876: - Summary: Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition Key: TIKA-1876 URL: https://issues.apache.org/jira/browse/TIKA-1876 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.13 Reporter: Manali Shah Fix For: 1.13 Hi all, Apache Tika already performs Named Entity Recognition (NER) using OpenNLP and Stanford CoreNLP. The Natural Language Toolkit (NLTK) is another open source Python library, and I believe it would be a great idea to integrate it with Tika. NLTK can extract named entities as well as classify them. For this purpose, Prof. Chris Mattmann and I have published NLTKRest, a Python pip/setuptools-installable module that exposes NLTK as a REST service. I have tested Tika together with NLTKRest on my local repository and will soon submit a pull request. Link to the REST server: https://github.com/manalishah/NLTKRest -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: Tika 1875
GitHub user prasadns14 opened a pull request: https://github.com/apache/tika/pull/78 Tika 1875 Updated netcdf mime type magic number File - tika-mimetypes.xml You can merge this pull request into a Git repository by running: $ git pull https://github.com/prasadns14/tika TIKA-1875 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/78.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #78 commit 3610d7db1ec6b75ca1927c49b97452333a236a0f Author: Chris Mattmann Date: 2015-10-19T06:21:36Z [maven-release-plugin] copy for tag 1.11-rc1 git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.11-rc1@1709359 13f79535-47bb-0310-9956-ffa450edef68 commit 07413364e52299a9d8ac9585e7e3893ca92dea2f Author: prasadns14 Date: 2016-02-26T02:37:50Z fix for TIKA-1875 contributed by prasadns14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files
Prasad Nagaraj Subramanya created TIKA-1875: --- Summary: Updating tika-mimetypes.xml to detect .NC files Key: TIKA-1875 URL: https://issues.apache.org/jira/browse/TIKA-1875 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.12 Reporter: Prasad Nagaraj Subramanya Priority: Minor Fix For: 1.11 Adding magic number to detect .NC files -- This message was sent by Atlassian JIRA (v6.3.4#6332)
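For readers unfamiliar with magic-number detection, here is a minimal Java sketch of the idea. It assumes the classic NetCDF layout, where a file starts with the ASCII bytes "CDF" followed by a version byte (1 for the classic format, 2 for the 64-bit offset format). The class and method names are hypothetical, and the actual change in this issue is an entry in tika-mimetypes.xml that is not shown in this message.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: checks the classic NetCDF magic number. */
final class NetcdfMagic {
    // Classic NetCDF files start with the ASCII bytes "CDF" followed by a version byte.
    private static final byte[] PREFIX = "CDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first four bytes look like a classic NetCDF header. */
    static boolean looksLikeClassicNetcdf(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (header[i] != PREFIX[i]) {
                return false;
            }
        }
        // Assumption of this sketch: only versions 1 (classic) and 2 (64-bit offset) are accepted.
        byte version = header[3];
        return version == 1 || version == 2;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeClassicNetcdf(new byte[] {'C', 'D', 'F', 1}));
    }
}
```

Other NetCDF variants (for example HDF5-based NetCDF-4 files) start with different bytes and are outside the scope of this sketch.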
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168291#comment-15168291 ] Luis Filipe Nassif commented on TIKA-1865: -- Also, what do you think about including the recipient names AND their email addresses in the MESSAGE_TO, MESSAGE_CC and MESSAGE_BCC metadata, so users could know the recipient type (to, cc, bcc) of each email? That is not possible with the current approach, which puts all recipient addresses together in MESSAGE_RECIPIENT_ADDRESS. > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
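The per-type recipient idea in the comment above could look roughly like the following Java sketch. It uses a plain map instead of Tika's Metadata class so that it stays self-contained; the "Name <address>" value format and all class and method names are assumptions made for illustration, and only the MESSAGE_TO, MESSAGE_CC and MESSAGE_BCC key names come from the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: store each recipient under a key that records its type. */
final class RecipientMetadataSketch {
    enum Type { TO, CC, BCC }

    /** Formats a recipient as "Name <address>", or "<address>" when no name is known. */
    static String format(String name, String address) {
        if (name == null || name.isEmpty()) {
            return "<" + address + ">";
        }
        return name + " <" + address + ">";
    }

    /** Appends the recipient to the multi-valued entry for its type (MESSAGE_TO, MESSAGE_CC, MESSAGE_BCC). */
    static void add(Map<String, List<String>> metadata, Type type, String name, String address) {
        String key = "MESSAGE_" + type.name();
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(format(name, address));
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        add(metadata, Type.TO, "Jane Doe", "jane@example.org");
        add(metadata, Type.CC, null, "team@example.org");
        System.out.println(metadata);
    }
}
```

With this layout a consumer can tell the recipient type from the key alone, which a single combined address list does not allow.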
[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1857: -- Attachment: govdocs1_xfas.zip 194 xfas from govdocs1 as exported with PDFBox 2.0 (trunk built from within the last few weeks). > Enhance PDFParser to extract text from XFA forms > > > Key: TIKA-1857 > URL: https://issues.apache.org/jira/browse/TIKA-1857 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Pascal Essiembre >Priority: Trivial > Labels: patch > Fix For: 1.13 > > Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, > xfa_in_govdocs1.txt > > > Extract text from PDF Forms (XFA). Information about XFA: > https://en.wikipedia.org/wiki/XFA -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168075#comment-15168075 ] Nick Burch commented on TIKA-1855: -- I'm not actually sure we need to do the unzipping thing. I think that most of the unit tests that need to check for {{File}} vs {{InputStream}} differences can be / should be / are in Tika Core. If we put the test documents under the Tika Core resources folder (and hence in the tika-core-tests jar), those can be accessed as either. The handful of other tests elsewhere that need files can use the existing helpers (maybe nicer wrapped) to get their 1 or 2 files spooled out to a temp File for File checking. We generally tell off anyone adding very large test files, and that's worked fairly well so far, even with tika-parsers/src/test/resources/test-documents working as our de facto "dumping ground" :) > TIka 2.0 - Move shared test-code back to tika-core and distribute test files > to parser modules > -- > > Key: TIKA-1855 > URL: https://issues.apache.org/jira/browse/TIKA-1855 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Tim Allison > > Undo TIKA-1851, and divide test docs to appropriate parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167988#comment-15167988 ] Luis Filipe Nassif commented on TIKA-1865: -- Hi [~talli...@apache.org]! I think MAPIMessage.getMainChunks().emailFromChunk already has that info, or is it not populated in all cases? It worked with my small corpus. > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167891#comment-15167891 ] Ken Krugler commented on TIKA-1855: --- The things I don't like about this approach are that (a) core becomes a dumping ground for everyone's test data, and (b) it couples module development with the core. Plus I'm waiting for the next crazy parser to be added that has 100MB of binary test data, which will create an el grande jar that everybody is going to be unzipping. So I guess I'd add scalability as another concern. I haven't looked into where test files wind up, but I'd suspect that many of the core tests that wind up needing to be in parsers due to data dependencies aren't really the tests that should be run in core. I can see mime-type detection being an example of wanting to have one of each, and (maybe) some of the app/server tests, so I'd be fine with having a tika-test-corpus (or whatever you want to call it) that has a good sampling of docs which are used in these situations. Finally, to make myself really popular, I'd prefer that we use the jar as a test dependency (vs. zip/unzip), and for cases where we need to have an actual file then use some utility code to extract/create the file. Maybe we should have a Skype chat to discuss VF2F :) > TIka 2.0 - Move shared test-code back to tika-core and distribute test files > to parser modules > -- > > Key: TIKA-1855 > URL: https://issues.apache.org/jira/browse/TIKA-1855 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Tim Allison > > Undo TIKA-1851, and divide test docs to appropriate parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
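The "utility code to extract/create the file" mentioned in the comment above might be as small as the following Java sketch. It copies a test document from the classpath (for example from a test-scoped jar) to a temporary file for tests that need a real File. The class name, method names and the "tika-test-" prefix are hypothetical; this is not the helper that exists in Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical test helper: spools classpath test documents out to temp files. */
final class TestResourceFiles {

    /** Copies the stream to a new temp file with the given suffix and returns its path. */
    static Path spoolToTempFile(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("tika-test-", suffix);
        tmp.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Looks up a classpath resource relative to the anchor class and spools it to a temp file. */
    static Path resourceToTempFile(Class<?> anchor, String resourceName) throws IOException {
        try (InputStream in = anchor.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            // Keep the original extension so extension-based detection still works.
            String fileName = resourceName.substring(resourceName.lastIndexOf('/') + 1);
            int dot = fileName.lastIndexOf('.');
            String suffix = dot >= 0 ? fileName.substring(dot) : ".tmp";
            return spoolToTempFile(in, suffix);
        }
    }
}
```

A test would call something like `TestResourceFiles.resourceToTempFile(MyTest.class, "/test-documents/example.doc")`, where both the test class and the resource path are placeholders. The stream-based tests keep reading the same resource straight from the jar, so no unzip step is needed.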
Re: parallel dev on trunk and 2.x?
+1 I haven’t fully moved over to 2.x yet b/c I haven’t honestly had time to catch up. I suppose after my class in May I will have time to catch up then and I can focus more on 2.x then. But for me I am doing all my work in 1.x now so keeping up to date would be great. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: "Allison, Timothy B." Reply-To: "dev@tika.apache.org" Date: Thursday, February 25, 2016 at 12:50 PM To: "dev@tika.apache.org" Subject: parallel dev on trunk and 2.x? >All, > Do I understand correctly that we should be committing most changes to >both trunk and 2.x? Obviously, the 2.x commits are for 2.x. :) > Or will merge really, actually, truly work at some point in the future >to merge changes in trunk to 2.x? 
> >Best, > > Tim > >-Original Message- >From: Hudson (JIRA) [mailto:j...@apache.org] >Sent: Thursday, February 25, 2016 1:41 PM >To: dev@tika.apache.org >Subject: [jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager > > >[ >https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620 ] > >Hudson commented on TIKA-1874: >-- > >SUCCESS: Integrated in tika-2.x #31 (See >[https://builds.apache.org/job/tika-2.x/31/]) >TIKA-1874 fix small npe (tallison: rev >5083cc11c6230218ecef7d0161fa92bbf8d317e6) >* >tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java > > >> Fix rare npe in XWPFListManager >> --- >> >> Key: TIKA-1874 >> URL: https://issues.apache.org/jira/browse/TIKA-1874 >> Project: Tika >> Issue Type: Bug >>Reporter: Tim Allison >>Priority: Trivial >> >> Many thanks to [~centic]'s >>[CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], I recently grabbed .docx files from the initial index that >>comes with that code. I'll be adding these docs to our regular >>regression testing for TIKA-1302. >> While running Tika on these ~166k docs, ~30 of those files had an NPE >>in XWPFListManager. We need to add a null check. > > > >-- >This message was sent by Atlassian JIRA >(v6.3.4#6332)
parallel dev on trunk and 2.x?
All, Do I understand correctly that we should be committing most changes to both trunk and 2.x? Obviously, the 2.x commits are for 2.x. :) Or will merge really, actually, truly work at some point in the future to merge changes in trunk to 2.x? Best, Tim -Original Message- From: Hudson (JIRA) [mailto:j...@apache.org] Sent: Thursday, February 25, 2016 1:41 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager [ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620 ] Hudson commented on TIKA-1874: -- SUCCESS: Integrated in tika-2.x #31 (See [https://builds.apache.org/job/tika-2.x/31/]) TIKA-1874 fix small npe (tallison: rev 5083cc11c6230218ecef7d0161fa92bbf8d317e6) * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java > Fix rare npe in XWPFListManager > --- > > Key: TIKA-1874 > URL: https://issues.apache.org/jira/browse/TIKA-1874 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > Many thanks to [~centic]'s > [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], > I recently grabbed .docx files from the initial index that comes with that > code. I'll be adding these docs to our regular regression testing for > TIKA-1302. > While running Tika on these ~166k docs, ~30 of those files had an NPE in > XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167640#comment-15167640 ] Hudson commented on TIKA-1870: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #915 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/915/]) TIKA-1870 refactor RichTextContentHandler into tika-core from (nhoj.patrick: rev 0bd05cec54c581c971d90380304aaa23c9543296) * tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java * tika-core/src/main/java/org/apache/tika/sax/RichTextContentHandler.java * tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java * tika-server/src/main/java/org/apache/tika/server/RichTextContentHandler.java TIKA-1870 JavaDoc and Test coverage for RichTextContentHandler that (nhoj.patrick: rev 3b7922db1a2e72181e1a00168d2aee33bfe1d4a3) * tika-core/src/test/java/org/apache/tika/sax/RichTextContentHandlerTest.java * tika-core/src/main/java/org/apache/tika/sax/RichTextContentHandler.java TIKA-1870 Move RichTextContentHandler from Server to Core, contributed (nick: rev ed762b702875c843d0322b8ba6d05385ca91875d) * CHANGES.txt > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager
[ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620 ] Hudson commented on TIKA-1874: -- SUCCESS: Integrated in tika-2.x #31 (See [https://builds.apache.org/job/tika-2.x/31/]) TIKA-1874 fix small npe (tallison: rev 5083cc11c6230218ecef7d0161fa92bbf8d317e6) * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java > Fix rare npe in XWPFListManager > --- > > Key: TIKA-1874 > URL: https://issues.apache.org/jira/browse/TIKA-1874 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > Many thanks to [~centic]'s > [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], > I recently grabbed .docx files from the initial index that comes with that > code. I'll be adding these docs to our regular regression testing for > TIKA-1302. > While running Tika on these ~166k docs, ~30 of those files had an NPE in > XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167570#comment-15167570 ] ASF GitHub Bot commented on TIKA-1870: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/77 > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1870. -- Resolution: Fixed Thanks for preparing patches for all this work. Merged and pushed! > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: Refector RichTextContentHandler for TIKA-1870 c...
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/77
[jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager
[ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167550#comment-15167550 ] Hudson commented on TIKA-1874: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #914 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/914/]) TIKA-1874 fix potential npe (tallison: rev 0c030081bba17e607f8c79a0b95f72935be93efd) * tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java > Fix rare npe in XWPFListManager > --- > > Key: TIKA-1874 > URL: https://issues.apache.org/jira/browse/TIKA-1874 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > Many thanks to [~centic]'s > [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], > I recently grabbed .docx files from the initial index that comes with that > code. I'll be adding these docs to our regular regression testing for > TIKA-1302. > While running Tika on these ~166k docs, ~30 of those files had an NPE in > XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1874) Fix rare npe in XWPFListManager
[ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1874: -- Description: Many thanks to [~centic]'s [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], I recently grabbed .docx files from the initial index that comes with that code. I'll be adding these docs to our regular regression testing for TIKA-1302. While running Tika on these ~166k docs, ~30 of those files had an NPE in XWPFListManager. We need to add a null check. was: Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently grabbed .docx files from the initial index that comes with that code. While running Tika on these ~166k docs, ~30 of those files had an NPE in XWPFListManager. We need to add a null check. > Fix rare npe in XWPFListManager > --- > > Key: TIKA-1874 > URL: https://issues.apache.org/jira/browse/TIKA-1874 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > Many thanks to [~centic]'s > [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload], > I recently grabbed .docx files from the initial index that comes with that > code. I'll be adding these docs to our regular regression testing for > TIKA-1302. > While running Tika on these ~166k docs, ~30 of those files had an NPE in > XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1874) Fix rare npe in XWPFListManager
[ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1874. --- Resolution: Fixed > Fix rare npe in XWPFListManager > --- > > Key: TIKA-1874 > URL: https://issues.apache.org/jira/browse/TIKA-1874 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently > grabbed .docx files from the initial index that comes with that code. > While running Tika on these ~166k docs, ~30 of those files had an NPE in > XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1874) Fix rare npe in XWPFListManager
Tim Allison created TIKA-1874: - Summary: Fix rare npe in XWPFListManager Key: TIKA-1874 URL: https://issues.apache.org/jira/browse/TIKA-1874 Project: Tika Issue Type: Bug Reporter: Tim Allison Priority: Trivial Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently grabbed .docx files from the initial index that comes with that code. While running Tika on these ~166k docs, ~30 of those files had an NPE in XWPFListManager. We need to add a null check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167296#comment-15167296 ] John Patrick commented on TIKA-1870: Added JavaDoc and Unit Test, although I'm assuming I've documented and tested it correctly. > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167231#comment-15167231 ] Nick Burch commented on TIKA-1865: -- IIRC it needs the "fixed length properties" support to be completed to be able to get out > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167211#comment-15167211 ] Tim Allison commented on TIKA-1865: --- Good to hear from you, [~lfcnassif]! I've only looked at this very briefly, but it looks like POI does not currently make the sender email address available. I think the best next step would be to figure out how to modify POI to make this info available. Any interest in looking into this? I did see that the email address exists _sometimes_ in the header {{From:}}, and we could pull it out via regex, but several of our test MSG files clearly have the sender email in the bytes but have no headers. > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
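The regex fallback mentioned in the comment above could be sketched in Java as follows. The pattern is a deliberately loose assumption, not a full RFC 5322 address parser, and the class and method names are hypothetical. As the comment notes, this approach only helps when the MSG file actually carries transport headers.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull the sender address out of a raw "From:" header line. */
final class FromHeaderSketch {
    // Finds a line starting with "From:" and captures the first address-like token on that line.
    // Case-insensitive and multiline; folded (continued) header lines are not handled.
    private static final Pattern FROM_ADDRESS = Pattern.compile(
            "(?im)^From:[^\\r\\n]*?([A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,})");

    /** Returns the sender address if a From: header with an address is present. */
    static Optional<String> senderAddress(String headers) {
        if (headers == null) {
            return Optional.empty();
        }
        Matcher m = FROM_ADDRESS.matcher(headers);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String headers = "Subject: hello\r\nFrom: Jane Doe <jane@example.org>\r\nTo: bob@example.org";
        System.out.println(senderAddress(headers).orElse("(no sender address in headers)"));
    }
}
```

Returning an Optional makes the "no headers" case explicit, which is exactly the situation described for several of the test MSG files.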
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167208#comment-15167208 ] Tim Allison commented on TIKA-1607: --- Aside from XMP, I can't think of an example where we'd have multiple DOMs of the same type (property name). For some (rare) PDF files, I could see having a DOM for XFA and one or more DOMs for XMP, but they'd be under different keys...in my current plan. I could also see someone modifying an existing parser to generate a DOM to this type of field, say, by translating what we're pulling out of the metadata for a multimedia file into pbcore. On the one hand, this is a hack on the way to your unified DOM proposal...basic users can get what they want from key/value, and advanced users who actually know a given standard can find what they need. On the other, this would allow advanced users to extract potentially conflicting metadata (one XMP packet has dc:creator X, but the update XMP packet has dc:creator Y...and we even have this in one of our test files :)). By following the XMP standard (iirc), the more recent packet information would overwrite the earlier. Some users will want the "standard" (dc:creator=Y); some advanced users might want "all" (dc:creator=X;Y). The initial motivation for giving access to the raw bytes...if we allow access to the raw bytes for a DOM, this could also allow super advanced users to run their own content stripping that might not care about slightly dodgy/invalid xml, and we already have an example of invalid XMP in one of our multimedia files. However, I'm persuaded that making "bytes" available could lead to disaster. 
> Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
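Tim's comment distinguishes a "standard" view, where a more recent XMP packet overwrites an earlier one, from an "all" view that keeps conflicting values. The Java sketch below illustrates the two merge policies on plain maps. Each packet is modelled as a flat property-to-value map, which is a simplifying assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of two ways to merge properties from several XMP packets. */
final class XmpPacketMergeSketch {

    /** "Standard" view: packets are applied in order, so a later value replaces an earlier one. */
    static Map<String, String> latestWins(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet);
        }
        return merged;
    }

    /** "All" view: every distinct value is kept, in packet order. */
    static Map<String, List<String>> keepAll(List<Map<String, String>> packets) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> e : packet.entrySet()) {
                List<String> values = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                if (!values.contains(e.getValue())) {
                    values.add(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

For two packets carrying dc:creator X and then dc:creator Y, `latestWins` yields Y while `keepAll` yields the list X, Y, matching the two user expectations described in the comment.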
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167135#comment-15167135 ] Ray Gauss II commented on TIKA-1607: I know there can be multiple XMP packets in a single file, but do we have many other examples where we'd need multiple DOMs associated with a single file? I'm trying to understand if the metadata is really the right place for this. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. 
I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
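To make the proposed "Metadata: String: Object" shape concrete, here is a minimal Java sketch of the phone number example from the issue description, built from plain collections. It does not use or change Tika's Metadata class; the nested map layout and the class and method names are assumptions, and the UK number type is carried over from the description as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed String-to-Object metadata shape. */
final class PhoneNumberMetadataSketch {

    /** Builds the example from the issue description: phone number -> attribute map. */
    static Map<String, Object> example() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");

        Map<String, Map<String, String>> numbers = new LinkedHashMap<>();
        numbers.put("+162648743476", first);
        numbers.put("+1292611054", second);

        // The top level maps a metadata name to an arbitrary Object instead of a String[].
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", numbers);
        return metadata;
    }

    /** Reads one attribute back; callers must check the runtime type before casting. */
    static String attribute(Map<String, Object> metadata, String number, String attribute) {
        Object value = metadata.get("phonenumbers");
        if (!(value instanceof Map)) {
            return null;
        }
        Object attrs = ((Map<?, ?>) value).get(number);
        if (!(attrs instanceof Map)) {
            return null;
        }
        Object result = ((Map<?, ?>) attrs).get(attribute);
        return result instanceof String ? (String) result : null;
    }
}
```

The `instanceof` checks in `attribute` hint at the backwards compatibility cost raised in the description: once values are arbitrary Objects, every consumer has to inspect types that String[] previously guaranteed.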
[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167123#comment-15167123 ] Nick Burch commented on TIKA-1855: -- Currently, we have most test documents in Tika Parsers, and a handful in Tika Core, which is sometimes confusing. We also end up with quite a lot of the unit tests for Tika Core actually being in the Tika Parsers test area, so that they can use the test documents in parsers which aren't in core. Based on my experiences with this (e.g. where I start putting things in the wrong module, initially can't find the right unit test, etc.), I find it non-ideal, and I suspect it's not intuitive at all for new contributors. For the Ogg Vorbis stuff I maintain, I've opted to put all of the test files needed in {{core/src/test/resources}}, then have the other maven modules (e.g. the Tika one and the Tools one) depend on the core-test artifact as a test-scope dependency in order for their unit tests to access the common set of test files. I find this actually works quite well now that I have it set up, and it seems OK for both InputStream and File based tests. So, given the above two, I would suggest that we put all of our test documents from core, parsers, server and bundle (all of which seem to have their own ones at the moment!) into a single artifact. We then depend on that artifact for all of our tests, with a test scope. > TIka 2.0 - Move shared test-code back to tika-core and distribute test files > to parser modules > -- > > Key: TIKA-1855 > URL: https://issues.apache.org/jira/browse/TIKA-1855 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Tim Allison > > Undo TIKA-1851, and divide test docs to appropriate parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed
[ https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167012#comment-15167012 ] Nick Burch commented on TIKA-1873: -- Interesting stuff! I'd skip most container-based formats, and especially OLE2 formats though. With OLE2 the only bit you can be sure of is the 512/4096 (1 block) header at the start, which basically says "I'm OLE2". After that, you can put the blocks in any order, so one file could have the first bit of word data starting at 513 bytes, another could have that as the last 512 bytes of the file, and both are valid! > Test Cases failed when tika-mimetypes.xml is changed > > > Key: TIKA-1873 > URL: https://issues.apache.org/jira/browse/TIKA-1873 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 >Reporter: Antriksh Saxena > Labels: test > > The test cases were failing when tika was built after updating the > tika-mimetypes.xml. The failure logs are as follows. > {code} > TestContainerAwareDetector.testTruncatedFiles:395 > expected: but was: > TestMimeTypes.testOLE2Detection:138->assertTypeByData:1045 > expected: but was: > TestMimeTypes.testOldExcel:251->assertTypeByData:1045 > expected: but was: > TestMimeTypes.testVisioDetection:305->assertTypeByNameAndData:1071 > expected: but was: > ExcelParserTest.testExcel95:320 expected: but > was: > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
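Nick's point is that only the header at the start of an OLE2 file is reliable for detection. A minimal Java sketch of such a header check follows. It tests the 8-byte OLE2 (Compound File Binary) signature D0 CF 11 E0 A1 B1 1A E1 and nothing more, and the class and method names are hypothetical rather than Tika's detector code.

```java
/** Hypothetical sketch: recognise the OLE2 container signature at the start of a file. */
final class Ole2HeaderSketch {
    // The 8-byte signature that opens every OLE2 / Compound File Binary header.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the buffer starts with the OLE2 signature. */
    static boolean isOle2(byte[] header) {
        if (header == null || header.length < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (header[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This only answers "is it OLE2?". Telling Word from Excel or Visio requires reading the container's directory entries, because, as the comment explains, the data blocks after the header can appear in any order and fixed-offset magic is therefore unreliable.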