[jira] [Created] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition

2016-02-25 Thread Manali Shah (JIRA)
Manali Shah created TIKA-1876:
-

 Summary: Integrate Natural Language Toolkit (NLTK) into Tika to 
perform Named Entity Recognition
 Key: TIKA-1876
 URL: https://issues.apache.org/jira/browse/TIKA-1876
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.13
Reporter: Manali Shah
 Fix For: 1.13


Hi all, 

Apache Tika already performs Named Entity Recognition (NER) using OpenNLP and 
Stanford CoreNLP. The Natural Language Toolkit (NLTK) is another open source 
Python library, and I believe it would be a great idea to integrate NLTK with 
Tika. 
NLTK can extract named entities as well as classify them. For this purpose I, 
along with Prof. Chris Mattmann, have published NLTKRest, a Python 
pip/setuptools-installable module that exposes NLTK as a REST service. 

I have tested Tika together with NLTKRest in my local repository and will soon 
submit a pull request. 
Link to rest server: https://github.com/manalishah/NLTKRest
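
As a rough illustration of how a Java client might talk to a REST-style NER service, the sketch below only builds the HTTP request. The URL, the content type, and the class and method names are placeholders chosen for this example; the message above does not describe NLTKRest's actual endpoint or payload format. It assumes Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch; not the Tika or NLTKRest API. */
public class NerRestClientSketch {

    /** Builds a POST request carrying raw text to an assumed NER endpoint. */
    public static HttpRequest buildNerRequest(String endpointUrl, String text) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL: the real host, port, and path depend on the deployed service.
        HttpRequest request =
                buildNerRequest("http://localhost:8080/ner", "Tika is an Apache project.");
        System.out.println(request.method() + " " + request.uri());
        // Sending requires a running service, for example:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Keeping request construction separate from sending makes the client easy to check without a live server.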



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] tika pull request: Tika 1875

2016-02-25 Thread prasadns14
GitHub user prasadns14 opened a pull request:

https://github.com/apache/tika/pull/78

Tika 1875

Updated netcdf mime type magic number
File - tika-mimetypes.xml

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/prasadns14/tika TIKA-1875

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #78


commit 3610d7db1ec6b75ca1927c49b97452333a236a0f
Author: Chris Mattmann 
Date:   2015-10-19T06:21:36Z

[maven-release-plugin]  copy for tag 1.11-rc1

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.11-rc1@1709359 
13f79535-47bb-0310-9956-ffa450edef68

commit 07413364e52299a9d8ac9585e7e3893ca92dea2f
Author: prasadns14 
Date:   2016-02-26T02:37:50Z

fix for TIKA-1875 contributed by prasadns14




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files

2016-02-25 Thread Prasad Nagaraj Subramanya (JIRA)
Prasad Nagaraj Subramanya created TIKA-1875:
---

 Summary: Updating tika-mimetypes.xml to detect .NC files 
 Key: TIKA-1875
 URL: https://issues.apache.org/jira/browse/TIKA-1875
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.12
Reporter: Prasad Nagaraj Subramanya
Priority: Minor
 Fix For: 1.11


Adding magic number to detect .NC files
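
The issue does not show which magic bytes were added to tika-mimetypes.xml. As a rough illustration of what magic-number detection checks, the sketch below tests a header against the signature of the classic NetCDF format: the ASCII bytes "CDF" followed by a version byte of 1 or 2. The class and method names are hypothetical, and this is not Tika's detection code.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: checks for the classic NetCDF signature. */
public class NetcdfMagic {

    /**
     * Returns true if the header starts with 'C' 'D' 'F' followed by
     * version byte 0x01 (classic) or 0x02 (64-bit offset).
     */
    public static boolean looksLikeClassicNetcdf(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        return header[0] == 'C' && header[1] == 'D' && header[2] == 'F'
                && (header[3] == 0x01 || header[3] == 0x02);
    }

    public static void main(String[] args) {
        byte[] sample = {'C', 'D', 'F', 0x01, 0, 0, 0, 0};
        System.out.println(looksLikeClassicNetcdf(sample));
        System.out.println(
                looksLikeClassicNetcdf("GIF89a".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In Tika itself this kind of rule is declared in the mime-types XML rather than in Java code; the method above only mirrors the byte comparison such a rule performs.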





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168291#comment-15168291
 ] 

Luis Filipe Nassif commented on TIKA-1865:
--

Also, what do you think about including the recipient names AND their email 
addresses in the MESSAGE_TO, MESSAGE_CC and MESSAGE_BCC metadata, so users 
can tell the recipient type (to, cc, bcc) of each address? That is not possible 
with the current approach, which puts all recipient addresses together in 
MESSAGE_RECIPIENT_ADDRESS.
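
A minimal sketch of that idea, using a plain multi-valued map in place of Tika's Metadata class. The "Name <address>" value format, the helper names, and the use of the property names as literal string keys are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; a plain map stands in for Tika's Metadata object. */
public class RecipientMetadataSketch {

    /** Formats one recipient as "Display Name <address>". */
    public static String format(String name, String address) {
        return name + " <" + address + ">";
    }

    /** Adds a formatted recipient under the given key, keeping multiple values per key. */
    public static void add(Map<String, List<String>> metadata,
                           String key, String name, String address) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(format(name, address));
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        add(metadata, "MESSAGE_TO", "Alice Example", "alice@example.org");
        add(metadata, "MESSAGE_CC", "Bob Example", "bob@example.org");
        System.out.println(metadata);
    }
}
```

With each address stored under its own recipient-type key, a consumer no longer has to guess whether an address was a To, Cc, or Bcc recipient.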

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. That is important information to 
> be extracted for search engines.





[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1857:
--
Attachment: govdocs1_xfas.zip

194 xfas from govdocs1 as exported with PDFBox 2.0 (trunk built from within the 
last few weeks).

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Pascal Essiembre
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA





[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168075#comment-15168075
 ] 

Nick Burch commented on TIKA-1855:
--

I'm not actually sure we need to do the unzipping thing. I think that most of 
the unit tests that need to check for {{File}} vs {{InputStream}} differences 
can be / should be / are in Tika Core. If we put the test documents under the 
Tika Core resources folder (and hence in the tika-core-tests jar), those tests 
can access them as either. The handful of other tests elsewhere that need files 
can use the existing helpers (maybe nicer wrapped) to get their 1 or 2 files 
spooled out to a temp File for File checking.

We generally tell off anyone adding very large test files, and that's worked 
fairly well so far, even with tika-parsers/src/test/resources/test-documents 
working as our de facto "dumping ground" :)

> TIka 2.0 - Move shared test-code back to tika-core and distribute test files 
> to parser modules
> --
>
> Key: TIKA-1855
> URL: https://issues.apache.org/jira/browse/TIKA-1855
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Undo TIKA-1851, and divide test docs to appropriate parser modules.





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167988#comment-15167988
 ] 

Luis Filipe Nassif commented on TIKA-1865:
--

Hi [~talli...@apache.org]!

I think MAPIMessage.getMainChunks().emailFromChunk already has that info, or 
does it not in all cases? It worked with my small corpus.



> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. That is important information to 
> be extracted for search engines.





[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167891#comment-15167891
 ] 

Ken Krugler commented on TIKA-1855:
---

The things I don't like about this approach are that (a) core becomes a dumping 
ground for everyone's test data, and (b) it couples module development with the 
core. Plus I'm waiting for the next crazy parser to be added that has 100MB of 
binary test data, which will create an el grande jar that everybody is going to 
be unzipping. So I guess I'd add scalability as another concern.

I haven't looked into where test files wind up, but I'd suspect that many of 
the core tests that wind up needing to be in parsers due to data dependencies 
aren't really the tests that should be run in core. I can see mime-type 
detection being an example of wanting to have one of each, and (maybe) some of 
the app/server tests, so I'd be fine with having a tika-test-corpus (or 
whatever you want to call it) that has a good sampling of docs which are used 
in these situations.

Finally, to make myself really popular, I'd prefer that we use the jar as a 
test dependency (vs. zip/unzip), and for cases where we need to have an actual 
file then use some utility code to extract/create the file.
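
The "utility code to extract/create the file" could look roughly like the sketch below, which spools any InputStream (for example a classpath resource from a test jar) to a temporary file. The class name, method name, and temp-file prefix are hypothetical; Tika's existing helpers may do this differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical test utility, not an existing Tika class. */
public class TestFileSpooler {

    /** Copies a stream to a temp file and returns the file's path. */
    public static Path spoolToTempFile(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("tika-test-", suffix);
        tmp.toFile().deleteOnExit();
        // createTempFile already created the file, so the copy must replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // In a real test the stream would come from getResourceAsStream(...) on the test jar.
        try (InputStream in =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8))) {
            Path file = spoolToTempFile(in, ".txt");
            System.out.println(file + " holds " + Files.size(file) + " bytes");
        }
    }
}
```

This keeps the test documents inside a jar dependency and only materializes the few files that a File-based test actually needs.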

Maybe we should have a Skype chat to discuss VF2F :)

> TIka 2.0 - Move shared test-code back to tika-core and distribute test files 
> to parser modules
> --
>
> Key: TIKA-1855
> URL: https://issues.apache.org/jira/browse/TIKA-1855
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Undo TIKA-1851, and divide test docs to appropriate parser modules.





Re: parallel dev on trunk and 2.x?

2016-02-25 Thread Mattmann, Chris A (3980)
+1 I haven’t fully moved over to 2.x yet b/c I haven’t honestly
had time to catch up. I suppose after my class in May I will have
time to catch up then and I can focus more on 2.x then. But for me
I am doing all my work in 1.x now so keeping up to date would be
great.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, February 25, 2016 at 12:50 PM
To: "dev@tika.apache.org" 
Subject: parallel dev on trunk and 2.x?

>All,
>  Do I understand correctly that we should be committing most changes to
>both trunk and 2.x?  Obviously, the 2.x commits are for 2.x. :)
>  Or will merge really, actually, truly work at some point in the future
>to merge changes in trunk to 2.x?
>
>Best,
>
>   Tim
>
>-Original Message-
>From: Hudson (JIRA) [mailto:j...@apache.org]
>Sent: Thursday, February 25, 2016 1:41 PM
>To: dev@tika.apache.org
>Subject: [jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager
>
>
>[ 
>https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620
> ] 
>
>Hudson commented on TIKA-1874:
>--
>
>SUCCESS: Integrated in tika-2.x #31 (See
>[https://builds.apache.org/job/tika-2.x/31/])
>TIKA-1874 fix small npe (tallison: rev
>5083cc11c6230218ecef7d0161fa92bbf8d317e6)
>* 
>tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
>
>
>> Fix rare npe in XWPFListManager
>> ---
>>
>> Key: TIKA-1874
>> URL: https://issues.apache.org/jira/browse/TIKA-1874
>> Project: Tika
>>  Issue Type: Bug
>>Reporter: Tim Allison
>>Priority: Trivial
>>
>> Many thanks to [~centic]'s
>>[CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>> I recently grabbed .docx files from the initial index that
>>comes with that code.  I'll be adding these docs to our regular
>>regression testing for TIKA-1302.
>> While running Tika on these ~166k docs, ~30 of those files had an NPE
>>in XWPFListManager.  We need to add a null check.
>
>
>



parallel dev on trunk and 2.x?

2016-02-25 Thread Allison, Timothy B.
All,
  Do I understand correctly that we should be committing most changes to both 
trunk and 2.x?  Obviously, the 2.x commits are for 2.x. :)  
  Or will merge really, actually, truly work at some point in the future to 
merge changes in trunk to 2.x?

Best,

   Tim

-Original Message-
From: Hudson (JIRA) [mailto:j...@apache.org] 
Sent: Thursday, February 25, 2016 1:41 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager


[ 
https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620
 ] 

Hudson commented on TIKA-1874:
--

SUCCESS: Integrated in tika-2.x #31 (See 
[https://builds.apache.org/job/tika-2.x/31/])
TIKA-1874 fix small npe (tallison: rev 5083cc11c6230218ecef7d0161fa92bbf8d317e6)
* 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java


> Fix rare npe in XWPFListManager
> ---
>
> Key: TIKA-1874
> URL: https://issues.apache.org/jira/browse/TIKA-1874
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> Many thanks to [~centic]'s 
> [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>  I recently grabbed .docx files from the initial index that comes with that 
> code.  I'll be adding these docs to our regular regression testing for 
> TIKA-1302.
> While running Tika on these ~166k docs, ~30 of those files had an NPE in 
> XWPFListManager.  We need to add a null check.





[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167640#comment-15167640
 ] 

Hudson commented on TIKA-1870:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #915 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/915/])
TIKA-1870 refactor RichTextContentHandler into tika-core from (nhoj.patrick: 
rev 0bd05cec54c581c971d90380304aaa23c9543296)
* 
tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* tika-core/src/main/java/org/apache/tika/sax/RichTextContentHandler.java
* tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* tika-server/src/main/java/org/apache/tika/server/RichTextContentHandler.java
TIKA-1870 JavaDoc and Test coverage for RichTextContentHandler that 
(nhoj.patrick: rev 3b7922db1a2e72181e1a00168d2aee33bfe1d4a3)
* tika-core/src/test/java/org/apache/tika/sax/RichTextContentHandlerTest.java
* tika-core/src/main/java/org/apache/tika/sax/RichTextContentHandler.java
TIKA-1870 Move RichTextContentHandler from Server to Core, contributed (nick: 
rev ed762b702875c843d0322b8ba6d05385ca91875d)
* CHANGES.txt


> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> linked to TIKA-1868, different solution by refactoring class into tika-core 
> so don't need to depend upon tika-server and changing other classes used to 
> custom ones or other alternatives.





[jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager

2016-02-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620
 ] 

Hudson commented on TIKA-1874:
--

SUCCESS: Integrated in tika-2.x #31 (See 
[https://builds.apache.org/job/tika-2.x/31/])
TIKA-1874 fix small npe (tallison: rev 5083cc11c6230218ecef7d0161fa92bbf8d317e6)
* 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java


> Fix rare npe in XWPFListManager
> ---
>
> Key: TIKA-1874
> URL: https://issues.apache.org/jira/browse/TIKA-1874
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> Many thanks to [~centic]'s 
> [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>  I recently grabbed .docx files from the initial index that comes with that 
> code.  I'll be adding these docs to our regular regression testing for 
> TIKA-1302.
> While running Tika on these ~166k docs, ~30 of those files had an NPE in 
> XWPFListManager.  We need to add a null check.





[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167570#comment-15167570
 ] 

ASF GitHub Bot commented on TIKA-1870:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/77


> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> linked to TIKA-1868, different solution by refactoring class into tika-core 
> so don't need to depend upon tika-server and changing other classes used to 
> custom ones or other alternatives.





[jira] [Resolved] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-25 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1870.
--
Resolution: Fixed

Thanks for preparing patches for all this work. Merged and pushed!

> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> linked to TIKA-1868, different solution by refactoring class into tika-core 
> so don't need to depend upon tika-server and changing other classes used to 
> custom ones or other alternatives.





[GitHub] tika pull request: Refector RichTextContentHandler for TIKA-1870 c...

2016-02-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/77




[jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager

2016-02-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167550#comment-15167550
 ] 

Hudson commented on TIKA-1874:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #914 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/914/])
TIKA-1874 fix potential npe (tallison: rev 
0c030081bba17e607f8c79a0b95f72935be93efd)
* 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java


> Fix rare npe in XWPFListManager
> ---
>
> Key: TIKA-1874
> URL: https://issues.apache.org/jira/browse/TIKA-1874
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> Many thanks to [~centic]'s 
> [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>  I recently grabbed .docx files from the initial index that comes with that 
> code.  I'll be adding these docs to our regular regression testing for 
> TIKA-1302.
> While running Tika on these ~166k docs, ~30 of those files had an NPE in 
> XWPFListManager.  We need to add a null check.





[jira] [Updated] (TIKA-1874) Fix rare npe in XWPFListManager

2016-02-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1874:
--
Description: 
Many thanks to [~centic]'s 
[CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
 I recently grabbed .docx files from the initial index that comes with that 
code.  I'll be adding these docs to our regular regression testing for 
TIKA-1302.

While running Tika on these ~166k docs, ~30 of those files had an NPE in 
XWPFListManager.  We need to add a null check.

  was:
Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently grabbed 
.docx files from the initial index that comes with that code.

While running Tika on these ~166k docs, ~30 of those files had an NPE in 
XWPFListManager.  We need to add a null check.


> Fix rare npe in XWPFListManager
> ---
>
> Key: TIKA-1874
> URL: https://issues.apache.org/jira/browse/TIKA-1874
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> Many thanks to [~centic]'s 
> [CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>  I recently grabbed .docx files from the initial index that comes with that 
> code.  I'll be adding these docs to our regular regression testing for 
> TIKA-1302.
> While running Tika on these ~166k docs, ~30 of those files had an NPE in 
> XWPFListManager.  We need to add a null check.





[jira] [Resolved] (TIKA-1874) Fix rare npe in XWPFListManager

2016-02-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1874.
---
Resolution: Fixed

> Fix rare npe in XWPFListManager
> ---
>
> Key: TIKA-1874
> URL: https://issues.apache.org/jira/browse/TIKA-1874
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently 
> grabbed .docx files from the initial index that comes with that code.
> While running Tika on these ~166k docs, ~30 of those files had an NPE in 
> XWPFListManager.  We need to add a null check.





[jira] [Created] (TIKA-1874) Fix rare npe in XWPFListManager

2016-02-25 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1874:
-

 Summary: Fix rare npe in XWPFListManager
 Key: TIKA-1874
 URL: https://issues.apache.org/jira/browse/TIKA-1874
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Trivial


Many thanks to [~centic]'s CommonCrawlDocumentDownload code, I recently grabbed 
.docx files from the initial index that comes with that code.

While running Tika on these ~166k docs, ~30 of those files had an NPE in 
XWPFListManager.  We need to add a null check.





[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-25 Thread John Patrick (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167296#comment-15167296
 ] 

John Patrick commented on TIKA-1870:


Added JavaDoc and Unit Test, although I'm assuming I've documented and tested 
it correctly.

> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> linked to TIKA-1868, different solution by refactoring class into tika-core 
> so don't need to depend upon tika-server and changing other classes used to 
> custom ones or other alternatives.





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167231#comment-15167231
 ] 

Nick Burch commented on TIKA-1865:
--

IIRC it needs the "fixed length properties" support to be completed before we 
can get that value out.

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. That is important information to 
> be extracted for search engines.





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167211#comment-15167211
 ] 

Tim Allison commented on TIKA-1865:
---

Good to hear from you, [~lfcnassif]!

I've only looked at this very briefly, but it looks like POI does not currently 
make the sender email address available.  I think the best next step would be 
to figure out how to modify POI to make this info available.  Any interest in 
looking into this?

I did see that the email address exists _sometimes_ in the header {{From:}}, 
and we could pull it out via regex, but several of our test MSG files clearly 
have the sender email in the bytes but have no headers.
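
A minimal sketch of the regex fallback mentioned above, assuming the message headers are available as one string. The pattern is deliberately simple and is not a full RFC 5322 address parser; the class and method names are hypothetical. As noted, it cannot help for MSG files that carry no headers at all.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not Tika or POI code. */
public class FromHeaderSketch {

    // Matches "From:" at the start of a line and captures the first token on that
    // line that looks like local@domain.
    private static final Pattern FROM_ADDRESS = Pattern.compile(
            "^From:.*?([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+)",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    /** Returns the sender address from the From: header line, if one is present. */
    public static Optional<String> senderAddress(String headers) {
        if (headers == null) {
            return Optional.empty();
        }
        Matcher m = FROM_ADDRESS.matcher(headers);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String headers = "To: x@example.org\n"
                + "From: Jane Doe <jane@example.org>\n"
                + "Subject: hi";
        System.out.println(senderAddress(headers).orElse("(no sender address found)"));
    }
}
```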


> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. That is important information to 
> be extracted for search engines.





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167208#comment-15167208
 ] 

Tim Allison commented on TIKA-1607:
---

Aside from XMP, I can't think of an example where we'd have multiple DOMs of 
the same type (property name).  For some (rare) PDF files, I could see having a 
DOM for XFA and one or more DOMs for XMP, but they'd be under different 
keys...in my current plan.

I could also see someone modifying an existing parser to generate a DOM to this 
type of field, say, by translating what we're pulling out of the metadata for a 
multimedia file into pbcore.


On the one hand, this is a hack on the way to your unified DOM proposal...basic 
users can get what they want from key/value, and advanced users who actually 
know a given standard can find what they need.

On the other, this would allow advanced users to extract potentially 
conflicting metadata (one XMP packet has dc:creator X, but the updated XMP 
packet has dc:creator Y...and we even have this in one of our test files :)).  
By following the XMP standard (iirc), the more recent packet information would 
overwrite the earlier.  Some users will want the "standard" (dc:creator=Y); 
some advanced users might want "all" (dc:creator=X;Y).
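
The two views described above can be sketched as follows. Each XMP packet is modeled as a flat key/value map, which is a simplification of real XMP (RDF/XML) made only for illustration; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; packets are simplified to flat string maps. */
public class XmpPacketMergeSketch {

    /** "Standard" view: later packets overwrite earlier values for the same key. */
    public static Map<String, String> latestWins(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet);
        }
        return merged;
    }

    /** "All" view: every value is kept, in packet order. */
    public static Map<String, List<String>> keepAll(List<Map<String, String>> packets) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            packet.forEach((key, value) ->
                    merged.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> packets = List.of(
                Map.of("dc:creator", "X"),   // earlier packet
                Map.of("dc:creator", "Y"));  // updated packet
        System.out.println("standard: " + latestWins(packets));
        System.out.println("all:      " + keepAll(packets));
    }
}
```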


The initial motivation for giving access to the raw bytes...if we allow access 
to the raw bytes for a DOM, this could also allow super advanced users to run 
their own content stripping that might not care about slightly dodgy/invalid 
xml, and we already have an example of invalid XMP in one of our multimedia 
files.

However, I'm persuaded that making "bytes" available could lead to disaster.


> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
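
One possible reading of the proposal quoted above, sketched with plain Java collections rather than Tika's Metadata class: the value stored under "phonenumbers" is a list of attribute maps instead of a String[]. The "number" key, the sample numbers, and the class and method names are hypothetical; the LibPN-* keys come from the issue text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of object-valued metadata; not the Tika Metadata API. */
public class StructuredMetadataSketch {

    /** One phone number with its attributes, as a small string-to-string map. */
    public static Map<String, String> phoneNumber(
            String number, String countryCode, String numberType) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("number", number);
        entry.put("LibPN-CountryCode", countryCode);
        entry.put("LibPN-NumberType", numberType);
        return entry;
    }

    /** Reads the country codes back; the unchecked cast is the cost of Object values. */
    @SuppressWarnings("unchecked")
    public static List<String> countryCodes(Map<String, Object> metadata) {
        List<String> codes = new ArrayList<>();
        Object value = metadata.get("phonenumbers");
        if (value instanceof List) {
            for (Map<String, String> entry : (List<Map<String, String>>) value) {
                codes.add(entry.get("LibPN-CountryCode"));
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        List<Map<String, String>> numbers = new ArrayList<>();
        numbers.add(phoneNumber("+15550100", "US", "International"));
        numbers.add(phoneNumber("+445550100", "UK", "International"));
        metadata.put("phonenumbers", numbers);
        System.out.println(countryCodes(metadata));
    }
}
```

The cast in countryCodes shows the backwards-compatibility concern raised in the issue: callers that expect String values must now know the shape of each Object.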





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167135#comment-15167135
 ] 

Ray Gauss II commented on TIKA-1607:


I know there can be multiple XMP packets in a single file, but do we have many 
other examples where we'd need multiple DOMs associated with a single file?

I'm trying to understand if the metadata is really the right place for this.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection of HashMaps, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the String: Object mapping, however, is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
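
For illustration only, the "String: Object" idea in the description above could be sketched in Java roughly as follows. This is not code from any of the attached patches: the class name ObjectMetadataSketch and its methods are hypothetical, and the nesting (a List of maps from phone number to attribute map) is just one reading of the example in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a metadata store whose values are Objects rather than String[]. */
final class ObjectMetadataSketch {
    // Assumption: keys stay Strings, values become arbitrary Objects.
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String name, Object value) {
        values.put(name, value);
    }

    Object get(String name) {
        return values.get(name);
    }

    /** Builds one phone-number entry: number -> (attribute name -> attribute value). */
    static Map<String, Map<String, String>> phoneNumber(
            String number, String countryCode, String numberType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    /** Stores the two example numbers from the issue description under "phonenumbers". */
    static ObjectMetadataSketch example() {
        List<Map<String, Map<String, String>>> numbers = new ArrayList<>();
        numbers.add(phoneNumber("+162648743476", "US", "International"));
        numbers.add(phoneNumber("+1292611054", "UK", "International"));
        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        return metadata;
    }
}
```

Callers of such a store would have to cast or type-check whatever get() returns, which is one concrete form of the backwards compatibility concern raised in the description.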





[jira] [Commented] (TIKA-1855) Tika 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167123#comment-15167123
 ] 

Nick Burch commented on TIKA-1855:
--

Currently, we have most test documents in Tika Parsers, and a handful in Tika 
Core, which is sometimes confusing. We also end up with quite a lot of the unit 
tests for Tika Core actually being in the Tika Parsers test area, so that they 
can use the test documents in parsers which aren't in core. Based on my 
experiences with this (e.g. where I start putting things in the wrong module, 
initially can't find the right unit test etc), I find it non-ideal, and I 
suspect it's not intuitive at all for new contributors.

For the Ogg Vorbis stuff I maintain, I've opted to put all of the test files 
needed in {{core/src/test/resources}}, then have the other Maven modules (e.g. 
the Tika one and the Tools one) depend on the core-test artifact as a test-scope 
dependency in order for their unit tests to access the common set of test 
files. I find this actually works quite well, now I have it set up, and it 
seems ok for both InputStream and File based tests.

So, given the above two, I would suggest that we put all of our test documents 
from core, parsers, server and bundle (all of which seem to have their own ones 
at the moment!) into a single artifact. We then depend on that artifact for all 
of our tests, with a test scope.
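
As a rough sketch of how unit tests might read documents from such a shared test artifact once it is on the test classpath, for both InputStream and File based tests: the class name SharedTestResourcesSketch and its methods are hypothetical and not part of Tika. The copy to a temporary file is an assumption about how File based tests would cope with resources packaged inside a jar.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for reading shared test documents from the classpath. */
final class SharedTestResourcesSketch {

    /** Opens a classpath resource, resolved relative to the anchor class. */
    static InputStream open(Class<?> anchor, String resourcePath) {
        InputStream in = anchor.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new UncheckedIOException(
                    new IOException("Test resource not found: " + resourcePath));
        }
        return in;
    }

    /** Copies a classpath resource to a temp file, for tests that need a real file. */
    static Path copyToTempFile(Class<?> anchor, String resourcePath) {
        try (InputStream in = open(anchor, resourcePath)) {
            Path tmp = Files.createTempFile("test-doc-", ".bin");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            tmp.toFile().deleteOnExit(); // clean up when the test JVM exits
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the lookup goes through the classpath, the same call works whether the documents sit in the module's own test resources or arrive through a test-scope dependency.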

> Tika 2.0 - Move shared test-code back to tika-core and distribute test files 
> to parser modules
> --
>
> Key: TIKA-1855
> URL: https://issues.apache.org/jira/browse/TIKA-1855
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Undo TIKA-1851, and divide test docs to appropriate parser modules.





[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167012#comment-15167012
 ] 

Nick Burch commented on TIKA-1873:
--

Interesting stuff! I'd skip most container-based formats, and especially OLE2 
formats though. With OLE2 the only bit you can be sure of is the 512/4096 (1 
block) header at the start, which basically says "I'm OLE2". After that, you 
can put the blocks in any order, so one file could have the first bit of Word 
data starting at 513 bytes, another could have that as the last 512 bytes of 
the file, and both are valid!
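
To make the point about the header concrete, here is a minimal sketch of what a magic-based check can actually establish for OLE2. The class Ole2HeaderSketch is hypothetical and is not Tika's detector. The 8-byte signature and the sector shift field at offset 30 (9 for 512-byte sectors, 12 for 4096-byte sectors) come from the published Compound File Binary format rather than from this thread.

```java
import java.util.Arrays;

/** Hypothetical sketch: the only facts the start of an OLE2 file reliably gives. */
final class Ole2HeaderSketch {
    // OLE2 / Compound File Binary signature at offset 0.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the buffer starts with the OLE2 signature. */
    static boolean isOle2(byte[] head) {
        return head != null
                && head.length >= SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(head, SIGNATURE.length), SIGNATURE);
    }

    /** Returns the sector size (512 or 4096), or -1 if it cannot be determined. */
    static int sectorSize(byte[] head) {
        if (!isOle2(head) || head.length < 32) {
            return -1;
        }
        // Sector shift is a little-endian 16-bit value at offset 30.
        int shift = (head[30] & 0xFF) | ((head[31] & 0xFF) << 8);
        return (shift == 9 || shift == 12) ? (1 << shift) : -1;
    }
}
```

Nothing in these bytes says whether the container holds Word, Excel or Visio data, so a check like this can only report generic OLE2.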

> Test Cases failed when tika-mimetypes.xml is changed
> 
>
> Key: TIKA-1873
> URL: https://issues.apache.org/jira/browse/TIKA-1873
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Antriksh Saxena
>  Labels: test
>
> The test cases were failing when Tika was built after updating 
> tika-mimetypes.xml. The failure logs are as follows.
> {code}
> TestContainerAwareDetector.testTruncatedFiles:395 
> expected: but was:
>   TestMimeTypes.testOLE2Detection:138->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testOldExcel:251->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testVisioDetection:305->assertTypeByNameAndData:1071 
> expected: but was:
>   ExcelParserTest.testExcel95:320 expected: but 
> was:
> {code}


