[jira] [Commented] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171368#comment-15171368
 ] 

ASF GitHub Bot commented on TIKA-1876:
--

GitHub user manalishah opened a pull request:

https://github.com/apache/tika/pull/80

Integrate NLTK with Tika fix for TIKA-1876 contributed by manalishah



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manalishah/tika TIKA-1876

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/80.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #80


commit c809690ec87ffa600018dbc5eee6d6756645adb0
Author: manali 
Date:   2016-02-27T03:58:06Z

fix for TIKA-1876 contributed by manalishah

commit 3a7e24c9a5d77ae41bde0c2106233a2064c5e707
Author: manali 
Date:   2016-02-27T04:00:05Z

fix for TIKA-1876 contributed by manalishah

commit 114d0ff24bd04395852012a3382d50c3e906e6db
Author: manali 
Date:   2016-02-27T04:06:20Z

fix for TIKA-1876 contributed by manalishah

commit cdb684d9c1b0ebb01a783180f07417760fa04d6f
Author: manali 
Date:   2016-02-27T10:10:06Z

fix for TIKA-1876 contributed by manalishah




> Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity 
> Recognition
> ---
>
>                 Key: TIKA-1876
>                 URL: https://issues.apache.org/jira/browse/TIKA-1876
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Manali Shah
>             Fix For: 1.13
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi all, 
> Apache Tika already performs Named Entity Recognition using OpenNLP and 
> Stanford CoreNLP. Natural Language Toolkit (NLTK) is another open source 
> Python library, and I believe it would be a great idea to integrate NLTK 
> with Tika. 
> NLTK can extract named entities as well as classify them. For this purpose 
> I, along with Prof. Chris Mattmann, have published NLTKRest, a Python 
> pip/setuptools installable module that exposes NLTK as a REST service. 
> I have tested Tika together with NLTKRest in my local repository 
> and will soon submit a pull request. 
> Link to the REST server: https://github.com/manalishah/NLTKRest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-28 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171217#comment-15171217
 ] 

Luis Filipe Nassif commented on TIKA-1824:
--

Well, a PDF can also be an attachment, and office documents can be inside a 
zip file, yet PDF and zip each have their own module. So I think it is OK to 
create an email module.

> Tika 2.0 -  Create Initial Parser Modules
> -
>
>                 Key: TIKA-1824
>                 URL: https://issues.apache.org/jira/browse/TIKA-1824
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 2.0
>            Reporter: Bob Paulin
>            Assignee: Bob Paulin
>
> Create an initial breakdown of parser modules.





[jira] [Commented] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle

2016-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171141#comment-15171141
 ] 

Hudson commented on TIKA-1860:
--

SUCCESS: Integrated in tika-2.x #40 (See 
[https://builds.apache.org/job/tika-2.x/40/])
TIKA-1860 - Added license. (bob: rev 589f64125a206263319b33cec820f12c15f0e06f)
* tika-parser-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/module/crypto/internal/Activator.java
* tika-parser-modules/tika-parser-code-module/src/main/java/org/apache/tika/module/code/internal/Activator.java
* tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/module/cad/internal/Activator.java
* tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/module/advanced/internal/Activator.java


> Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
> 
>
>                 Key: TIKA-1860
>                 URL: https://issues.apache.org/jira/browse/TIKA-1860
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Bob Paulin
>            Assignee: Bob Paulin
>
> Create a replacement for the OSGi tika-bundle project out of the new 
> tika-parser-* modules





[jira] [Comment Edited] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-02-28 Thread Namitha Sanjeeva Ganiga (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170969#comment-15170969
 ] 

Namitha Sanjeeva Ganiga edited comment on TIKA-1877 at 2/28/16 8:39 AM:


I also noticed this issue. 
From the descriptions of the .fits file format, it looks like there are 20 
spaces from "=" to "T". The Tika parser currently behaves the same way.


If we check the file that has been classified as octet-stream, we see that 
there are 16 spaces between "=" and "T". (That is why it is classified 
as octet-stream and not application/fits.)

The question, then, is whether files with 16 spaces (like the ones attached) 
need to be classified as application/fits, since such a file is similar to 
a fits file that is already classified as application/fits.

Reference :
http://fits.gsfc.nasa.gov/standard30/fits_standard30.pdf
https://tools.ietf.org/html/rfc4047
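
To make the question concrete, here is a minimal Python sketch (not Tika code, 
and not how Tika's detector is implemented) that compares a strict check with 
the 20-space layout against a more lenient one. The names STRICT_MAGIC, 
LENIENT_MAGIC and classify are hypothetical, and the lenient pattern is only 
an assumption about what a relaxed rule could look like.

```python
import re

# Strict signature as described above: "SIMPLE  =" followed by 20 spaces
# and then "T", so that "T" lands in column 30 of the header card.
STRICT_MAGIC = b"SIMPLE  =" + b" " * 20 + b"T"

# Hypothetical lenient rule: any number of spaces between "=" and "T".
LENIENT_MAGIC = re.compile(rb"SIMPLE  = *T")


def classify(header: bytes) -> str:
    """Report which rule, if any, matches the start of the header."""
    if header.startswith(STRICT_MAGIC):
        return "strict"
    if LENIENT_MAGIC.match(header):
        return "lenient"
    return "none"


if __name__ == "__main__":
    standard = b"SIMPLE  =" + b" " * 20 + b"T"
    attached_like = b"SIMPLE  =" + b" " * 16 + b"T"
    print(classify(standard))
    print(classify(attached_like))
```

A header with 16 spaces only passes the lenient rule, which is the case the 
question above is about.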



was (Author: gan...@usc.edu):
I also noticed this issue. 
From the descriptions of the .fits file format, it looks like there are 20 
spaces from "=" to "T". The Tika parser currently behaves the same way.


If we check the file that has been classified as octet-stream, we see that 
there are 16 spaces between "=" and "T". (That is why it is classified 
as octet-stream and not application/fits.)

The question, then, is whether files with 16 spaces (like the ones attached) 
need to be classified as application/fits.

Reference :
http://fits.gsfc.nasa.gov/standard30/fits_standard30.pdf
https://tools.ietf.org/html/rfc4047


> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
>                 Key: TIKA-1877
>                 URL: https://issues.apache.org/jira/browse/TIKA-1877
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>            Reporter: Prasad Nagaraj Subramanya
>            Priority: Minor
>         Attachments: 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for the .fts file format in tika-mimetypes.xml is 
> "SIMPLE  =                    T".
> Tika detected a .fts file as application/octet-stream. On checking the 
> header, I found the value to be "SIMPLE  =                T" (just 16 spaces 
> between = and T).
> I tried the following changes:
> Change 1) Updated the existing match value. But the build failed.
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now Tika returns an empty value. It identifies the file neither as .fts 
> nor as application/octet-stream.





[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-02-28 Thread Namitha Sanjeeva Ganiga (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170969#comment-15170969
 ] 

Namitha Sanjeeva Ganiga commented on TIKA-1877:
---

I also noticed this issue. 
From the descriptions of the .fits file format, it looks like there are 20 
spaces from "=" to "T". The Tika parser currently behaves the same way.


If we check the file that has been classified as octet-stream, we see that 
there are 16 spaces between "=" and "T". (That is why it is classified 
as octet-stream and not application/fits.)

The question, then, is whether files with 16 spaces (like the ones attached) 
need to be classified as application/fits.

Reference :
http://fits.gsfc.nasa.gov/standard30/fits_standard30.pdf
https://tools.ietf.org/html/rfc4047




