[jira] [Commented] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171368#comment-15171368
 ] 

ASF GitHub Bot commented on TIKA-1876:
--

GitHub user manalishah opened a pull request:

https://github.com/apache/tika/pull/80

Integrate NLTK with Tika fix for TIKA-1876 contributed by manalishah



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manalishah/tika TIKA-1876

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/80.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #80


commit c809690ec87ffa600018dbc5eee6d6756645adb0
Author: manali 
Date:   2016-02-27T03:58:06Z

fix for TIKA-1876 contributed by manalishah

commit 3a7e24c9a5d77ae41bde0c2106233a2064c5e707
Author: manali 
Date:   2016-02-27T04:00:05Z

fix for TIKA-1876 contributed by manalishah

commit 114d0ff24bd04395852012a3382d50c3e906e6db
Author: manali 
Date:   2016-02-27T04:06:20Z

fix for TIKA-1876 contributed by manalishah

commit cdb684d9c1b0ebb01a783180f07417760fa04d6f
Author: manali 
Date:   2016-02-27T10:10:06Z

fix for TIKA-1876 contributed by manalishah




> Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity 
> Recognition
> ---
>
> Key: TIKA-1876
> URL: https://issues.apache.org/jira/browse/TIKA-1876
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
>Reporter: Manali Shah
> Fix For: 1.13
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi all, 
> Apache Tika already performs Named Entity Recognition (NER) using OpenNLP and 
> Stanford CoreNLP. The Natural Language Toolkit (NLTK) is another open source 
> Python library, and I believe it would be a great idea to integrate NLTK 
> with Tika. 
> NLTK can extract named entities as well as classify them. For this purpose, I, 
> along with Prof. Chris Mattmann, have published NLTKRest, a Python pip/setuptools 
> installable module that exposes NLTK as a REST service. 
> I have tested Tika together with NLTKRest in my local repository 
> and will soon submit a pull request. 
> Link to the REST server: https://github.com/manalishah/NLTKRest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167570#comment-15167570
 ] 

ASF GitHub Bot commented on TIKA-1870:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/77


> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868. This is a different solution: refactor the class into 
> tika-core so there is no need to depend on tika-server, and change the other 
> classes used to custom ones or other alternatives.





[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-02-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172890#comment-15172890
 ] 

ASF GitHub Bot commented on TIKA-1877:
--

GitHub user prasadns14 opened a pull request:

https://github.com/apache/tika/pull/81

fix for TIKA-1877 contributed by prasadns14

Updated the tika-mimetypes.xml
Also, added a new .fits file to test-documents and created a unit test too.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/prasadns14/tika TIKA-1877

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/81.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #81


commit 602d237feec48bfd97bc2b2b38ea614b1ae2c55d
Author: prasadns14 
Date:   2016-02-29T23:03:13Z

fix for TIKA-1877 contributed by prasadns14




> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
> Key: TIKA-1877
> URL: https://issues.apache.org/jira/browse/TIKA-1877
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
> Attachments: 
> 3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079, 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE  =  
>   T".
> Tika detected a .fts file as application/octet-stream. On verifying the 
> header, I found the value to be "SIMPLE  =T" (just 16 spaces 
> between = and T).
> I tried the following changes:
> Change 1) Updated the existing match value, but the build failed.
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now Tika returns an empty value. It identifies the file neither as .fts nor 
> as application/octet-stream.
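
The mismatch described above comes down to how many spaces sit between "=" and "T" in the first header card. The following is a minimal Python sketch of a whitespace-tolerant check. It is not Tika's matching code: the name `looks_like_fits` and the regular expression are assumptions made for illustration, and the exact space count in real files is deliberately left open.

```python
import re

# Hypothetical helper, not part of Tika: accept any run of one or more
# spaces between "=" and "T" instead of a fixed-width literal.
FITS_HEADER = re.compile(rb"SIMPLE  = +T")


def looks_like_fits(data: bytes) -> bool:
    """Return True when the data starts with a 'SIMPLE  = ... T' card."""
    # re.match anchors at offset 0, which mirrors a magic match at offset="0".
    return FITS_HEADER.match(data) is not None
```

A fixed literal fails as soon as the spacing differs by one byte, which is one plausible reading of the behaviour reported here; the tolerant pattern trades a little precision for robustness.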





[jira] [Commented] (TIKA-1840) No way to link slide notes to slide in PPT output.

2016-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112270#comment-15112270
 ] 

ASF GitHub Bot commented on TIKA-1840:
--

GitHub user zetisam opened a pull request:

https://github.com/apache/tika/pull/72

fix for TIKA-1840 contributed by zetisam



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zetisam/tika TIKA-1840

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/72.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #72


commit 52b82bddef7c7ae8a430c9871594295e71882055
Author: Sam Heijens 
Date:   2016-01-22T10:09:48Z

fix for TIKA-1840 contributed by zetisam




> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed that when using the PPT format, the slide notes are all aggregated at the 
> end of the XML output, and there is no way to identify which note belongs to 
> which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140.
> I would like to implement this part and contribute it.
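
The request above is to emit each slide's notes next to the slide rather than in one block at the end. A minimal Python sketch of that ordering follows. The data model (a list of text/notes pairs) and the name `render_inline` are hypothetical and stand in for whatever the extractor actually iterates over; this is not the code from the pull request.

```python
from typing import List, Optional, Tuple

# Hypothetical data model: each slide is (slide_text, notes_text or None).
Slide = Tuple[str, Optional[str]]


def render_inline(slides: List[Slide]) -> List[str]:
    """Emit each slide's notes directly after that slide, tagged with its number."""
    out = []
    for number, (text, notes) in enumerate(slides, start=1):
        out.append(f"slide {number}: {text}")
        if notes is not None:
            # The slide number ties the note to its slide in the output.
            out.append(f"notes for slide {number}: {notes}")
    return out
```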





[jira] [Commented] (TIKA-1840) No way to link slide notes to slide in PPT output.

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114576#comment-15114576
 ] 

ASF GitHub Bot commented on TIKA-1840:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/72


> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed that when using the PPT format, the slide notes are all aggregated at the 
> end of the XML output, and there is no way to identify which note belongs to 
> which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140.
> I would like to implement this part and contribute it.





[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174859#comment-15174859
 ] 

ASF GitHub Bot commented on TIKA-1857:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/74


> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Pascal Essiembre
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA





[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173509#comment-15173509
 ] 

ASF GitHub Bot commented on TIKA-1882:
--

GitHub user mkampasi opened a pull request:

https://github.com/apache/tika/pull/82

Fix for TIKA-1882

The following mime magic has been added to tika-mimetypes.xml to better 
detect the below mime-types:

1. **application/vnd.ms-cab-compressed (.cab files)** - pattern "MCSF" in 
the first 4 bytes
2.  **application/vnd.xara (.xar files)** - pattern "xar!" in the first 4 
bytes
3. **application/x-mobipocket-ebook (.mobi files)** - pattern "BOOKMOBI" 
starting at byte position 60
4. **video/quicktime (.mov files)** - patterns "free" and "wide" seen 
starting at byte position 4

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkampasi/tika master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/82.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #82


commit f7433daf434a44937ba3ae8b15813a768f95e334
Author: Manisha Kampasi 
Date:   2016-03-01T07:02:55Z

Update tika-mimetypes.xml

Updated mime-magic for 4 mime types (tika-mimetypes.xml):
1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes
2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" 
starting at byte position 60
4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting 
at byte position 4




> Updating the tika-mimetypes.xml for new mime magic patterns
> ---
>
> Key: TIKA-1882
> URL: https://issues.apache.org/jira/browse/TIKA-1882
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> The following mime magic can be added to better detect the below mime-types:
> 1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes
> 2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
> 3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting 
> at byte position 60
> 4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at 
> byte position 4
> The changes can be seen here:
> https://github.com/mkampasi/tika/commit/f7433daf434a44937ba3ae8b15813a768f95e334
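
To make the "pattern at a byte position" idea concrete, here is a minimal Python sketch of fixed-offset magic matching that uses three of the patterns listed above. The table layout, the `detect` function, and the fallback type are assumptions made for illustration; they do not reproduce the format of tika-mimetypes.xml or Tika's detector.

```python
# Patterns and offsets copied from the issue text; the table layout and the
# function name are illustrative assumptions, not Tika's own structures.
MAGIC = [
    ("application/vnd.xara", 0, b"xar!"),
    ("application/x-mobipocket-ebook", 60, b"BOOKMOBI"),
    ("video/quicktime", 4, b"free"),
    ("video/quicktime", 4, b"wide"),
]


def detect(data: bytes, default: str = "application/octet-stream") -> str:
    """Return the first type whose pattern sits exactly at its offset."""
    for mime, offset, pattern in MAGIC:
        if data[offset:offset + len(pattern)] == pattern:
            return mime
    return default
```

Slicing past the end of short input simply yields a shorter byte string, so the comparison fails without raising an error.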





[jira] [Commented] (TIKA-1881) On updating mime magic for existing mime types

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173524#comment-15173524
 ] 

ASF GitHub Bot commented on TIKA-1881:
--

GitHub user NamithaGS opened a pull request:

https://github.com/apache/tika/pull/83

Fix for TIKA-1881

Updated Mime-Magic for 6 mime types:
1. application/postscript :  files begin with pattern "%!PS-Adobe-3.0 
EPSF-3.0".
2. application/wordperfect: files begin with pattern "ÿWPC" .
3. image/tiff : updated pattern "MM.+" for the big-endian format (occurs at 
the beginning of files of the tiff mime type)
4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 
5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NamithaGS/tika master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #83


commit 780100767e24505a24595ea6db43978d0700e220
Author: NamithaGS 
Date:   2016-03-01T07:21:28Z

Update tika-mimetypes.xml

Updated Mime-Magic for 6 mime types:
1. application/postscript :  files begin with pattern "%!PS-Adobe-3.0 
EPSF-3.0".
2. application/wordperfect: files begin with pattern "ÿWPC" .
3. image/tiff : updated pattern "MM.+" for the big-endian format (occurs at 
the beginning of files of the tiff mime type)
4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 
5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)




> On updating mime magic for existing mime types
> --
>
> Key: TIKA-1881
> URL: https://issues.apache.org/jira/browse/TIKA-1881
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Namitha Sanjeeva Ganiga
>Priority: Minor
>  Labels: mime
> Fix For: 1.11
>
>
> Updated Mime-Magic for 6 mime types:
> 1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 
> EPSF-3.0".
> 2. application/wordperfect: files begin with pattern "ÿWPC" .
> 3. image/tiff : updated pattern "MM.+" for the big-endian format (occurs at 
> the beginning of files of the tiff mime type)
> 4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 
> 5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
> 6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)
> https://github.com/NamithaGS/tika/commit/780100767e24505a24595ea6db43978d0700e220
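
Several entries above match a pattern anywhere within a byte offset range rather than at one fixed position. A minimal Python sketch of that kind of range match follows. The function name is hypothetical, and treating both ends of the range as inclusive start positions is an assumption of this sketch, not a statement about Tika's semantics.

```python
def matches_in_range(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """True if pattern begins at some offset in [start, end] (both inclusive)."""
    # bytes.find limits where the match may END, so widen the upper bound by
    # len(pattern) to allow a match that STARTS exactly at `end`.
    return data.find(pattern, start, end + len(pattern)) != -1
```

For example, `matches_in_range(doc, b"rss", 5, 50)` accepts a feed whose root element follows a short XML prolog, while rejecting input where the pattern only appears outside the window.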





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186434#comment-15186434
 ] 

ASF GitHub Bot commented on TIKA-1508:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/91

TIKA-1508 : Add uniformity to parser parameter configuration - contributed 
by Thamme Gowda

1. Added `Configurable` interface.
   This can be used for all services, like `Parser` and `Detector`, which can
   take configurable parameters.

2. Added `ConfigurableParser` interface, which extends the `Parser` interface.
   I didn't add a new method to the existing `Parser` because
   that would break compatibility.

3. `AbstractParser` extends `ConfigurableParser` and has a
   default implementation for the configure() contract.
   I think it is safe to do so and it doesn't break anything.
   In addition, all parsers which extend `AbstractParser` can easily
   access config from TikaConfig if they want to.

4. Added a TODO to `TikaConfig`:
   after this, it should allow multiple instances of the same parser with
   different runtime configurations.

5. `TikaConfig` is modified to detect whether an instance can be configured;
   if so, it checks whether params are available in the XML file, parses the
   params, and invokes the configure(ctx) method with these params.

6. Added `DummyConfigurableParser`, which simply copies parameters to
   metadata for the sake of testing.

7. Added a sample XML config file for testing.
   Added `ConfigurableParserTest`, which performs an end-to-end test of all
   the above.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1508

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/91.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #91


commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c
Author: Thamme Gowda 
Date:   2016-03-09T02:23:19Z

Add uniformity to parser parameter configuration.

1. Added Configurable interface.
   This can be used for all services, like Parser and Detector, which can
   take configurable parameters.

2. Added ConfigurableParser interface, which extends the Parser interface.
   I didn't add a new method to the existing Parser because
   that would break compatibility.

3. AbstractParser extends ConfigurableParser and has a
   default implementation for the configure() contract.
   I think it is safe to do so and it doesn't break anything.
   In addition, all parsers which extend AbstractParser can easily
   access config from TikaConfig if they want to.

4. Added a TODO to TikaConfig:
   after this, it should allow multiple instances of the same parser with
   different runtime configurations.

5. TikaConfig is modified to detect whether an instance can be configured;
   if so, it checks whether params are available in the XML file, parses the
   params, and invokes the configure(ctx) method with these params.

6. Added DummyConfigurableParser, which simply copies parameters to
   metadata for the sake of testing.

7. Added a sample XML config file for testing.
   Added ConfigurableParserTest, which performs an end-to-end test of all
   the above.

commit ae51417d8881dd90b921f02c2677a7d5bfd69a30
Author: Thamme Gowda 
Date:   2016-03-09T03:23:47Z

remove unwanted TODO:




> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}
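
The flow described in the pull request above (read parameters from the main config file, check whether an instance can be configured, then call configure() on it) can be sketched in a few lines of Python. Everything below is hypothetical: the XML element and attribute names, the `REGISTRY` lookup, and the `load_parsers` function are assumptions made for this sketch and are not taken from Tika's config schema or from the Java code in the pull request.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout; element and attribute names are assumptions
# for this sketch, not Tika's actual schema.
CONFIG = """
<parsers>
  <parser class="DummyConfigurableParser">
    <params>
      <param name="depth">2</param>
      <param name="label">something</param>
    </params>
  </parser>
  <parser class="PlainParser"/>
</parsers>
"""


class DummyConfigurableParser:
    """Copies its parameters into the metadata, for testing only."""

    def __init__(self):
        self.params = {}

    def configure(self, params):
        self.params = dict(params)

    def parse(self, metadata):
        metadata.update(self.params)


class PlainParser:
    """A parser that takes no configuration (no configure() method)."""

    def parse(self, metadata):
        pass


REGISTRY = {
    "DummyConfigurableParser": DummyConfigurableParser,
    "PlainParser": PlainParser,
}


def load_parsers(xml_text):
    """Instantiate each listed parser and pass params to configurable ones."""
    parsers = []
    for node in ET.fromstring(xml_text.strip()).iter("parser"):
        parser = REGISTRY[node.get("class")]()
        params = {p.get("name"): p.text for p in node.iter("param")}
        # Only instances that expose configure() receive the parameters.
        if params and callable(getattr(parser, "configure", None)):
            parser.configure(params)
        parsers.append(parser)
    return parsers
```

Keeping configure() on a separate, optional contract means parsers that never needed parameters keep working unchanged, which matches the compatibility concern raised in the description.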





[jira] [Commented] (TIKA-1916) NPE in OpenDocumentParser

2016-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219246#comment-15219246
 ] 

ASF GitHub Bot commented on TIKA-1916:
--

GitHub user fxfixer opened a pull request:

https://github.com/apache/tika/pull/94

TIKA-1916: NPE in OpenDocumentParser

NPE in OpenDocumentParser when no "meta.xml" file exists

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fxfixer/tika patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/94.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #94


commit cb842bc9426e0a7de92eb93ac165f364af51da92
Author: fxfixer 
Date:   2016-03-31T03:02:42Z

TIKA-1916: NPE in OpenDocumentParser

NPE in OpenDocumentParser when no "meta.xml" file exists




> NPE in OpenDocumentParser
> -
>
> Key: TIKA-1916
> URL: https://issues.apache.org/jira/browse/TIKA-1916
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Nick C
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
>
> NPE in OpenDocumentParser when no "meta.xml" file exists
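
OpenDocument files are ZIP archives, and the failure reported above occurs when the archive has no meta.xml entry. The general shape of the guard is sketched below in Python. The function name is hypothetical and this is not the patch from the pull request; it only illustrates checking for the entry before using it.

```python
import io
import zipfile


def read_meta_xml(odf_bytes: bytes):
    """Return the raw bytes of meta.xml, or None when the archive has none."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as archive:
        if "meta.xml" not in archive.namelist():
            # Skip metadata extraction instead of dereferencing a missing entry.
            return None
        return archive.read("meta.xml")
```

Callers then treat None as "no metadata available" and continue with the document content.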





[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237656#comment-15237656
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

Github user reevapp closed the pull request at:

https://github.com/apache/tika/pull/101


> Include support for Yandex Translate API
> 
>
> Key: TIKA-1943
> URL: https://issues.apache.org/jira/browse/TIKA-1943
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Affects Versions: 1.12
>Reporter: Mark Duske
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Include support for Yandex' Translate API service available at 
> https://tech.yandex.com/translate/





[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237654#comment-15237654
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

Github user reevapp closed the pull request at:

https://github.com/apache/tika/pull/103







[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237655#comment-15237655
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

Github user reevapp closed the pull request at:

https://github.com/apache/tika/pull/102







[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237662#comment-15237662
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

GitHub user reevapp opened a pull request:

https://github.com/apache/tika/pull/106

fix for TIKA-1943 contributed by Mark Duske

Support for Yandex "Translate API" Service

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reevapp/tika master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #106


commit 08e932bd75d3a8922d04fe30e7e097b6e84e6dfd
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-12T17:58:54Z

fix for TIKA-1943 contributed by Mark Duske

Includes Unit Tests for support to Yandex Translate API

commit f509917b56ea86d89102b4dae983ad82cd3fbe89
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-12T18:00:22Z

fix for TIKA-1943 contributed by Mark Duske

Properties file used by YandexTranslator

commit 86145d99df22f6f75f0602e984872bc0ef7e53f1
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-12T18:01:39Z

fix for TIKA-1943 contributed by Mark Duske

Includes support for Yandex Translate API









[jira] [Commented] (TIKA-1941) Can only respond correctly to its first request and cannot assign a User-Key dynamically

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233286#comment-15233286
 ] 

ASF GitHub Bot commented on TIKA-1941:
--

GitHub user reevapp opened a pull request:

https://github.com/apache/tika/pull/100

fix for TIKA-1941 contributed by Mark Duske

Class made thread-safe; a Lingo24 User-Key can now be 
assigned dynamically

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reevapp/tika patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #100


commit 2abf9f1abca5bc3e85a8015adc507c52938cbf74
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-09T01:49:58Z

fix for TIKA-1941 contributed by Mark Duske

Class made thread-safe; a Lingo24 User-Key can now be 
assigned dynamically




> Can only respond correctly to its first request and cannot assign a User-Key 
> dynamically
> 
>
> Key: TIKA-1941
> URL: https://issues.apache.org/jira/browse/TIKA-1941
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 1.12
>Reporter: Mark Duske
> Fix For: 1.12
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It is impossible to assign a User-Key dynamically; it must be in the properties 
> file in the JAR. After a User-Key is set, the translator responds correctly only 
> to the first request; subsequent requests receive a non-JSON message that only 
> says that the selected language is not supported.
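
The two problems described above (a key that cannot be set at runtime, and state that breaks after the first request) suggest a design where the key sits behind a lock and each request builds its own state. The Python sketch below shows that shape only. The class name, method names, and request fields are hypothetical; none of them come from the Lingo24 API or from the patched Tika class.

```python
import threading


class KeyedTranslator:
    """Hypothetical stand-in for a translator that needs a user key."""

    def __init__(self, user_key=None):
        self._lock = threading.Lock()
        self._user_key = user_key

    def set_user_key(self, user_key):
        # The key can be replaced at runtime instead of living only in a
        # bundled properties file.
        with self._lock:
            self._user_key = user_key

    def is_available(self):
        with self._lock:
            return bool(self._user_key)

    def build_request(self, text, source, target):
        # Snapshot the key under the lock; each call builds fresh request
        # state, so nothing from an earlier request leaks into a later one.
        with self._lock:
            key = self._user_key
        if not key:
            raise ValueError("no user key set")
        return {"user_key": key, "q": text, "source": source, "target": target}
```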





[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233652#comment-15233652
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

GitHub user reevapp opened a pull request:

https://github.com/apache/tika/pull/102

fix for TIKA-1943 contributed by Mark Duske

Unit tests for YandexTranslator class

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reevapp/tika patch-4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #102


commit aac7822ee800044c5512a824e976e7063170f4e5
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-09T17:29:03Z

fix for TIKA-1943 contributed by Mark Duske

Unit tests for YandexTranslator class









[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233651#comment-15233651
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

GitHub user reevapp opened a pull request:

https://github.com/apache/tika/pull/101

fix for TIKA-1943 contributed by Mark Duske

Includes support for Yandex Translate API

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reevapp/tika patch-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #101


commit 0e0ce6d47f3291c74fd5d5f083d3d162d8b2abe5
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-09T17:27:05Z

fix for TIKA-1943 contributed by Mark Duske

Includes support for Yandex Translate API









[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API

2016-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233653#comment-15233653
 ] 

ASF GitHub Bot commented on TIKA-1943:
--

GitHub user reevapp opened a pull request:

https://github.com/apache/tika/pull/103

fix for TIKA-1943 contributed by Mark Duske

Properties file used by YandexTranslator

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reevapp/tika patch-5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit a374748169248a12d81817cbc26dbe037d6f9c3d
Author: ReEvApp - Re-Evolution Applications, LLC 
Date:   2016-04-09T17:34:05Z

fix for TIKA-1943 contributed by Mark Duske

Properties file used by YandexTranslator









[jira] [Commented] (TIKA-774) ExifTool Parser

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205245#comment-15205245
 ] 

ASF GitHub Bot commented on TIKA-774:
-

GitHub user rgauss opened a pull request:

https://github.com/apache/tika/pull/92

TIKA-774: ExifTool Parser

Contribution of tika-exiftool for review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Alfresco/tika tika-exiftool

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #92


commit 8eb474b06e1463ca172128b59b713782eb4bece8
Author: rgauss 
Date:   2016-03-19T20:37:37Z

Initial commit of tika-exiftool as is

commit 5ff139d68bebd39382d5ed9626bff42797ece01d
Author: rgauss 
Date:   2016-03-19T22:44:00Z

Added git ignore of properties override

commit c8f4fb062ce809661527c91df89b230da95f592c
Author: rgauss 
Date:   2016-03-21T18:49:38Z

Merge branch 'master' into tika-exiftool

commit e8a2fa30b16f8b947d118b61ca12476420e9bee0
Author: rgauss 
Date:   2016-03-21T21:24:29Z

TIKA-774: ExifTool Parser
  - Moved tika-exiftool from separate project to parsers
  - Updated license headers
  - Removed author Javadoc
  - Fixed a few forbiddenapi violations

commit 37aae337c5ca3b5a45c2e45804e3768e08a8bbb6
Author: rgauss 
Date:   2016-03-21T21:31:31Z

TIKA-774: ExifTool Parser
  - Removed more author Javadocs

commit 90f8550c03aa873a81975dfa10cfd77aa557fc6f
Author: rgauss 
Date:   2016-03-21T22:00:00Z

TIKA-774: ExifTool Parser
  - Renamed ExecutableUtils to ExiftoolExecutableUtils
  - Changed ExifToolImageParserTest to skip when exiftool is not
available




> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool to be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.13
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.
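
The description above says the extractor calls ExifTool on the command line and maps the response to Tika metadata fields. A minimal sketch of that mapping step follows, assuming ExifTool's default "Tag Name : value" line output. The class name, the mapping table, and the target keys are hypothetical and are not taken from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this is not the ExiftoolTikaMapper from the patch.
class ExiftoolOutputSketch {

    // Assumed mapping from ExifTool tag names to metadata keys (illustrative only).
    static final Map<String, String> TAG_TO_FIELD = new LinkedHashMap<>();
    static {
        TAG_TO_FIELD.put("Image Width", "example:width");
        TAG_TO_FIELD.put("Image Height", "example:height");
    }

    // Parses "Tag Name : value" lines and keeps only the tags that have a mapping.
    static Map<String, String> map(String exiftoolOutput) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : exiftoolOutput.split("\\r?\\n")) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue; // not a tag line
            }
            String tag = line.substring(0, sep).trim();
            String value = line.substring(sep + 1).trim();
            String field = TAG_TO_FIELD.get(tag);
            if (field != null) {
                metadata.put(field, value);
            }
        }
        return metadata;
    }
}
```

Splitting on the first colon keeps values that themselves contain colons (for example timestamps) intact.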





[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182129#comment-15182129
 ] 

ASF GitHub Bot commented on TIKA-1877:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/81


> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
> Key: TIKA-1877
> URL: https://issues.apache.org/jira/browse/TIKA-1877
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
> Attachments: 
> 3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079, 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE  =  
>   T".
> Tika detected a .fts file as application/octet-stream. On verifying the 
> header, I found the value to be "SIMPLE  =T" (just 16 spaces 
> between = and T).
> I tried the following changes:
> Change 1) Updated the existing match value. But the build failed 
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now Tika returns an empty value. It identifies the file neither as .fts nor 
> as application/octet-stream.
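
The report turns on how many spaces sit between "=" and "T" in the header. Independent of how the magic rule in tika-mimetypes.xml is written, a whitespace-tolerant check can be sketched as below. The class name and the regular expression are assumptions for illustration and are not the match rule from the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical sketch of a header check that tolerates any number of spaces.
class FitsHeaderSketch {

    // "SIMPLE", optional spaces, "=", optional spaces, then the logical value T.
    private static final Pattern SIMPLE_T = Pattern.compile("^SIMPLE *= *T");

    static boolean looksLikeFits(byte[] header) {
        String text = new String(header, StandardCharsets.US_ASCII);
        return SIMPLE_T.matcher(text).find();
    }
}
```

A fixed-string match only accepts one exact spacing, which is why two files with different padding would need two separate match values.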





[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175785#comment-15175785
 ] 

ASF GitHub Bot commented on TIKA-1816:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/84


> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>  Labels: memex
> Fix For: 1.13
>
> Attachments: TIKA-1816-proxy-fix.patch
>
>
> NamedEntityParser has hard setup requirements, such as downloading NER models 
> from remote servers and adding them to the classpath.
> These model files are huge and hence are not added to source control,
> so the tests are likely to fail in many environments.
> Make a best effort to set up the tests, but in the worst case skip them 
> instead of failing the whole build.
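
The "skip instead of fail" idea above can be sketched in plain Java as follows. The class and method names are hypothetical; in a real JUnit 4 test the same effect is usually obtained with org.junit.Assume, which marks a test as skipped when a precondition does not hold.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of lenient test execution; not the code from the patch.
class LenientTestSketch {

    enum Outcome { PASSED, FAILED, SKIPPED }

    // Runs the test body only when the precondition (e.g. "model is on the
    // classpath") holds; otherwise reports SKIPPED instead of failing.
    static Outcome run(BooleanSupplier modelAvailable, BooleanSupplier testBody) {
        if (!modelAvailable.getAsBoolean()) {
            return Outcome.SKIPPED; // missing model: do not fail the build
        }
        return testBody.getAsBoolean() ? Outcome.PASSED : Outcome.FAILED;
    }
}
```

The point of the design is that an environment problem (no network, no model file) and a genuine assertion failure produce different outcomes.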





[jira] [Commented] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175077#comment-15175077
 ] 

ASF GitHub Bot commented on TIKA-1876:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/80


> Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity 
> Recognition
> ---
>
> Key: TIKA-1876
> URL: https://issues.apache.org/jira/browse/TIKA-1876
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
>Reporter: Manali Shah
>Assignee: Chris A. Mattmann
> Fix For: 1.13
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi all, 
> Apache Tika already performs Named Entity Recognition using Open NLP and 
> Stanford Core NLP. Natural Language Toolkit (NLTK) is another open source Python 
> library, and I believe it would be a great idea to integrate NLTK 
> with Tika. 
> NLTK can extract named entities as well as classify them. For this purpose, I, along with 
> Prof. Chris Mattmann, have published NLTKRest, a Python pip/setuptools 
> installable module that exposes NLTK as a REST service. 
> I have tested Tika together with NLTKRest in my local repository 
> and will soon submit a pull request. 
> Link to rest server: https://github.com/manalishah/NLTKRest





[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175318#comment-15175318
 ] 

ASF GitHub Bot commented on TIKA-1816:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/84

TIKA-1816: NER model download via Maven proxy (from 1.x to 2.x)

This PR brings the proxy-based download feature from the 1.x branch to 2.x.

Closes  TIKA-1816 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika 2.x-TIKA-1816

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/84.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #84


commit c4feaff19187f548730f48a77fc437ca12bb40b4
Author: Thamme Gowda 
Date:   2016-03-02T09:12:26Z

Copy Proxy download fix to 2.x









[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177637#comment-15177637
 ] 

ASF GitHub Bot commented on TIKA-1841:
--

GitHub user zetisam opened a pull request:

https://github.com/apache/tika/pull/86

fix for TIKA-1841 contributed by zetisam



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zetisam/tika TIKA-1841

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #86


commit ea82d8538dbd7a1f68d4d290ad0c115f62b29c76
Author: Sam Heijens 
Date:   2016-02-15T15:09:51Z

fix for TIKA-1841 contributed by zetisam




> Different XML output structure for PPT and PPTX
> ---
>
> Key: TIKA-1841
> URL: https://issues.apache.org/jira/browse/TIKA-1841
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of PowerPoint (PPT) and PPTX files is 
> different. 
> The structure for PPTX seems as follows:
> {code}
> 
> 
>  //optional
>  //optional
> ...
> 
> 
>  //optional
>  //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For PowerPoint (PPT) the structure is as follows:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> 
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to write different XPath queries depending on whether 
> the file is PPT (old) or PPTX (new). It would be nice for Tika to return the 
> same XML for both.
> I would propose changing the XML structure to this:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.
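
To illustrate why the differing structures force different XPath queries, here is a minimal sketch using the JDK's javax.xml.xpath API. The two XML strings are hypothetical stand-ins (one flat, one with a parent element per slide); they are not Tika's actual PPT or PPTX output, whose element names are not reproduced here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; the documents below are invented for illustration.
class SlideXPathSketch {

    // Assumed flat output: no element wraps an individual slide.
    static final String FLAT = "<body><p>Slide 1</p><p>Slide 2</p></body>";

    // Assumed nested output: each slide has its own parent element.
    static final String NESTED =
        "<body><div class=\"slide\"><p>Slide 1</p></div>"
        + "<div class=\"slide\"><p>Slide 2</p></div></body>";

    // Counts the nodes that the expression selects in the given document.
    static int count(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }
}
```

A query written for the nested shape, such as /body/div[@class='slide']/p, selects nothing in the flat document, and /body/p selects nothing in the nested one, so a caller has to branch on the file type.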





[jira] [Commented] (TIKA-1883) Identification of Mime Type for Empty Files

2016-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178469#comment-15178469
 ] 

ASF GitHub Bot commented on TIKA-1883:
--

GitHub user adityardesai opened a pull request:

https://github.com/apache/tika/pull/87

Fix for TIKA-1883 and 1884

TIKA-1883
Identification of MIME types for empty files: updates the Tika 1.12 source 
code to fix this issue. The Tika detector and parsers have been modified 
accordingly to identify empty files and classify them.

TIKA-1884
Updates Tika's MIME repository with the following file types:
1. .sfdu - Standard Formatted Data Unit
2. .CDF - Common Data Format, with the magic bytes CDF at offset 0
The Tika MIME repository is updated with these file types.

The updated code is available at 

https://github.com/RashmiNalwad/MIME-Type-Identification-of-TREC-POLAR-DATASET
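
The two checks described above (zero-length input, and the bytes CDF at offset 0) can be sketched as follows. This is a simplified illustration, not the detector from the pull request; the class name and the two "example/..." labels are hypothetical placeholders.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an empty-file check followed by a magic-byte check.
class EmptyAndMagicSketch {

    static final String EMPTY = "example/x-empty";            // hypothetical label
    static final String CDF = "example/x-cdf";                // hypothetical label
    static final String FALLBACK = "application/octet-stream";

    private static final byte[] CDF_MAGIC = "CDF".getBytes(StandardCharsets.US_ASCII);

    static String detect(byte[] content) {
        if (content.length == 0) {
            return EMPTY; // nothing to inspect, so classify as empty
        }
        if (startsWith(content, CDF_MAGIC)) {
            return CDF;   // magic bytes found at offset 0
        }
        return FALLBACK;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```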



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/87.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #87


commit 1a3749fa632fdb8ad0bcb2cea673113031f9b4be
Author: Chris Mattmann 
Date:   2015-06-25T17:54:55Z

Fix for TIKA-1659 ZipContainerDetector does not detect all IPA files 
contributed by Rami Shomali  this closes #51.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1687594 
13f79535-47bb-0310-9956-ffa450edef68

commit 90a2202b5b4a75e7f673bfb42a912cb97ae6d26e
Author: Tim Allison 
Date:   2015-06-28T01:57:30Z

TIKA-1663 add a DigestingParser

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1687981 
13f79535-47bb-0310-9956-ffa450edef68

commit 444dadd5eb090f6e2998507e444b2014905cb90f
Author: Chris Mattmann 
Date:   2015-06-29T05:19:48Z

Fix for TIKA-1664: GDALParser now correctly sets nitf as a supported media 
type contributed by Joseph North  this closes #53.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688086 
13f79535-47bb-0310-9956-ffa450edef68

commit 761273f9e69c4a7595e50ccd6a2d9304c398d0b1
Author: Chris Mattmann 
Date:   2015-06-29T05:26:52Z

Fix for TIKA-1669: xpath node test ./node() should match all contained 
nodes contributed by WulfB  this closes #52

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688087 
13f79535-47bb-0310-9956-ffa450edef68

commit fd8514c2c512d9dcc1039aadf1dbc64c1ff6d3fc
Author: Chris Mattmann 
Date:   2015-06-29T14:34:29Z

Rollback r1688087 as it seems to cause some tests to fail.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688239 
13f79535-47bb-0310-9956-ffa450edef68

commit 2a47d9aa340d529f027c94f3c233645fb2f8bf7e
Author: Tim Allison 
Date:   2015-06-30T00:48:03Z

TIKA-1601: integrate Jackcess to parse MSAccess files

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688337 
13f79535-47bb-0310-9956-ffa450edef68

commit 06cfbaafeb308bd979fd2214a4b1a15353a9b4ab
Author: Chris Mattmann 
Date:   2015-07-01T13:21:41Z

Fix for TIKA-1602: Detecting standards-non-compliant emails as 
message/rfc822 contributed by Jeremy B. Merrill  
this closes #40.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688647 
13f79535-47bb-0310-9956-ffa450edef68

commit 425506e90500dadcccf82fd66aa15ce14d23facc
Author: Tyler Palsulich 
Date:   2015-07-02T08:13:00Z

TIKA-1536. Upgrade to Java 1.7.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688779 
13f79535-47bb-0310-9956-ffa450edef68

commit 4695df5672492c38a8abcd230c8545f982a7f65d
Author: Tyler Palsulich 
Date:   2015-07-02T08:14:48Z

TIKA-1536. Update CHANGES.txt with upgrade to Java 7.

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688780 
13f79535-47bb-0310-9956-ffa450edef68

commit de5a2dec6924ebe01e4bf323a98abd208cf9aa7e
Author: Nick Burch 
Date:   2015-07-02T10:35:06Z

Remove change comment, TIKA-1602

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688805 
13f79535-47bb-0310-9956-ffa450edef68

commit 2764fb8606964c3350c781ecf5df4042706b4099
Author: Tim Allison 
Date:   2015-07-02T13:47:23Z

TIKA-1673 drop source file name from embedded file path; made a few java 7 
updates; added timing for embedded docs

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688827 
13f79535-47bb-0310-9956-ffa450edef68

commit 

[jira] [Commented] (TIKA-1926) JSON TEI Exception

2016-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223560#comment-15223560
 ] 

ASF GitHub Bot commented on TIKA-1926:
--

GitHub user hasanayesha opened a pull request:

https://github.com/apache/tika/pull/97

fix for TIKA-1926 contributed by hasanayesha

JSON TEI Exception Handled.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hasanayesha/tika TIKA-1926

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #97


commit 63bb15467fcd6e8766b0361e78231f6f7f6a4a08
Author: hasanayesha 
Date:   2016-04-03T23:52:10Z

fix for TIKA-1926 contributed by hasanayesha




> JSON TEI Exception
> --
>
> Key: TIKA-1926
> URL: https://issues.apache.org/jira/browse/TIKA-1926
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Ayesha Hasan
>Priority: Minor
>  Labels: easyfix, patch
> Fix For: 1.12
>
>
> A JSONException was thrown by GROBID when the JSON TEI object wasn't found.
> Fixed by adding a try/catch block.





[jira] [Commented] (TIKA-1916) NPE in OpenDocumentParser

2016-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223990#comment-15223990
 ] 

ASF GitHub Bot commented on TIKA-1916:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/94


> NPE in OpenDocumentParser
> -
>
> Key: TIKA-1916
> URL: https://issues.apache.org/jira/browse/TIKA-1916
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Nick C
>Assignee: Tim Allison
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
> Attachments: MissingMeta.odt
>
>
> NPE in OpenDocumentParser when no "meta.xml" file exists
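
A guard for the missing-entry case can be sketched with the JDK's java.util.zip classes. The class and helper names are hypothetical and this is not the fix that was committed; it only shows checking for "meta.xml" before trying to read it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; not the OpenDocumentParser code.
class MetaXmlGuardSketch {

    // Demo helper: builds a small in-memory zip with the given entry names.
    static byte[] zipOf(String... names) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                for (String name : names) {
                    zip.putNextEntry(new ZipEntry(name));
                    zip.write('x');
                    zip.closeEntry();
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true only if the container has a meta.xml entry, so a caller can
    // skip metadata handling instead of dereferencing a missing entry.
    static boolean hasMetaXml(byte[] container) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(container))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if ("meta.xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```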





[jira] [Commented] (TIKA-1927) NPE in JDBCTableReader

2016-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224207#comment-15224207
 ] 

ASF GitHub Bot commented on TIKA-1927:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/98


> NPE in JDBCTableReader
> --
>
> Key: TIKA-1927
> URL: https://issues.apache.org/jira/browse/TIKA-1927
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Nick C
>Assignee: Tim Allison
>Priority: Minor
>  Labels: easyfix, patch
> Fix For: 1.13
>
>
> NPE when there is a null String in a SQLite DB.
> Caused by: java.lang.NullPointerException
>     at org.apache.tika.parser.jdbc.JDBCTableReader.addAllCharacters(JDBCTableReader.java:252)
>     at org.apache.tika.parser.jdbc.JDBCTableReader.handleCell(JDBCTableReader.java:135)
>     at org.apache.tika.parser.jdbc.JDBCTableReader.nextRow(JDBCTableReader.java:95)
>     at org.apache.tika.parser.jdbc.AbstractDBParser.parse(AbstractDBParser.java:90)
>     at org.apache.tika.parser.jdbc.SQLite3Parser.parse(SQLite3Parser.java:78)
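
The trace points at a String cell being dereferenced while it is null. A minimal sketch of a null guard for that situation follows; the class and method names are hypothetical and this is not the actual JDBCTableReader change.

```java
// Hypothetical sketch of guarding against a SQL NULL text cell.
class NullCellSketch {

    // Appends the cell text to the output, treating a null value as an empty cell.
    static void appendCell(StringBuilder out, String cellValue) {
        if (cellValue == null) {
            return; // nothing to write; avoids dereferencing a null String
        }
        out.append(cellValue.toCharArray());
    }
}
```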





[jira] [Commented] (TIKA-1925) Composite External Parser like Exiftool fails to run on Windows.

2016-04-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222655#comment-15222655
 ] 

ASF GitHub Bot commented on TIKA-1925:
--

GitHub user mit2nil opened a pull request:

https://github.com/apache/tika/pull/96

fix for TIKA-1925 contributed by Nilay Chheda

@chrismattmann Please review the change and let me know if it can be 
contributed back to Tika. 
Issue description: 
https://issues.apache.org/jira/browse/TIKA-1925

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mit2nil/tika TIKA-1925

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/96.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #96


commit c6e2b028beed78e66a80e7e22bc5d9f74b240dbe
Author: mit2nil 
Date:   2016-04-02T02:11:08Z

fix for TIKA-1925 contributed by Nilay Chheda




> Composite External Parser like Exiftool fails to run on Windows.
> 
>
> Key: TIKA-1925
> URL: https://issues.apache.org/jira/browse/TIKA-1925
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.12
> Environment: Windows 10, Intel i7 6550U 64-Bit processor
>Reporter: Nilay Chheda
> Fix For: 1.13
>
> Attachments: ExternalParser_modified.java, ExternalParser_orig.java
>
>
> While trying to run the EXIFTool parser using Tika on Windows, we get the 
> following error output. 
> (Ref: http://wiki.apache.org/tika/EXIFToolParser)
> java.io.IOException: Cannot run program "env": CreateProcess error=2, The system cannot find the file specified
>     at java.lang.ProcessBuilder.start(Unknown Source)
>     at java.lang.Runtime.exec(Unknown Source)
>     at java.lang.Runtime.exec(Unknown Source)
>     at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:182)
>     at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:145)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
>     at java.lang.ProcessImpl.create(Native Method)
>     at java.lang.ProcessImpl.<init>(Unknown Source)
>     at java.lang.ProcessImpl.start(Unknown Source)
>     ... 13 more
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.external.ExternalParser@51efea79
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> Caused by: java.lang.NullPointerException
>     at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:218)
>     at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:145)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 7 more
> After analyzing the stack trace and a little experimentation, we found that 
> "env" is a Unix/Mac OS X/Linux-specific command and does not work on Windows. 
> We were able to work around this problem by adding some Windows-specific code, 
> recompiling Tika, and running again with a similar setup. I am attaching the original 
> file and the modified file for review. 
> If the fix meets Tika's standards, I can send a pull request 
> on GitHub to contribute the patch. 
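
One way to avoid depending on "env" on Windows is to choose the launcher from the operating system name. The sketch below is an assumption about the general approach, not the attached ExternalParser_modified.java; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of platform-aware command construction.
class PlatformCommandSketch {

    static boolean isWindows(String osName) {
        return osName != null && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    // Prefixes the tool invocation with a launcher that exists on the platform.
    static List<String> launcherFor(String osName, List<String> toolCommand) {
        List<String> command = new ArrayList<>();
        if (isWindows(osName)) {
            command.add("cmd.exe"); // Windows command interpreter
            command.add("/c");      // run the command, then exit
        } else {
            command.add("env");     // available on Unix-like systems
        }
        command.addAll(toolCommand);
        return command;
    }
}
```

In use, the first argument would typically be System.getProperty("os.name") and the resulting list would be handed to a ProcessBuilder.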





[jira] [Commented] (TIKA-1914) ExecutableParser doesn't call start document

2016-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224842#comment-15224842
 ] 

ASF GitHub Bot commented on TIKA-1914:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/93


> ExecutableParser doesn't call start document
> 
>
> Key: TIKA-1914
> URL: https://issues.apache.org/jira/browse/TIKA-1914
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Nick C
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
>
> The ExecutableParser doesn't call startDocument, which causes errors when 
> producing XHTML.
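
The SAX contract behind this report is that element events are bracketed by startDocument and endDocument. The sketch below shows a well-behaved event sequence using the JDK's org.xml.sax classes; the class names are hypothetical and this is not the ExecutableParser code.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the expected SAX event order.
class StartDocumentSketch {

    // Records the SAX events it receives, in order.
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }

        @Override public void endDocument() { events.add("endDocument"); }

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("start:" + local);
        }

        @Override public void endElement(String uri, String local, String qName) {
            events.add("end:" + local);
        }
    }

    // Emits a minimal document: startDocument first, endDocument last.
    static void emit(ContentHandler handler) {
        try {
            handler.startDocument();
            handler.startElement("", "html", "html", new AttributesImpl());
            handler.endElement("", "html", "html");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Handlers that serialize XHTML generally set up their state in startDocument, which is why omitting the call tends to surface as errors further downstream.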





[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files

2016-04-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257333#comment-15257333
 ] 

ASF GitHub Bot commented on TIKA-1893:
--

GitHub user mkampasi opened a pull request:

https://github.com/apache/tika/pull/110

fix for TIKA-1893 contributed by mkampasi

Added a custom parser class for parsing Apple ICNS files.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkampasi/tika master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #110


commit 0cdf17df913cadf23c8074d46d386fa230f8198c
Author: mkampasi 
Date:   2016-04-26T00:20:38Z

Adding parser for ICNS files




> Add new mimetype for *.icns (Apple Icon Image Format) files 
> 
>
> Key: TIKA-1893
> URL: https://issues.apache.org/jira/browse/TIKA-1893
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> Currently, Tika does not support the "image/icns" MIME type for *.icns files 
> (Apple Icon Image Format). This can be added to tika-mimetypes.xml.
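
Detection for this format would typically rely on a magic-byte match. To my understanding, ICNS files begin with the four ASCII bytes "icns"; the sketch below checks for that prefix. The class name is hypothetical, and this is neither the tika-mimetypes.xml entry nor the parser from the pull request.

```java
// Hypothetical sketch of a magic-byte check for Apple ICNS files.
class IcnsMagicSketch {

    // Assumed file signature: the ASCII bytes "icns" at offset 0.
    private static final byte[] MAGIC = {'i', 'c', 'n', 's'};

    static boolean isIcns(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```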





[jira] [Commented] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260457#comment-15260457
 ] 

ASF GitHub Bot commented on TIKA-1885:
--

Github user adeshgupta closed the pull request at:

https://github.com/apache/tika/pull/89


> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.





[jira] [Commented] (TIKA-1938) HtmlParser drops

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260771#comment-15260771
 ] 

ASF GitHub Bot commented on TIKA-1938:
--

GitHub user naegelejd opened a pull request:

https://github.com/apache/tika/pull/111

fix for TIKA-1938 contributed by naegelejd

Adds HtmlParser support for  tags within 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/naegelejd/tika TIKA-1938

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #111


commit b6d23c189e852fa2e41b441c18bfe3e66e3f67c4
Author: Joseph Naegele 
Date:   2016-04-27T18:35:11Z

fix for TIKA-1938 contributed by naegelejd

add HtmlParser support for