[jira] [Commented] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition
[ https://issues.apache.org/jira/browse/TIKA-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171368#comment-15171368 ] ASF GitHub Bot commented on TIKA-1876: -- GitHub user manalishah opened a pull request: https://github.com/apache/tika/pull/80 Integrate NLTK with Tika fix for TIKA-1876 contributed by manalishah You can merge this pull request into a Git repository by running: $ git pull https://github.com/manalishah/tika TIKA-1876 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/80.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #80 commit c809690ec87ffa600018dbc5eee6d6756645adb0 Author: manali Date: 2016-02-27T03:58:06Z fix for TIKA-1876 contributed by manalishah commit 3a7e24c9a5d77ae41bde0c2106233a2064c5e707 Author: manali Date: 2016-02-27T04:00:05Z fix for TIKA-1876 contributed by manalishah commit 114d0ff24bd04395852012a3382d50c3e906e6db Author: manali Date: 2016-02-27T04:06:20Z fix for TIKA-1876 contributed by manalishah commit cdb684d9c1b0ebb01a783180f07417760fa04d6f Author: manali Date: 2016-02-27T10:10:06Z fix for TIKA-1876 contributed by manalishah > Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity > Recognition > --- > > Key: TIKA-1876 > URL: https://issues.apache.org/jira/browse/TIKA-1876 > Project: Tika > Issue Type: New Feature > Components: parser > Affects Versions: 1.13 > Reporter: Manali Shah > Fix For: 1.13 > > Original Estimate: 168h > Remaining Estimate: 168h > > Hi all, > Apache Tika already performs Named Entity Recognition using OpenNLP and > Stanford CoreNLP. Natural Language Toolkit (NLTK) is another open source Python > library, and I believe it would be a great idea to have NLTK integrated > with Tika. > NLTK can extract named entities as well as classify them. For this purpose I, along with > Prof Chris Mattmann, have published NLTKRest, a Python pip/setuptools > installable module that exposes NLTK as a REST service. > I have tested Tika together with NLTKRest on my local repository > and will soon submit a pull request. > Link to REST server: https://github.com/manalishah/NLTKRest -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167570#comment-15167570 ] ASF GitHub Bot commented on TIKA-1870: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/77 > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172890#comment-15172890 ] ASF GitHub Bot commented on TIKA-1877: -- GitHub user prasadns14 opened a pull request: https://github.com/apache/tika/pull/81 fix for TIKA-1877 contributed by prasadns14 Updated the tika-mimetypes.xml. Also added a new .fits file to test-documents and created a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/prasadns14/tika TIKA-1877 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/81.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #81 commit 602d237feec48bfd97bc2b2b38ea614b1ae2c55d Author: prasadns14 Date: 2016-02-29T23:03:13Z fix for TIKA-1877 contributed by prasadns14 > On updating the tika-mimetypes.xml to detect .fts file format, tika detector > does not return anything > - > > Key: TIKA-1877 > URL: https://issues.apache.org/jira/browse/TIKA-1877 > Project: Tika > Issue Type: Bug > Components: mime > Reporter: Prasad Nagaraj Subramanya > Priority: Minor > Attachments: > 3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079, > 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, > tika-mimetypes.xml > > > The match value for .fts file format in tika-mimetypes.xml is "SIMPLE = > T". > Tika detected a .fts file as application/octet-stream. On verifying the > header I found the value to be "SIMPLE =T" (just 16 spaces > before = and T). > I tried the following changes: > Change 1) Updated the existing match value, but the build failed. > Change 2) Added a new match value type="string" offset="0"/> after the existing one. > But now Tika returns an empty value. It identifies the file neither as .fts nor > as application/octet-stream.
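The report above hinges on how many spaces separate `SIMPLE`, `=`, and `T` in the file header, and the exact spacing has been collapsed in this archive. As a minimal sketch only, and not Tika's detector or the fix in the pull request, a whitespace-tolerant check could look like the following. The helper name `looks_like_fits` and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of Tika: tolerate any run of spaces
# (including none) on either side of "=" at the very start of the file.
_FITS_MAGIC = re.compile(rb"SIMPLE *= *T")


def looks_like_fits(header: bytes) -> bool:
    """Return True if the bytes begin with SIMPLE, '=', T separated only by spaces."""
    return _FITS_MAGIC.match(header) is not None
```

A fixed-string magic entry matches only one exact spacing, which is why two files with differently padded headers can need two separate match values.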
[jira] [Commented] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112270#comment-15112270 ] ASF GitHub Bot commented on TIKA-1840: -- GitHub user zetisam opened a pull request: https://github.com/apache/tika/pull/72 fix for TIKA-1840 contributed by zetisam You can merge this pull request into a Git repository by running: $ git pull https://github.com/zetisam/tika TIKA-1840 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/72.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #72 commit 52b82bddef7c7ae8a430c9871594295e71882055 Author: Sam Heijens Date: 2016-01-22T10:09:48Z fix for TIKA-1840 contributed by zetisam > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser > Affects Versions: 1.11 > Reporter: Sam H > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from PowerPoint slides, both PPT and PPTX. > I've noticed that with the PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140. > I would like to implement this part and contribute it.
[jira] [Commented] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114576#comment-15114576 ] ASF GitHub Bot commented on TIKA-1840: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/72 > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser > Affects Versions: 1.11 > Reporter: Sam H > Assignee: Chris A. Mattmann > Fix For: 1.12 > > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from PowerPoint slides, both PPT and PPTX. > I've noticed that with the PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140. > I would like to implement this part and contribute it.
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174859#comment-15174859 ] ASF GitHub Bot commented on TIKA-1857: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/74 > Enhance PDFParser to extract text from XFA forms > > > Key: TIKA-1857 > URL: https://issues.apache.org/jira/browse/TIKA-1857 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Pascal Essiembre >Priority: Trivial > Labels: patch > Fix For: 1.13 > > Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, > xfa_in_govdocs1.txt > > > Extract text from PDF Forms (XFA). Information about XFA: > https://en.wikipedia.org/wiki/XFA -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns
[ https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173509#comment-15173509 ] ASF GitHub Bot commented on TIKA-1882: -- GitHub user mkampasi opened a pull request: https://github.com/apache/tika/pull/82 Fix for TIKA-1882 The following mime magic has been added to tika-mimetypes.xml to better detect the below mime-types: 1. **application/vnd.ms-cab-compressed (.cab files)** - pattern "MCSF" in the first 4 bytes 2. **application/vnd.xara (.xar files)** - pattern "xar!" in the first 4 bytes 3. **application/x-mobipocket-ebook (.mobi files)** - pattern "BOOKMOBI" starting at byte position 60 4. **video/quicktime (.mov files)** - patterns "free" and "wide" seen starting at byte position 4 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mkampasi/tika master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/82.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #82 commit f7433daf434a44937ba3ae8b15813a768f95e334 Author: Manisha KampasiDate: 2016-03-01T07:02:55Z Update tika-mimetypes.xml Updated mime-magic for 4 mime types (tika-mimetypes.xml): 1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes 2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes 3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting at byte position 60 4. 
video/quicktime (.mov files) - patterns "free" and "wide" seen starting at byte position 4 > Updating the tika-mimetypes.xml for new mime magic patterns > --- > > Key: TIKA-1882 > URL: https://issues.apache.org/jira/browse/TIKA-1882 > Project: Tika > Issue Type: Improvement > Components: mime >Affects Versions: 1.11 >Reporter: Manisha Kampasi >Priority: Minor > Labels: patch > > The following mime magic can be added to better detect the below mime-types: > 1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes > 2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes > 3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting > at byte position 60 > 4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at > byte position 4 > The changes can be seen here: > https://github.com/mkampasi/tika/commit/f7433daf434a44937ba3ae8b15813a768f95e334 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
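The four entries listed above all follow the same rule: a fixed byte pattern must appear at a fixed offset. A minimal Python sketch of that rule follows. It is not Tika's implementation; the table layout and the `detect` function are hypothetical names used for illustration, and the offsets and pattern strings are copied verbatim from the message without being checked against tika-mimetypes.xml.

```python
from typing import Optional

# (offset, pattern, mime type), as written in the message above.
# Assumption: "byte position N" means a zero-based offset into the file.
MAGIC_TABLE = [
    (0, b"MCSF", "application/vnd.ms-cab-compressed"),
    (0, b"xar!", "application/vnd.xara"),
    (60, b"BOOKMOBI", "application/x-mobipocket-ebook"),
    (4, b"free", "video/quicktime"),
    (4, b"wide", "video/quicktime"),
]


def detect(data: bytes) -> Optional[str]:
    """Return the first mime type whose pattern sits exactly at its offset."""
    for offset, pattern, mime in MAGIC_TABLE:
        # Slicing past the end of data yields a shorter slice, never an error.
        if data[offset:offset + len(pattern)] == pattern:
            return mime
    return None
```

One caution when reusing the table: the cabinet signature is usually documented as "MSCF", so the spelling "MCSF" in the message may be a transposition and is worth checking against the committed XML.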
[jira] [Commented] (TIKA-1881) On updating mime magic for existing mime types
[ https://issues.apache.org/jira/browse/TIKA-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173524#comment-15173524 ] ASF GitHub Bot commented on TIKA-1881: -- GitHub user NamithaGS opened a pull request: https://github.com/apache/tika/pull/83 Fix for TIKA-1881 Updated Mime-Magic for 6 mime types: 1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 EPSF-3.0". 2. application/wordperfect: files begin with pattern "ÿWPC" . 3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at the beginning of files of tiff mime type) 4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50) 6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50) You can merge this pull request into a Git repository by running: $ git pull https://github.com/NamithaGS/tika master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/83.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #83 commit 780100767e24505a24595ea6db43978d0700e220 Author: NamithaGSDate: 2016-03-01T07:21:28Z Update tika-mimetypes.xml Updated Mime-Magic for 6 mime types: 1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 EPSF-3.0". 2. application/wordperfect: files begin with pattern "ÿWPC" . 3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at the beginning of files of tiff mime type) 4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50) 6. 
application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50) > On updating mime magic for existing mime types > -- > > Key: TIKA-1881 > URL: https://issues.apache.org/jira/browse/TIKA-1881 > Project: Tika > Issue Type: Improvement > Components: mime >Affects Versions: 1.11 >Reporter: Namitha Sanjeeva Ganiga >Priority: Minor > Labels: mime > Fix For: 1.11 > > > Updated Mime-Magic for 6 mime types: > 1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 > EPSF-3.0". > 2. application/wordperfect: files begin with pattern "ÿWPC" . > 3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at > the beginning of files of tiff mime type) > 4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) > 5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50) > 6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50) > https://github.com/NamithaGS/tika/commit/780100767e24505a24595ea6db43978d0700e220 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
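Three of the patterns above are matched over a range of offsets ("from byte offset 5 to 50") rather than at one fixed position. The sketch below shows one way to express that in Python, assuming the range bounds the position where the pattern starts and that both ends are inclusive; the message does not state this, and `match_in_range` is a hypothetical name, not a Tika API.

```python
def match_in_range(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """Return True if pattern begins at some offset in [start, end], inclusive."""
    # bytes.find limits the searched slice to data[start:stop], so stop is
    # extended by the pattern length to allow a match that begins at `end`.
    return data.find(pattern, start, end + len(pattern)) != -1
```

For example, `match_in_range(doc, b"rss", 5, 50)` accepts a feed whose `<rss` element follows a short XML declaration, while ignoring the same three letters at offset 0 or far into the body.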
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186434#comment-15186434 ] ASF GitHub Bot commented on TIKA-1508: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/91 TIKA-1508 : Add uniformity to parser parameter configuration - contributed by Thamme Gowda 1. Added `Configurable` interface. This can be used for all services like `Parser`, `Detector` which can take configurable parameters. 2. Added `ConfigurableParser` interface which extends `Parser` interface. I didn't add a new method to the existing `Parser` because that would break compatibility. 3. `AbstractParser` extends `ConfigurableParser` and has a default implementation for the configure() contract. I think it is safe to do so and it doesn't break anything. In addition, all parsers which extend `AbstractParser` can easily access config from TikaConfig if they want to. 4. Added a TODO to `TikaConfig`; after this, it should allow multiple instances of the same parser with different runtime configurations. 5. `TikaConfig` is modified to detect whether an instance can be configured; if so, it checks whether params are available in the XML file, parses the params, and invokes the configure(ctx) method with these params. 6. Added `DummyConfigurableParser` that simply copies parameters to metadata for the sake of testing. 7. Added a sample XML config file for testing. Added `ConfigurableParserTest` that performs an end-to-end test of all the above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1508 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/91.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #91 commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c Author: Thamme Gowda Date: 2016-03-09T02:23:19Z Add uniformity to parser parameter configuration. 1. Added Configurable interface. This can be used for all services like Parser, Detector which can take configurable parameters. 2. Added ConfigurableParser interface which extends Parser interface. I didn't add a new method to the existing Parser because that would break compatibility. 3. AbstractParser extends ConfigurableParser and has a default implementation for the configure() contract. I think it is safe to do so and it doesn't break anything. In addition, all parsers which extend AbstractParser can easily access config from TikaConfig if they want to. 4. Added a TODO to TikaConfig; after this, it should allow multiple instances of the same parser with different runtime configurations. 5. TikaConfig is modified to detect whether an instance can be configured; if so, it checks whether params are available in the XML file, parses the params, and invokes the configure(ctx) method with these params. 6. Added DummyConfigurableParser that simply copies parameters to metadata for the sake of testing. 7. Added a sample XML config file for testing. Added ConfigurableParserTest that performs an end-to-end test of all the above. commit ae51417d8881dd90b921f02c2677a7d5bfd69a30 Author: Thamme Gowda Date: 2016-03-09T03:23:47Z remove unwanted TODO: > Add uniformity to parser parameter configuration > > > Key: TIKA-1508 > URL: https://issues.apache.org/jira/browse/TIKA-1508 > Project: Tika > Issue Type: Improvement > Reporter: Tim Allison > Fix For: 1.13 > > > We can currently configure parsers by the following means: > 1) programmatically by direct calls to the parsers or their config objects > 2) sending in a config object through the ParseContext > 3) modifying .properties files for specific parsers (e.g. PDFParser) > Rather than scattering the landscape with .properties files for each parser, > it would be great if we could specify parser parameters in the main config > file, something along the lines of this: > {noformat} > > > 2 > something or other > > audio/basic > audio/x-aiff > audio/x-wav > > {noformat}
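The proposal above is to read per-parser parameters from the main config file and hand them to a configure step. The XML markup of the quoted example did not survive in this archive, so the sketch below uses a hypothetical layout: the element names `parser`, `params`, `param`, the `class` and `name` attributes, the class name `org.example.DummyParser`, and the parameter names are all assumptions, not the format the pull request defines. Only the values "2", "something or other", and "audio/basic" come from the message. It illustrates the parse-then-configure idea in Python and is not Tika code.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout, for illustration only.
SAMPLE = """
<properties>
  <parsers>
    <parser class="org.example.DummyParser">
      <params>
        <param name="maxDepth">2</param>
        <param name="label">something or other</param>
      </params>
      <mime>audio/basic</mime>
    </parser>
  </parsers>
</properties>
"""


def read_parser_params(xml_text: str) -> dict:
    """Map each parser class name to a dict of its param name/value pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for parser in root.iter("parser"):
        # Values stay as strings; type conversion is left to the consumer.
        params = {p.get("name"): (p.text or "").strip() for p in parser.iter("param")}
        result[parser.get("class")] = params
    return result
```

The resulting dictionary is what a configure-style hook would receive for each parser instance, which keeps all parameters in one file instead of one .properties file per parser.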
[jira] [Commented] (TIKA-1916) NPE in OpenDocumentParser
[ https://issues.apache.org/jira/browse/TIKA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219246#comment-15219246 ] ASF GitHub Bot commented on TIKA-1916: -- GitHub user fxfixer opened a pull request: https://github.com/apache/tika/pull/94 TIKA-1916: NPE in OpenDocumentParser NPE in OpenDocumentParser when no "meta.xml" file exists You can merge this pull request into a Git repository by running: $ git pull https://github.com/fxfixer/tika patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/94.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #94 commit cb842bc9426e0a7de92eb93ac165f364af51da92 Author: fxfixer Date: 2016-03-31T03:02:42Z TIKA-1916: NPE in OpenDocumentParser NPE in OpenDocumentParser when no "meta.xml" file exists > NPE in OpenDocumentParser > - > > Key: TIKA-1916 > URL: https://issues.apache.org/jira/browse/TIKA-1916 > Project: Tika > Issue Type: Bug > Components: parser > Affects Versions: 1.12 > Reporter: Nick C > Priority: Trivial > Labels: patch > Fix For: 1.13 > > > NPE in OpenDocumentParser when no "meta.xml" file exists
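The failure described above comes from assuming that every OpenDocument archive contains a "meta.xml" entry. The patch itself is not shown in this message, so the following is only a Python sketch of the general guard, checking for the entry before reading it. The function names `read_meta_xml` and `make_archive` are hypothetical and exist only for this illustration.

```python
import io
import zipfile
from typing import Optional


def read_meta_xml(archive_bytes: bytes) -> Optional[bytes]:
    """Return the contents of meta.xml, or None when the entry is absent."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        if "meta.xml" not in archive.namelist():
            return None  # explicit guard instead of using a missing entry
        return archive.read("meta.xml")


def make_archive(entries: dict) -> bytes:
    """Build an in-memory zip from a name-to-text mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in entries.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Callers then treat a `None` result as "no metadata available" and continue with the document content rather than failing.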
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237656#comment-15237656 ] ASF GitHub Bot commented on TIKA-1943: -- Github user reevapp closed the pull request at: https://github.com/apache/tika/pull/101 > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237654#comment-15237654 ] ASF GitHub Bot commented on TIKA-1943: -- Github user reevapp closed the pull request at: https://github.com/apache/tika/pull/103 > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237655#comment-15237655 ] ASF GitHub Bot commented on TIKA-1943: -- Github user reevapp closed the pull request at: https://github.com/apache/tika/pull/102 > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237662#comment-15237662 ] ASF GitHub Bot commented on TIKA-1943: -- GitHub user reevapp opened a pull request: https://github.com/apache/tika/pull/106 fix for TIKA-1943 contributed by Mark Duske Support for Yandex "Translate API" Service You can merge this pull request into a Git repository by running: $ git pull https://github.com/reevapp/tika master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #106 commit 08e932bd75d3a8922d04fe30e7e097b6e84e6dfd Author: ReEvApp - Re-Evolution Applications, LLCDate: 2016-04-12T17:58:54Z fix for TIKA-1943 contributed by Mark Duske Includes Unit Tests for support to Yandex Translate API commit f509917b56ea86d89102b4dae983ad82cd3fbe89 Author: ReEvApp - Re-Evolution Applications, LLC Date: 2016-04-12T18:00:22Z fix for TIKA-1943 contributed by Mark Duske Properties file used by YandexTranslator commit 86145d99df22f6f75f0602e984872bc0ef7e53f1 Author: ReEvApp - Re-Evolution Applications, LLC Date: 2016-04-12T18:01:39Z fix for TIKA-1943 contributed by Mark Duske Includes support for Yandex Translate API > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1941) Can only respond correctly to its first request and cannot assign a User-Key dynamically
[ https://issues.apache.org/jira/browse/TIKA-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233286#comment-15233286 ] ASF GitHub Bot commented on TIKA-1941: -- GitHub user reevapp opened a pull request: https://github.com/apache/tika/pull/100 fix for TIKA-1941 contributed by Mark Duske Class made thread-safe; it now allows a Lingo24 User-Key to be assigned dynamically You can merge this pull request into a Git repository by running: $ git pull https://github.com/reevapp/tika patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/100.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #100 commit 2abf9f1abca5bc3e85a8015adc507c52938cbf74 Author: ReEvApp - Re-Evolution Applications, LLC Date: 2016-04-09T01:49:58Z fix for TIKA-1941 contributed by Mark Duske Class made thread-safe; it now allows a Lingo24 User-Key to be assigned dynamically > Can only respond correctly to its first request and cannot assign a User-Key > dynamically > > > Key: TIKA-1941 > URL: https://issues.apache.org/jira/browse/TIKA-1941 > Project: Tika > Issue Type: Bug > Components: translation > Affects Versions: 1.12 > Reporter: Mark Duske > Fix For: 1.12 > > Original Estimate: 24h > Remaining Estimate: 24h > > It is impossible to assign a User-Key dynamically; it must be in the properties file > in the JAR. After a User-Key is set, the class responds correctly only to the > first request; subsequent requests receive a non-JSON message that only > says that the selected language is not supported.
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233652#comment-15233652 ] ASF GitHub Bot commented on TIKA-1943: -- GitHub user reevapp opened a pull request: https://github.com/apache/tika/pull/102 fix for TIKA-1943 contributed by Mark Duske Unit tests for YandexTranslator class You can merge this pull request into a Git repository by running: $ git pull https://github.com/reevapp/tika patch-4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/102.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #102 commit aac7822ee800044c5512a824e976e7063170f4e5 Author: ReEvApp - Re-Evolution Applications, LLCDate: 2016-04-09T17:29:03Z fix for TIKA-1943 contributed by Mark Duske Unit tests for YandexTranslator class > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233651#comment-15233651 ] ASF GitHub Bot commented on TIKA-1943: -- GitHub user reevapp opened a pull request: https://github.com/apache/tika/pull/101 fix for TIKA-1943 contributed by Mark Duske Includes support for Yandex Translate API You can merge this pull request into a Git repository by running: $ git pull https://github.com/reevapp/tika patch-3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/101.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #101 commit 0e0ce6d47f3291c74fd5d5f083d3d162d8b2abe5 Author: ReEvApp - Re-Evolution Applications, LLCDate: 2016-04-09T17:27:05Z fix for TIKA-1943 contributed by Mark Duske Includes support for Yandex Translate API > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1943) Include support for Yandex Translate API
[ https://issues.apache.org/jira/browse/TIKA-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233653#comment-15233653 ] ASF GitHub Bot commented on TIKA-1943: -- GitHub user reevapp opened a pull request: https://github.com/apache/tika/pull/103 fix for TIKA-1943 contributed by Mark Duske Properties file used by YandexTranslator You can merge this pull request into a Git repository by running: $ git pull https://github.com/reevapp/tika patch-5 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/103.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #103 commit a374748169248a12d81817cbc26dbe037d6f9c3d Author: ReEvApp - Re-Evolution Applications, LLCDate: 2016-04-09T17:34:05Z fix for TIKA-1943 contributed by Mark Duske Properties file used by YandexTranslator > Include support for Yandex Translate API > > > Key: TIKA-1943 > URL: https://issues.apache.org/jira/browse/TIKA-1943 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.12 >Reporter: Mark Duske >Priority: Minor > Original Estimate: 4h > Remaining Estimate: 4h > > Include support for Yandex' Translate API service available at > https://tech.yandex.com/translate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205245#comment-15205245 ] ASF GitHub Bot commented on TIKA-774: - GitHub user rgauss opened a pull request: https://github.com/apache/tika/pull/92 TIKA-774: ExifTool Parser Contribution of tika-exiftool for review You can merge this pull request into a Git repository by running: $ git pull https://github.com/Alfresco/tika tika-exiftool Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/92.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #92 commit 8eb474b06e1463ca172128b59b713782eb4bece8 Author: rgauss Date: 2016-03-19T20:37:37Z Initial commit of tika-exiftool as is commit 5ff139d68bebd39382d5ed9626bff42797ece01d Author: rgauss Date: 2016-03-19T22:44:00Z Added git ignore of properties override commit c8f4fb062ce809661527c91df89b230da95f592c Author: rgauss Date: 2016-03-21T18:49:38Z Merge branch 'master' into tika-exiftool commit e8a2fa30b16f8b947d118b61ca12476420e9bee0 Author: rgauss Date: 2016-03-21T21:24:29Z TIKA-774: ExifTool Parser - Moved tika-exiftool from separate project to parsers - Updated license headers - Removed author Javadoc - Fixed a few forbiddenapi violations commit 37aae337c5ca3b5a45c2e45804e3768e08a8bbb6 Author: rgauss Date: 2016-03-21T21:31:31Z TIKA-774: ExifTool Parser - Removed more author Javadocs commit 90f8550c03aa873a81975dfa10cfd77aa557fc6f Author: rgauss Date: 2016-03-21T22:00:00Z TIKA-774: ExifTool Parser - Renamed ExecutableUtils to ExiftoolExecutableUtils - Changed ExifToolImageParserTest to skip when exiftool is not available > ExifTool Parser > --- > > Key: TIKA-774 > URL: https://issues.apache.org/jira/browse/TIKA-774 > Project: Tika > Issue Type: New Feature > Components: parser > Affects Versions: 1.0 > Environment: Requires ExifTool to be installed > (http://www.sno.phy.queensu.ca/~phil/exiftool/) > Reporter: Ray Gauss II > Labels: features, new-parser, newbie, patch > Fix For: 1.13 > > Attachments: testJPEG_IPTC_EXT.jpg, > tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt > > > Adds an external parser that calls ExifTool to extract extended metadata > fields from images and other content types. > In the core project: > An ExifTool interface is added which contains Property objects that define > the metadata fields available. > An additional Property constructor is added for the internalTextBag type. > In the parsers project: > An ExiftoolMetadataExtractor is added which does the work of calling ExifTool > on the command line and mapping the response to Tika metadata fields. This > extractor could be called instead of or in addition to the existing > ImageMetadataExtractor and JempboxExtractor under TiffParser and/or > JpegParser, but those have not been changed at this time. > An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. > An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool > metadata fields to existing Tika and Drew Noakes metadata fields if enabled. > An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag > implementations in XML files. > An ExifToolParserTest is added which tests several expected XMP and IPTC > metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182129#comment-15182129 ]

ASF GitHub Bot commented on TIKA-1877:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/81

> On updating the tika-mimetypes.xml to detect .fts file format, tika detector
> does not return anything
> ----------------------------------------------------------------------------
>
>     Key: TIKA-1877
>     URL: https://issues.apache.org/jira/browse/TIKA-1877
>     Project: Tika
>     Issue Type: Bug
>     Components: mime
>     Reporter: Prasad Nagaraj Subramanya
>     Priority: Minor
>     Attachments:
>         3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079,
>         4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984,
>         tika-mimetypes.xml
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE =
> T".
> Tika detected a .fts file as application/octet-stream. On verifying the
> header I found the value to be "SIMPLE =T"(just 16 spaces
> before = and T)
> I tried the following changes-
> Change 1) Updated the existing match value. But the build failed
> Change 2) Added a new match value type="string" offset="0"/> after the existing one.
> But now, tika returns empty value. It neither identifies the file as .fts nor
> as application/octet-stream.
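The report above comes down to a fixed-string magic value failing because the number of spaces around `=` in the file header differs from the pattern. A minimal sketch of a whitespace-tolerant check follows, in plain Java. It is not Tika's magic matcher and not the fix that was applied; the class name `FitsHeaderSketch`, the method `looksLikeFits`, and the regular expression are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class FitsHeaderSketch {

    // Hypothetical pattern: "SIMPLE" at offset 0, then any run of spaces,
    // "=", any run of spaces, and the logical value "T".
    private static final Pattern SIMPLE_CARD = Pattern.compile("^SIMPLE *= *T");

    /** Returns true if the first bytes of the file look like a FITS "SIMPLE = T" card. */
    static boolean looksLikeFits(byte[] header) {
        // Only the first 80 bytes (one header card) are inspected.
        int len = Math.min(header.length, 80);
        String firstCard = new String(header, 0, len, StandardCharsets.US_ASCII);
        return SIMPLE_CARD.matcher(firstCard).find();
    }

    public static void main(String[] args) {
        byte[] padded = "SIMPLE  =                    T".getBytes(StandardCharsets.US_ASCII);
        byte[] tight = "SIMPLE =T".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeFits(padded));
        System.out.println(looksLikeFits(tight));
    }
}
```

Matching on a pattern rather than an exact byte string trades a little strictness for tolerance of writers that pad the card differently.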
[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175785#comment-15175785 ]

ASF GitHub Bot commented on TIKA-1816:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/84

> Lenient testing for NamedEntityParser
> -------------------------------------
>
>     Key: TIKA-1816
>     URL: https://issues.apache.org/jira/browse/TIKA-1816
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Reporter: Thamme Gowda N
>     Labels: memex
>     Fix For: 1.13
>     Attachments: TIKA-1816-proxy-fix.patch
>
> NamedEntityParser has a hard setup requirement like downloading of NER models
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests
> instead of failing the whole build process.
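The "skip instead of fail" policy described in TIKA-1816 can be sketched in plain Java as follows. This is an illustration of the idea only, not the code from the pull request; `LenientTestSketch`, `Outcome`, `runIf`, and `onClasspath` are hypothetical names. In a JUnit 4 test the same effect is usually obtained with `org.junit.Assume.assumeTrue(...)`.

```java
import java.util.function.BooleanSupplier;

final class LenientTestSketch {

    enum Outcome { PASSED, FAILED, SKIPPED }

    /** Returns true if the named resource (for example a model file) is on the classpath. */
    static boolean onClasspath(String resourcePath) {
        return LenientTestSketch.class.getClassLoader().getResource(resourcePath) != null;
    }

    /**
     * Runs the test body only when its precondition holds.
     * A missing prerequisite yields SKIPPED rather than FAILED,
     * so it does not break the whole build.
     */
    static Outcome runIf(BooleanSupplier precondition, BooleanSupplier testBody) {
        if (!precondition.getAsBoolean()) {
            return Outcome.SKIPPED;
        }
        return testBody.getAsBoolean() ? Outcome.PASSED : Outcome.FAILED;
    }

    public static void main(String[] args) {
        // Hypothetical model path; when it is absent the test is skipped.
        Outcome outcome = runIf(() -> onClasspath("models/ner-model.bin"), () -> true);
        System.out.println(outcome);
    }
}
```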
[jira] [Commented] (TIKA-1876) Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition
[ https://issues.apache.org/jira/browse/TIKA-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175077#comment-15175077 ]

ASF GitHub Bot commented on TIKA-1876:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/80

> Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity
> Recognition
> ---------------------------------------------------------------------------
>
>     Key: TIKA-1876
>     URL: https://issues.apache.org/jira/browse/TIKA-1876
>     Project: Tika
>     Issue Type: New Feature
>     Components: parser
>     Affects Versions: 1.13
>     Reporter: Manali Shah
>     Assignee: Chris A. Mattmann
>     Fix For: 1.13
>     Original Estimate: 168h
>     Remaining Estimate: 168h
>
> Hi all,
> Apache Tika already performs Named Entity Recognition using Open NLP and
> Stanford Core NLP. Natural Language Toolkit is another open source python
> library and I believe it will be a great idea to have NLTK integrated along
> with Tika.
> NLTK can extract NER as well as classify them. For this purpose I, along with
> Prof Chris Mattmann have published NLTKRest, a python pip/setuptools
> installable module that exposes NLTK as a REST service.
> I have tested the working of Tika along with NLTKRest on my local repository
> and will soon submit a pull request.
> Link to rest server: https://github.com/manalishah/NLTKRest
[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175318#comment-15175318 ]

ASF GitHub Bot commented on TIKA-1816:
--------------------------------------

GitHub user thammegowda opened a pull request:

    https://github.com/apache/tika/pull/84

    TIKA1816 : NER model download via maven proxy ( from 1.x to 2.x)

    This PR brings proxy based downloading feature from 1.x branch to 2.x
    Closes TIKA-1816

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/tika 2.x-TIKA-1816

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/84.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #84

commit c4feaff19187f548730f48a77fc437ca12bb40b4
Author: Thamme Gowda
Date:   2016-03-02T09:12:26Z

    Copy Proxy download fix to 2.x

> Lenient testing for NamedEntityParser
> -------------------------------------
>
>     Key: TIKA-1816
>     URL: https://issues.apache.org/jira/browse/TIKA-1816
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Reporter: Thamme Gowda N
>     Labels: memex
>     Fix For: 1.13
>     Attachments: TIKA-1816-proxy-fix.patch
>
> NamedEntityParser has a hard setup requirement like downloading of NER models
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests
> instead of failing the whole build process.
[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX
[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177637#comment-15177637 ]

ASF GitHub Bot commented on TIKA-1841:
--------------------------------------

GitHub user zetisam opened a pull request:

    https://github.com/apache/tika/pull/86

    fix for TIKA-1841 contributed by zetisam

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zetisam/tika TIKA-1841

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #86

commit ea82d8538dbd7a1f68d4d290ad0c115f62b29c76
Author: Sam Heijens
Date:   2016-02-15T15:09:51Z

    fix for TIKA-1841 contributed by zetisam

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>     Key: TIKA-1841
>     URL: https://issues.apache.org/jira/browse/TIKA-1841
>     Project: Tika
>     Issue Type: Improvement
>     Components: parser
>     Affects Versions: 1.11
>     Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is
> different.
> The structure for PPTX seems as follows:
> {code}
>
>
> //optional
> //optional
> ...
>
>
> //optional
> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of
> each slide.
> For powerpoint the structure is as follows:
> {code}
>
>
>
>
> //added in TIKA-1840
>
>
> ...
>
>
>
> //added in TIKA-1840
>
>
>
>
> {code}
> In my application, I'm using XPath to get the desired information . As the
> XML structure is different, I have to differentiate my XPath queries whether
> the file is PPT (old) or PPTX (new). It would be nice for Tika to return the
> same XML for both.
> I would propose changing the XML structure to this:
> {code}
>
>
>
>
> //added in TIKA-1840
>
>
> ...
>
>
>
> //added in TIKA-1840
>
>
>
> {code}
> So, essentially, like the current PPT output, but without the list of notes
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm
> willing to donate my time.
[jira] [Commented] (TIKA-1883) Identification of Mime Type for Empty Files
[ https://issues.apache.org/jira/browse/TIKA-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178469#comment-15178469 ]

ASF GitHub Bot commented on TIKA-1883:
--------------------------------------

GitHub user adityardesai opened a pull request:

    https://github.com/apache/tika/pull/87

    Fix for TIKA-1883 and 1884

    TIKA 1883
    Identification of Mime types for empty files, updating TIKA 1.12 source
    code to fix this issue. The Tika Detector and Parsers have been modified
    accordingly to identify the empty files and classify them.

    TIKA 1884
    Updating Tika's Mime Repository with the following file types
    1. .sfdu - Standard Formatted Data Unit
    2. .CDF - Common Data Format having magic byte CDF with 0 offset
    Tika Mime Repository is updated with these file types.

    The updated codes is available at
    https://github.com/RashmiNalwad/MIME-Type-Identification-of-TREC-POLAR-DATASET

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/tika master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/87.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #87

commit 1a3749fa632fdb8ad0bcb2cea673113031f9b4be
Author: Chris Mattmann
Date:   2015-06-25T17:54:55Z

    Fix for TIKA-1659 ZipContainerDetector does not detect all IPA files contributed by Rami Shomali this closes #51.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1687594 13f79535-47bb-0310-9956-ffa450edef68

commit 90a2202b5b4a75e7f673bfb42a912cb97ae6d26e
Author: Tim Allison
Date:   2015-06-28T01:57:30Z

    TIKA-1663 add a DigestingParser

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1687981 13f79535-47bb-0310-9956-ffa450edef68

commit 444dadd5eb090f6e2998507e444b2014905cb90f
Author: Chris Mattmann
Date:   2015-06-29T05:19:48Z

    Fix for TIKA-1664: GDALParser now correctly sets nitf as a supported media type contributed by Joseph North this closes #53.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688086 13f79535-47bb-0310-9956-ffa450edef68

commit 761273f9e69c4a7595e50ccd6a2d9304c398d0b1
Author: Chris Mattmann
Date:   2015-06-29T05:26:52Z

    Fix for TIKA-1669: xpath node test ./node() should match all contained nodes contributed by WulfB this closes #52

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688087 13f79535-47bb-0310-9956-ffa450edef68

commit fd8514c2c512d9dcc1039aadf1dbc64c1ff6d3fc
Author: Chris Mattmann
Date:   2015-06-29T14:34:29Z

    Rollback r1688087 as it seems to cause some tests to fail.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688239 13f79535-47bb-0310-9956-ffa450edef68

commit 2a47d9aa340d529f027c94f3c233645fb2f8bf7e
Author: Tim Allison
Date:   2015-06-30T00:48:03Z

    TIKA-1601: integrate Jackcess to parse MSAccess files

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688337 13f79535-47bb-0310-9956-ffa450edef68

commit 06cfbaafeb308bd979fd2214a4b1a15353a9b4ab
Author: Chris Mattmann
Date:   2015-07-01T13:21:41Z

    Fix for TIKA-1602: Detecting standards-non-compliant emails as message/rfc822 contributed by Jeremy B. Merrill this closes #40.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688647 13f79535-47bb-0310-9956-ffa450edef68

commit 425506e90500dadcccf82fd66aa15ce14d23facc
Author: Tyler Palsulich
Date:   2015-07-02T08:13:00Z

    TIKA-1536. Upgrade to Java 1.7.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688779 13f79535-47bb-0310-9956-ffa450edef68

commit 4695df5672492c38a8abcd230c8545f982a7f65d
Author: Tyler Palsulich
Date:   2015-07-02T08:14:48Z

    TIKA-1536. Update CHANGES.txt with upgrade to Java 7.

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688780 13f79535-47bb-0310-9956-ffa450edef68

commit de5a2dec6924ebe01e4bf323a98abd208cf9aa7e
Author: Nick Burch
Date:   2015-07-02T10:35:06Z

    Remove change comment, TIKA-1602

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688805 13f79535-47bb-0310-9956-ffa450edef68

commit 2764fb8606964c3350c781ecf5df4042706b4099
Author: Tim Allison
Date:   2015-07-02T13:47:23Z

    TIKA-1673 drop source file name from embedded file path; made a few java 7 updates; added timing for embedded docs

    git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1688827 13f79535-47bb-0310-9956-ffa450edef68

commit
[jira] [Commented] (TIKA-1926) JSON TEI Exception
[ https://issues.apache.org/jira/browse/TIKA-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223560#comment-15223560 ]

ASF GitHub Bot commented on TIKA-1926:
--------------------------------------

GitHub user hasanayesha opened a pull request:

    https://github.com/apache/tika/pull/97

    fix for TIKA-1926 contributed by hasanayesha

    JSON TEI Exception Handled.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hasanayesha/tika TIKA-1926

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #97

commit 63bb15467fcd6e8766b0361e78231f6f7f6a4a08
Author: hasanayesha
Date:   2016-04-03T23:52:10Z

    fix for TIKA-1926 contributed by hasanayesha

> JSON TEI Exception
> ------------------
>
>     Key: TIKA-1926
>     URL: https://issues.apache.org/jira/browse/TIKA-1926
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.12
>     Reporter: Ayesha Hasan
>     Priority: Minor
>     Labels: easyfix, patch
>     Fix For: 1.12
>
> JSONException being thrown by grobid when the json TEI object wasn't found.
> Fixed it by adding a try and catch block.
[jira] [Commented] (TIKA-1916) NPE in OpenDocumentParser
[ https://issues.apache.org/jira/browse/TIKA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223990#comment-15223990 ]

ASF GitHub Bot commented on TIKA-1916:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/94

> NPE in OpenDocumentParser
> -------------------------
>
>     Key: TIKA-1916
>     URL: https://issues.apache.org/jira/browse/TIKA-1916
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.12
>     Reporter: Nick C
>     Assignee: Tim Allison
>     Priority: Trivial
>     Labels: patch
>     Fix For: 1.13
>     Attachments: MissingMeta.odt
>
> NPE in OpenDocumentParser when no "meta.xml" file exists
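An OpenDocument file is a zip archive, and `java.util.zip.ZipFile.getEntry` returns `null` when the requested entry is absent, which is one plausible way a missing "meta.xml" turns into a NullPointerException. The sketch below shows a null-safe lookup in plain Java. It is not the OpenDocumentParser code or the patch from the pull request; `MissingMetaSketch`, `readEntry`, and `tempZip` are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class MissingMetaSketch {

    /** Returns the bytes of the named entry, or empty if the archive has no such entry. */
    static Optional<byte[]> readEntry(File archive, String name) {
        try (ZipFile zip = new ZipFile(archive)) {
            ZipEntry entry = zip.getEntry(name); // null when the entry is absent
            if (entry == null) {
                return Optional.empty();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return Optional.of(in.readAllBytes());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes a temporary zip with the given entry names, each holding one byte. */
    static File tempZip(String... names) {
        try {
            File file = File.createTempFile("odf-sketch", ".zip");
            file.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(file.toPath()))) {
                for (String name : names) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write('x');
                    out.closeEntry();
                }
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File noMeta = tempZip("content.xml");
        // The lookup reports absence instead of throwing.
        System.out.println(readEntry(noMeta, "meta.xml").isPresent());
    }
}
```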
[jira] [Commented] (TIKA-1927) NPE in JDBCTableReader
[ https://issues.apache.org/jira/browse/TIKA-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224207#comment-15224207 ]

ASF GitHub Bot commented on TIKA-1927:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/98

> NPE in JDBCTableReader
> ----------------------
>
>     Key: TIKA-1927
>     URL: https://issues.apache.org/jira/browse/TIKA-1927
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.12
>     Reporter: Nick C
>     Assignee: Tim Allison
>     Priority: Minor
>     Labels: easyfix, patch
>     Fix For: 1.13
>
> NPE when there is a null String in a SQLite DB.
> Caused by: java.lang.NullPointerException
>       at org.apache.tika.parser.jdbc.JDBCTableReader.addAllCharacters(JDBCTableReader.java:252)
>       at org.apache.tika.parser.jdbc.JDBCTableReader.handleCell(JDBCTableReader.java:135)
>       at org.apache.tika.parser.jdbc.JDBCTableReader.nextRow(JDBCTableReader.java:95)
>       at org.apache.tika.parser.jdbc.AbstractDBParser.parse(AbstractDBParser.java:90)
>       at org.apache.tika.parser.jdbc.SQLite3Parser.parse(SQLite3Parser.java:78)
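The trace above points at `addAllCharacters` being handed a null cell value. A minimal sketch of the usual guard follows. It is a stand-in, not the actual JDBCTableReader method or the committed fix: `NullCellSketch` is a hypothetical class, and a `StringBuilder` takes the place of the SAX handler that the real code writes to.

```java
final class NullCellSketch {

    /**
     * Appends the characters of a cell value to the sink.
     * A SQL NULL arrives here as a Java null; it is treated as "no text"
     * instead of being dereferenced.
     */
    static void addAllCharacters(String value, StringBuilder sink) {
        if (value == null) {
            return; // nothing to emit for a NULL cell
        }
        char[] chars = value.toCharArray();
        sink.append(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        addAllCharacters("abc", sink);
        addAllCharacters(null, sink); // would throw without the guard
        addAllCharacters("def", sink);
        System.out.println(sink);
    }
}
```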
[jira] [Commented] (TIKA-1925) Composite External Parser like Exiftool fails to run on Windows.
[ https://issues.apache.org/jira/browse/TIKA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222655#comment-15222655 ]

ASF GitHub Bot commented on TIKA-1925:
--------------------------------------

GitHub user mit2nil opened a pull request:

    https://github.com/apache/tika/pull/96

    fix for TIKA-1925 contributed by Nilay Chheda

    @chrismattmann Please review the change and let me know they can be
    contributed back to Tika.

    Issue description:
    [https://issues.apache.org/jira/browse/TIKA-1925](https://issues.apache.org/jira/browse/TIKA-1925)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mit2nil/tika TIKA-1925

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/96.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #96

commit c6e2b028beed78e66a80e7e22bc5d9f74b240dbe
Author: mit2nil
Date:   2016-04-02T02:11:08Z

    fix for TIKA-1925 contributed by Nilay Chheda

> Composite External Parser like Exiftool fails to run on Windows.
> ----------------------------------------------------------------
>
>     Key: TIKA-1925
>     URL: https://issues.apache.org/jira/browse/TIKA-1925
>     Project: Tika
>     Issue Type: Bug
>     Components: core
>     Affects Versions: 1.12
>     Environment: Windows 10, Intel i7 6550U 64-Bit processor
>     Reporter: Nilay Chheda
>     Fix For: 1.13
>     Attachments: ExternalParser_modified.java, ExternalParser_orig.java
>
> While trying to run EXIFTool Parser using Tika on Windows OS, we are getting
> following error output.
> (Ref: http://wiki.apache.org/tika/EXIFToolParser)
> java.io.IOException: Cannot run program "env": CreateProcess error=2, The
> system cannot find the file specified
>       at java.lang.ProcessBuilder.start(Unknown Source)
>       at java.lang.Runtime.exec(Unknown Source)
>       at java.lang.Runtime.exec(Unknown Source)
>       at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:182)
>       at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:145)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> Caused by: java.io.IOException: CreateProcess error=2, The system cannot find
> the file specified
>       at java.lang.ProcessImpl.create(Native Method)
>       at java.lang.ProcessImpl.<init>(Unknown Source)
>       at java.lang.ProcessImpl.start(Unknown Source)
>       ... 13 more
> Exception in thread "main" org.apache.tika.exception.TikaException:
> Unexpected RuntimeException from
> org.apache.tika.parser.external.ExternalParser@51efea79
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> Caused by: java.lang.NullPointerException
>       at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:218)
>       at org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:145)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 7 more
> After analyzing the stack trace and little experimentation, we found that
> "env" is unix/Mac OS X/Linux specific command and does not work on Windows.
> We were able to workaround this problem by adding some Windows specific code,
> recompile Tika and run again with similar setup. I am attaching the original
> file and modified file for review.
> If fix is acceptable by Tika specific standards, I can send the pull request
> on Github to contribute the patch.
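The failure above is `Runtime.exec` being asked to start `env`, which does not exist on Windows. One way to make the launch command OS-aware is sketched below. This is an assumption about the general shape of such a workaround, not the contents of the attached ExternalParser_modified.java; `OsAwareCommandSketch`, `isWindows`, and `wrapCommand` are hypothetical names, and routing through `cmd /c` on Windows is one option among several.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class OsAwareCommandSketch {

    /** Returns true if the given os.name value denotes a Windows system. */
    static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * Prefixes the external command with a launcher that exists on the platform:
     * "cmd /c" on Windows, "env" elsewhere.
     */
    static List<String> wrapCommand(String osName, List<String> command) {
        List<String> full = new ArrayList<>();
        if (isWindows(osName)) {
            full.add("cmd");
            full.add("/c");
        } else {
            full.add("env");
        }
        full.addAll(command);
        return full;
    }

    public static void main(String[] args) {
        // The OS name comes from the JVM; "exiftool -ver" is an example command.
        String osName = System.getProperty("os.name");
        System.out.println(wrapCommand(osName, List.of("exiftool", "-ver")));
    }
}
```

The resulting list can be passed to `new ProcessBuilder(list)`; taking `osName` as a parameter keeps the selection logic testable on any platform.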
[jira] [Commented] (TIKA-1914) ExecutableParser doesn't call start document
[ https://issues.apache.org/jira/browse/TIKA-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224842#comment-15224842 ]

ASF GitHub Bot commented on TIKA-1914:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/93

> ExecutableParser doesn't call start document
> --------------------------------------------
>
>     Key: TIKA-1914
>     URL: https://issues.apache.org/jira/browse/TIKA-1914
>     Project: Tika
>     Issue Type: Bug
>     Components: parser
>     Affects Versions: 1.12
>     Reporter: Nick C
>     Priority: Trivial
>     Labels: patch
>     Fix For: 1.13
>
> The ExecutableParser doesn't call start document which causes errors when
> producing XHTML
[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files
[ https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257333#comment-15257333 ]

ASF GitHub Bot commented on TIKA-1893:
--------------------------------------

GitHub user mkampasi opened a pull request:

    https://github.com/apache/tika/pull/110

    fix for TIKA-1893 contributed by mkampasi

    Added a custom parser class for parsing Apple ICNS files.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mkampasi/tika master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #110

commit 0cdf17df913cadf23c8074d46d386fa230f8198c
Author: mkampasi
Date:   2016-04-26T00:20:38Z

    Adding parser for ICNS files

> Add new mimetype for *.icns (Apple Icon Image Format) files
> -----------------------------------------------------------
>
>     Key: TIKA-1893
>     URL: https://issues.apache.org/jira/browse/TIKA-1893
>     Project: Tika
>     Issue Type: Improvement
>     Components: mime
>     Affects Versions: 1.11
>     Reporter: Manisha Kampasi
>     Priority: Minor
>     Labels: patch
>
> Currently, TIKA does not support the "image/icns" mime type for *.icns files
> (Apple Icon Image Format). This can be added to the tika-mimetypes.xml.
[jira] [Commented] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar
[ https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260457#comment-15260457 ]

ASF GitHub Bot commented on TIKA-1885:
--------------------------------------

Github user adeshgupta closed the pull request at:

    https://github.com/apache/tika/pull/89

> Tika MIME updates for *.cdf and *.xar and custom zero length file detector
> based on TREC-DD-Polar
> --------------------------------------------------------------------------
>
>     Key: TIKA-1885
>     URL: https://issues.apache.org/jira/browse/TIKA-1885
>     Project: Tika
>     Issue Type: Sub-task
>     Components: core, detector, mime
>     Affects Versions: 1.11
>     Environment: Windows OS X64 , Java
>     Reporter: Adesh Gupta
>     Assignee: Chris A. Mattmann
>     Priority: Critical
>     Labels: memex, nsfpolar
>     Fix For: 1.13
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD
> Polar dataset.
[jira] [Commented] (TIKA-1938) HtmlParser drops
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260771#comment-15260771 ]

ASF GitHub Bot commented on TIKA-1938:
--------------------------------------

GitHub user naegelejd opened a pull request:

    https://github.com/apache/tika/pull/111

    fix for TIKA-1938 contributed by naegelejd

    Adds HtmlParser support for tags within

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/naegelejd/tika TIKA-1938

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #111

commit b6d23c189e852fa2e41b441c18bfe3e66e3f67c4
Author: Joseph Naegele
Date:   2016-04-27T18:35:11Z

    fix for TIKA-1938 contributed by naegelejd

    add HtmlParser support for