[jira] [Created] (TIKA-3488) Security issue XXE in TIKA due to JDOM

2021-07-21 Thread Arvind Jagtap (Jira)
Arvind Jagtap created TIKA-3488:
---

 Summary: Security issue XXE in TIKA due to JDOM
 Key: TIKA-3488
 URL: https://issues.apache.org/jira/browse/TIKA-3488
 Project: Tika
  Issue Type: Bug
  Components: tika-server
Affects Versions: 1.25
Reporter: Arvind Jagtap


Apache Tika 1.25 is vulnerable because it depends on JDOM 2.0.6. Black Duck Hub 
reported this vulnerability as CVE-2021-33813; more detail is available on the 
following page. 

[https://nvd.nist.gov/vuln/detail/CVE-2021-33813#range-6782705]

The following issue has been filed upstream, but it is not yet fixed and no 
timeline has been given.

https://github.com/hunterhacker/jdom/issues/189

Some workarounds are discussed in that issue. Can this be fixed in Tika in the 
meantime?
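
For readers unfamiliar with this class of workaround, here is a minimal sketch of 
the general XXE-hardening pattern. It uses the JDK's built-in JAXP parser, not 
JDOM, and it is not taken from the linked JDOM issue. The class and method names 
are made up for illustration. Whether the same feature flag is the right fix for 
JDOM's SAXBuilder inside Tika is an assumption that would need checking.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Illustrative only: hardens the JDK's JAXP parser, not JDOM itself.
final class XxeHardeningSketch {

    // Returns true if the XML parses, false if the hardened parser rejects it.
    static boolean parsesSafely(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Reject any document that declares a DOCTYPE at all.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Do not expand entity references into the resulting tree.
            dbf.setExpandEntityReferences(false);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // Parser configuration, SAX and I/O failures all count as "rejected".
            return false;
        }
    }

    public static void main(String[] args) {
        String plain = "<doc>hello</doc>";
        String withDtd = "<!DOCTYPE doc [<!ENTITY x \"y\">]><doc>&x;</doc>";
        System.out.println(parsesSafely(plain));
        System.out.println(parsesSafely(withDtd));
    }
}
```

Rejecting every DOCTYPE is the bluntest option. It is only suitable where the 
parsed documents never legitimately carry a DTD.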



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)
Sebastian Nagel created TIKA-3489:
-

 Summary: Robots.txt files frequently identified as message/rfc822
 Key: TIKA-3489
 URL: https://issues.apache.org/jira/browse/TIKA-3489
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.27, 1.26, 1.25
Reporter: Sebastian Nagel
 Attachments: robots.txt

The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if the 
file starts with a "User-Agent" rule and also contains a second rule not too 
far from the beginning, e.g.:
{noformat}
User-Agent: goodbot
Disallow:

User-Agent: badbot
Disallow: /
{noformat}

The change 
[7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
 requires that two different clauses are matched. However, the two occurrences 
of "User-Agent:" (initial and after a new line) are treated as different 
instead of equivalent matches.
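
To make the distinction concrete, here is a simplified model of the intended 
behaviour. It is not Tika's magic-matching code. The header list and all names 
are hypothetical. The idea is to count distinct header names rather than distinct 
match positions, so that a repeated "User-Agent:" counts only once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of the idea only; NOT Tika's magic-matching implementation.
final class HeaderClauseSketch {

    // Hypothetical subset of mail header names, chosen for this illustration.
    private static final Set<String> MAIL_HEADERS = new HashSet<>(
            Arrays.asList("from", "received", "date", "subject", "user-agent", "message-id"));

    // Collects the distinct known header names that start a line.
    static Set<String> distinctHeaders(String text) {
        Set<String> found = new HashSet<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no "Name:" prefix on this line
            }
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (MAIL_HEADERS.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    // Treats text as mail-like only if two *different* headers are present.
    static boolean looksLikeMail(String text) {
        return distinctHeaders(text).size() >= 2;
    }
}
```

Under this model the robots.txt sample above yields a single distinct header 
("user-agent") and is not treated as mail, while a text containing both "From:" 
and "Date:" lines is.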





[jira] [Updated] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated TIKA-3489:
--
Affects Version/s: 2.0.0

> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384905#comment-17384905
 ] 

Tim Allison commented on TIKA-3489:
---

Should we try to detect robots.txt files as their own mime?

> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





[jira] [Commented] (TIKA-3153) Text File identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384913#comment-17384913
 ] 

Sebastian Nagel commented on TIKA-3153:
---

Wasn't this already resolved in 1.25?

{noformat}
$> java -jar tika-app-2.0.0.jar -d TextFileIdentifiedAsMessage.txt
text/plain
$> java -jar tika-app-1.25.jar -d TextFileIdentifiedAsMessage.txt 2>/dev/null 
text/plain
$> java -jar tika-app-1.24.jar -d TextFileIdentifiedAsMessage.txt 2>/dev/null
message/rfc822
{noformat}

> Text File identified as message/rfc822
> --
>
> Key: TIKA-3153
> URL: https://issues.apache.org/jira/browse/TIKA-3153
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.24.1
>Reporter: Akash
>Priority: Major
> Attachments: TextFileIdentifiedAsMessage.txt
>
>
> Text file containing the word Received: is identified as message/rfc822.
> We were earlier using version 1.9 and it used to identify file type properly 
> as text/plain.
> Even if multiple lines are there, if one line with Received: is present, 
> content type is incorrectly identified.
> To check we can run java -jar tika-app-1.24.1.jar 
> TextFileIdentifiedAsMessage.txt





[jira] [Commented] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2021-07-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384915#comment-17384915
 ] 

Sebastian Nagel commented on TIKA-2443:
---

Looks like this was already resolved in 1.25:
{noformat}
$> cat TIKA-2443.txt 
Date: 06/25/2014 15:54:19
foo bar
$> java -jar tika-app-2.0.0.jar -d TIKA-2443.txt
text/plain
$> java -jar tika-app-1.25.jar -d TIKA-2443.txt 2>/dev/null 
text/plain
$> java -jar tika-app-1.24.jar -d TIKA-2443.txt 2>/dev/null
message/rfc822
{noformat}

> Plain text file identified as rfc822 and which can cause StackOverflowError
> ---
>
> Key: TIKA-2443
> URL: https://issues.apache.org/jira/browse/TIKA-2443
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.11, 1.16
>Reporter: Viorica Visan
>Priority: Major
>
> I have a file called test.txt, containing only:
> Date: 06/25/2014 15:54:19
> And some more text I am writing. This will
> be detected as rfc822
> This file is detected and parsed as message/rfc822. 
> I think the magic rule on "Date: " is too strong; the file should have been 
> detected only as text/plain. It looks to me like the reverse of  
> https://issues.apache.org/jira/browse/TIKA-879 
> We noticed this issue because we have a large log file with many lines 
> containing Date, Log level and Message. It is parsed as message/rfc822 (only 
> because it starts with "Date:") and throws a StackOverflowError in the end. 
> Is there some workaround to make this rule weaker, e.g. through configuration? 
> We use DefaultParser with all defaults. We use Tika 1.11, but we also tried 
> Tika 1.16 and saw the same StackOverflowError (which probably again happened 
> because the file was parsed as message/rfc822).
> The only workaround that I found was to add 
> custom-mimetypes.xml like this
>  
> 
>   
> 
>   
> Would you recommend some other workaround to make sure the file does not get 
> parsed as rfc822? 
> And I have another question: can this custom-mimetypes.xml be specified from 
> an external location? 
> Many thanks.





[jira] [Resolved] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2443.
---
Fix Version/s: 1.25
   Resolution: Fixed

Thank you [~snagel]!

> Plain text file identified as rfc822 and which can cause StackOverflowError
> ---
>
> Key: TIKA-2443
> URL: https://issues.apache.org/jira/browse/TIKA-2443
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.11, 1.16
>Reporter: Viorica Visan
>Priority: Major
> Fix For: 1.25
>
>
> I have a file called test.txt, containing only:
> Date: 06/25/2014 15:54:19
> And some more text I am writing. This will
> be detected as rfc822
> This file is detected and parsed as message/rfc822. 
> I think the magic rule on "Date: " is too strong; the file should have been 
> detected only as text/plain. It looks to me like the reverse of  
> https://issues.apache.org/jira/browse/TIKA-879 
> We noticed this issue because we have a large log file with many lines 
> containing Date, Log level and Message. It is parsed as message/rfc822 (only 
> because it starts with "Date:") and throws a StackOverflowError in the end. 
> Is there some workaround to make this rule weaker, e.g. through configuration? 
> We use DefaultParser with all defaults. We use Tika 1.11, but we also tried 
> Tika 1.16 and saw the same StackOverflowError (which probably again happened 
> because the file was parsed as message/rfc822).
> The only workaround that I found was to add 
> custom-mimetypes.xml like this
>  
> 
>   
> 
>   
> Would you recommend some other workaround to make sure the file does not get 
> parsed as rfc822? 
> And I have another question: can this custom-mimetypes.xml be specified from 
> an external location? 
> Many thanks.





[jira] [Resolved] (TIKA-3153) Text File identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3153.
---
Fix Version/s: 1.25
   Resolution: Fixed

> Text File identified as message/rfc822
> --
>
> Key: TIKA-3153
> URL: https://issues.apache.org/jira/browse/TIKA-3153
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.24.1
>Reporter: Akash
>Priority: Major
> Fix For: 1.25
>
> Attachments: TextFileIdentifiedAsMessage.txt
>
>
> Text file containing the word Received: is identified as message/rfc822.
> We were earlier using version 1.9 and it used to identify file type properly 
> as text/plain.
> Even if multiple lines are there, if one line with Received: is present, 
> content type is incorrectly identified.
> To check we can run java -jar tika-app-1.24.1.jar 
> TextFileIdentifiedAsMessage.txt





[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384931#comment-17384931
 ] 

Tim Allison commented on TIKA-3489:
---

[~nick], any recommendations? {{text/x-robots}} subtype of {{text/plain}}?

LOL: http://www.nextthing.org/archives/2007/03/12/robotstxt-adventure

> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384966#comment-17384966
 ] 

Tim Allison commented on TIKA-3489:
---

I added mime detection for robots.txt in {{main}} with mime {{text/x-robots}}.  
We can change this before the next release if there are objections.

> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





Interesting PDF on stackoverflow

2021-07-21 Thread Tim Allison
https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly

Not sure there's much we should do on the Tika side.

How hard would it be to add an "extract only text that is on the page" feature?

Best,

   Tim


[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384992#comment-17384992
 ] 

Sebastian Nagel commented on TIKA-3489:
---

The [robots.txt RFC draft|https://datatracker.ietf.org/doc/draft-koster-rep/] 
requires "text/plain" as media type. It would seem more straightforward to 
just recognize a robots.txt file as "text/plain".

> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





Re: Interesting PDF on stackoverflow

2021-07-21 Thread Tilman Hausherr
Maybe this could be done with the ExtractTextByArea example. However, 
IIRC its coordinates are AWT-like (y = 0 at the top), so the PDF 
coordinates would somehow have to be mapped to that.
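
The mapping mentioned above could look roughly like this. It is a sketch under 
the assumption that the page is unrotated and its visible box starts at (0, 0). 
The class and method names are hypothetical and not part of PDFBox or Tika. A 
rectangle produced this way could then be handed to an area-based text stripper.

```java
import java.awt.geom.Rectangle2D;

// Illustrative helper; assumes an unrotated page whose box origin is (0, 0).
final class PageAreaSketch {

    // Converts a rectangle from PDF user space (origin bottom-left, y grows up)
    // to top-left-origin coordinates (y grows down) for a page of the given height.
    static Rectangle2D.Double toTopLeftOrigin(Rectangle2D.Double pdfRect, double pageHeight) {
        double topY = pageHeight - (pdfRect.y + pdfRect.height);
        return new Rectangle2D.Double(pdfRect.x, topY, pdfRect.width, pdfRect.height);
    }

    // True if a point given in top-left-origin coordinates lies inside the page area.
    static boolean isOnPage(double x, double y, Rectangle2D.Double pageArea) {
        return pageArea.contains(x, y);
    }
}
```

For a US Letter page (612 x 792 points), a PDF-space rectangle at x=10, y=20 with 
width 100 and height 50 maps to y = 792 - (20 + 50) = 722 in top-left-origin 
coordinates. Rotated pages or crop boxes with a non-zero origin would need an 
additional transform.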


Tilman

On 21.07.2021 at 18:21, Tim Allison wrote:

https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly

Not sure there's much we should do on the Tika side.

How hard would it be to add an "extract only text that is on the page" feature?

Best,

Tim





[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385061#comment-17385061
 ] 

Hudson commented on TIKA-3489:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #286 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/286/])
TIKA-3489 -- add mime detection for robots.txt files (tallison: 
[https://github.com/apache/tika/commit/5e2a3c081b3867086e417cb5cb032cb12be3c19d])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRobots.txt
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java


> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.0.0, 1.25, 1.26, 1.27
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and contains also a second rule not 
> too far away from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.





[jira] [Created] (TIKA-3490) Fix serialization in opensearch emitter for embedded documents

2021-07-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-3490:
-

 Summary: Fix serialization in opensearch emitter for embedded 
documents
 Key: TIKA-3490
 URL: https://issues.apache.org/jira/browse/TIKA-3490
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Tim Allison


Serialization isn't working for embedded documents in the OpenSearch emitter.





[jira] [Updated] (TIKA-3490) Fix serialization in opensearch emitter for embedded documents

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3490:
--
Description: Serialization isn't working for embedded documents in the 
OpenSearch emitter.  This fix is simple; the effect of the bug, catastrophic 
for this emitter. :(  (was: Serialization isn't working for embedded documents 
in the OpenSearch emitter.)

> Fix serialization in opensearch emitter for embedded documents
> --
>
> Key: TIKA-3490
> URL: https://issues.apache.org/jira/browse/TIKA-3490
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.0.0
>Reporter: Tim Allison
>Priority: Major
>
> Serialization isn't working for embedded documents in the OpenSearch emitter. 
>  This fix is simple; the effect of the bug, catastrophic for this emitter. :(





[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3483:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Implement a network policy for Helm Chart
> -
>
> Key: TIKA-3483
> URL: https://issues.apache.org/jira/browse/TIKA-3483
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> See https://github.com/apache/tika-helm/pull/5 for context





[jira] [Updated] (TIKA-3454) Facilitate configuration of translation and transcription impls in tika-server/tika-docker/tika-helm

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3454:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Facilitate configuration of translation and transcription impls in 
> tika-server/tika-docker/tika-helm
> 
>
> Key: TIKA-3454
> URL: https://issues.apache.org/jira/browse/TIKA-3454
> Project: Tika
>  Issue Type: Bug
>  Components: docker, helm, server
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> I need an easy way to configure, for example, the 
> [AmazonTranscribe|https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-transcribe-aws/src/main/java/org/apache/tika/parser/transcribe/aws/AmazonTranscribe.java]
>  implementation when I deploy tika-server (tika-docker) via the Helm chart 
> into Kubernetes. The same goes for Tika translation implementations.
> We have [documentation for configuring tika-server to run via 
> Docker|https://github.com/apache/tika-docker#custom-config] however 
> currently, there is [no way to configure translators or 
> transcribers|https://tika.apache.org/1.26/configuring.html#Configuring_Translators]
>  
> This task will determine a sensible means by which we can configure 
> translators and transcribers for tika-server such that it can be used further 
> downstream via Docker and Helm on K8s.





[jira] [Updated] (TIKA-3452) java.nio.file.FileSystemException Read-only file system in 2.0.0-BETA tika-docker

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3452:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> java.nio.file.FileSystemException Read-only file system in 2.0.0-BETA 
> tika-docker
> -
>
> Key: TIKA-3452
> URL: https://issues.apache.org/jira/browse/TIKA-3452
> Project: Tika
>  Issue Type: Bug
>  Components: docker, helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> The following ExecutionException is thrown when I attempt to run [tika-docker 
> 2.0.0-BETA|https://hub.docker.com/layers/apache/tika/2.0.0-BETA-full/images/sha256-2d735f7bdf86e618a5390d92614a310697f9134d11a2b2e4c1c0cfcde1f68b1d?context=explore]
> {code:bash}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.util.concurrent.ExecutionException: java.nio.file.FileSystemException: 
> /tmp/apache-tika-server-forked-tmp-8374629799942405236: Read-only file system
>   at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>   at 
> org.apache.tika.server.core.TikaServerCli.mainLoop(TikaServerCli.java:116)
>   at 
> org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:88)
>   at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66)
> Caused by: java.nio.file.FileSystemException: 
> /tmp/apache-tika-server-forked-tmp-8374629799942405236: Read-only file system
>   at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219)
>   at java.base/java.nio.file.Files.newByteChannel(Files.java:375)
>   at java.base/java.nio.file.Files.createFile(Files.java:652)
>   at 
> java.base/java.nio.file.TempFileHelper.create(TempFileHelper.java:137)
>   at 
> java.base/java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:160)
>   at java.base/java.nio.file.Files.createTempFile(Files.java:917)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:220)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:210)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:117)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:50)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {code}
> There are differences/improvements in the way the [tika-server child process 
> is 
> spawned|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MakingTikaServerRobusttoOOMs,InfiniteLoopsandMemoryLeaks]
>  in the 2.0.0-BETA docker image. I am investigating a fix.
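
The failing call in the trace above is Files.createTempFile. Here is a minimal 
sketch of how one might probe for this condition before starting the server. 
The class and method names are hypothetical, and this is not the fix that was 
applied in tika-docker or tika-helm.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative probe; not part of Tika.
final class TempDirCheckSketch {

    // Returns true if a temp file can be created (and removed) in dir.
    static boolean canCreateTempFile(Path dir) {
        try {
            Path tmp = Files.createTempFile(dir, "tika-check-", ".tmp");
            Files.deleteIfExists(tmp);
            return true;
        } catch (IOException | RuntimeException e) {
            // A read-only file system surfaces here as a FileSystemException,
            // which is a subclass of IOException.
            return false;
        }
    }
}
```

On a container with a read-only root file system, such a probe would return 
false for /tmp unless a writable volume is mounted there.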





[jira] [Updated] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3400:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> Use equals() for object and string comparison; == compares references by 
> identity, not by value.
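
A small illustration of the difference, with hypothetical class and method 
names:

```java
import java.util.Objects;

// Illustrative only: contrasts identity comparison with value comparison.
final class EqualsSketch {

    // True only if both variables refer to the very same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True if both strings hold the same characters; null-safe.
    static boolean sameValue(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String a = new String("tika");
        String b = new String("tika");
        System.out.println(sameReference(a, b)); // distinct objects
        System.out.println(sameValue(a, b));     // same contents
    }
}
```

Two separately constructed strings with the same contents are equal by value but 
not by identity, which is why == can silently give the wrong answer.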





[jira] [Updated] (TIKA-3404) Rearchitect GoogleTranslator to use https://github.com/googleapis/java-translate

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3404:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Rearchitect GoogleTranslator to use 
> https://github.com/googleapis/java-translate
> 
>
> Key: TIKA-3404
> URL: https://issues.apache.org/jira/browse/TIKA-3404
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> GoogleTranslator is busted just now...
> I can't seem to get it to work.
> I just propose to re-architect it to use the official [Google Java 
> SDK|https://github.com/googleapis/java-translate]
> I have some other tasks to work on but I will try to come back to this.





[jira] [Updated] (TIKA-3003) Remove unused dependencies

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3003:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Remove unused dependencies
> --
>
> Key: TIKA-3003
> URL: https://issues.apache.org/jira/browse/TIKA-3003
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.0.0
>Reporter: César Soto Valero
>Priority: Minor
> Fix For: 2.0.0-BETA
>
>
> I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module 
> *tika-parsers* to avoid pulling in a vulnerable version from 
> *edu.ucar:grib*. However, this dependency is not used and, therefore, it can 
> be removed to make the pom clearer and the dependency tree of this module 
> less complex.
> In addition, dependency *net.sf.ehcache:ehcache-core*, induced transitively 
> from *edu.ucar:cdm:4.5.5*, is not used and can be excluded safely. Notice 
> that the size of the jar of *ehcache-core* is around 1.3MB, thus removing it 
> has a positive impact on the size of *tika-parsers*.





[jira] [Updated] (TIKA-3348) Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3348:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Improve the workflow for extracting and returning images from PDFs and other 
> containers using Tika Server..
> ---
>
> Key: TIKA-3348
> URL: https://issues.apache.org/jira/browse/TIKA-3348
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.25
>Reporter: Simon Lucy 
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> There's a set of bumps in the road to navigate when extracting images from 
> PDFs, retrieving them and managing the metadata using Tika Server.
> The first is knowing that /unpack will do the basic job and return the 
> embedded objects in a zip file (presuming setExtractInlineImages is True). 
> Documenting this clearly in the Tika Server wiki page would help people 
> enormously.
> But processing those images after they've been extracted will either need 
> inspecting with another tool or using /rmeta to return the mime types and the 
> rest of the metadata.
> This means that multiple passes need to be made over the same file, and the 
> same steps of extraction, identification and temporary storage will be 
> repeated.
> The server processes of /rmeta and /unpack need to be melded. The simplest 
> may be to generate /rmeta metadata in the __META__ file object added to the 
> returned zip file. A more complicated but perhaps more hypermedia way would 
> be to use Content Negotiation and indicate an Accept application/zip in the 
> /rmeta request.
> I've indicated a Fix version of 2.0 because it is, if not a breaking change, 
> a considerable one.
> I'm available for Help Wanted, if that helps.
>  





[jira] [Updated] (TIKA-3420) Set tesseract ocr langauges as docker build args

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3420:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Set tesseract ocr langauges as docker build args
> 
>
> Key: TIKA-3420
> URL: https://issues.apache.org/jira/browse/TIKA-3420
> Project: Tika
>  Issue Type: Improvement
>  Components: docker, tika-docker
>Affects Versions: 1.26
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> PR and context available at https://github.com/apache/tika-docker/pull/2





[jira] [Updated] (TIKA-2945) AutoDetectParser should skip the content type detection if Metadata already has it

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2945:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> AutoDetectParser should skip the content type detection if Metadata already 
> has it
> --
>
> Key: TIKA-2945
> URL: https://issues.apache.org/jira/browse/TIKA-2945
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.23, 2.0.0-BETA
>
>






[jira] [Updated] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3368:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Add Bill of Materials (BOM) artifact (Tika 1.x)
> ---
>
> Key: TIKA-3368
> URL: https://issues.apache.org/jira/browse/TIKA-3368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0-BETA
>
>






[jira] [Updated] (TIKA-2758) Possible error charset detection

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2758:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Possible error charset detection
> 
>
> Key: TIKA-2758
> URL: https://issues.apache.org/jira/browse/TIKA-2758
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.18
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 2.0.0-BETA
>
> Attachments: detroidnews.html, grep_charsets.csv, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> all 995 unit tests and observed three failures: two encoding issues and one 
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
> now got 'Spokane, Wash. [â€" The solar' in one test. The other had 'could 
> take ["weeks, or' but we now get 'could take [“weeks, or' extracted. Our 
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test 
> complaining about the character encoding being incorrect. It seems that 
> somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.
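The garbled output quoted above has the typical shape of UTF-8 bytes decoded as windows-1252. A minimal sketch of that effect, assuming this is the mismatch involved (the issue does not confirm which charsets the detector chose); the function name is hypothetical:

```python
def misdecode(text, actual="utf-8", assumed="windows-1252"):
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed, errors="replace")


original = "Spokane, Wash. \u2014 The solar"
garbled = misdecode(original)

# The damage is reversible as long as no byte was replaced during decoding.
repaired = garbled.encode("windows-1252").decode("utf-8")
```

Here the single dash character becomes three characters, which matches the pattern in the failing test.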





[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3367:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Add Bill of Materials (BOM) artifact
> 
>
> Key: TIKA-3367
> URL: https://issues.apache.org/jira/browse/TIKA-3367
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0-BETA
>
>






[jira] [Updated] (TIKA-2796) Update GoogleTranslator to use google-cloud-translate Java API

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2796:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Update GoogleTranslator to use google-cloud-translate Java API
> --
>
> Key: TIKA-2796
> URL: https://issues.apache.org/jira/browse/TIKA-2796
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Affects Versions: 1.19.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> The GoogleTranslator logic has been neglected and is no longer functional.
> We can upgrade to use the official Google Java API at 
> https://search.maven.org/artifact/com.google.cloud/google-cloud-translate/1.54.0/jar
> Additionally, documentation for this upgrade can be found at 
> https://cloud.google.com/translate/docs/quickstart-client-libraries#client-libraries-install-java





[jira] [Updated] (TIKA-3270) Render non-text in PDFs for OCR

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3270:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Render non-text in PDFs for OCR
> ---
>
> Key: TIKA-3270
> URL: https://issues.apache.org/jira/browse/TIKA-3270
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.0.0-BETA
>
> Attachments: test-no-text.png, test.png, tiger-no-text.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of 
> the contents of the page, including text that may be available via regular 
> extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be 
> duplicate text -- text as stored in PDFs and the text generated via OCR.  In 
> the xhtml output, we do mark a separate "div" for OCR so that users can 
> distinguish, but still, it might be useful not to have to run OCR on text 
> that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a 
> technical/implementation recommendation by [~tilman] to subclass PDFRenderer 
> and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.





[jira] [Updated] (TIKA-3314) Treat soft hyphens like hyphens

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3314:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Treat soft hyphens like hyphens
> ---
>
> Key: TIKA-3314
> URL: https://issues.apache.org/jira/browse/TIKA-3314
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-eval
>Affects Versions: 1.25
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.26, 2.0.0-BETA
>
> Attachments: content_diffs_no_exceptions1.xlsx
>
>
> The next PDFBox version identifies soft-hyphens (00AD) and returns them as 
> such. Tika-eval swallows them, thus reporting differences. This can be shown 
> with the file attached to PDFBOX-5115 in "Max-Planck-Institut" and in the 
> attached excel file in line 4.
> Proposed change:
> add
> "\u00AD" => "-"
> to 
> lucene-char-mapping.txt
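The effect of the proposed mapping can be sketched outside Lucene. This is only an illustration of the character substitution, with a hypothetical function name, and is not how tika-eval applies its mapping file:

```python
# Map the soft hyphen (U+00AD) to a plain hyphen, as the proposed rule does.
SOFT_HYPHEN_MAP = str.maketrans({"\u00ad": "-"})


def normalize_soft_hyphens(text):
    """Replace every soft hyphen in text with an ASCII hyphen."""
    return text.translate(SOFT_HYPHEN_MAP)
```

With this substitution, text containing soft hyphens and text containing regular hyphens compare equal, so the spurious differences go away.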





[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2623:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> get embedded resources in PDF/doc files
> ---
>
> Key: TIKA-2623
> URL: https://issues.apache.org/jira/browse/TIKA-2623
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, core, parser
>Reporter: Ohad R
>Priority: Trivial
> Fix For: 2.0.0-BETA
>
>
> The motivation: support embedded files in PDF, Word's doc/docx, etc.
> According to 
> [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
>  it is possible to recursively parse a document and save its sub-items (e.g. 
> images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that 
> class is only available inside TikaCLI.
> I think it should be visible to applications that use Tika (not only to 
> the CLI).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-2794) Tika extracts text from pdf on MacBook, but not windows server.,

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2794:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika extracts text from pdf on MacBook, but not windows server.,
> 
>
> Key: TIKA-2794
> URL: https://issues.apache.org/jira/browse/TIKA-2794
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.19.1
> Environment: MacBook Pro and Windows Server 2012
> This code works on the enclosed pdf file on a MacBook, but not using windows 
> server?
>Reporter: Paul Hallett
>Priority: Major
> Fix For: 2.0.0-BETA
>
> Attachments: test2.pdf
>
>
> try:
>     headers = {'X-Tika-PDFextractInlineImages': 'true'}
>     data = parser.from_file(pathtofile, serverEndpoint=self.TIKA_SERVER,
>                             headers=headers)
>     # Keep the first `limit` whitespace-separated tokens of the content.
>     charstoreturn = data['content'].strip().split()[:limit]
>     charstoreturn = ' '.join(charstoreturn).replace("\n", " ").replace('"',
>         "'").replace(",", "").replace("'", "'")
>     return True, charstoreturn
> except Exception as err:
>     return False, "error {} on file: {}.\n".format(str(err), pathtofile)





[jira] [Updated] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2346:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Allow Office format parsers to exclude parsing shapes
> -
>
> Key: TIKA-2346
> URL: https://issues.apache.org/jira/browse/TIKA-2346
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> The Office format parsers support including or excluding of deleted text and 
> moved text. It would be good to also support something similar for 
> shape-based text, though probably not for PPT / PPTX as that's almost all 
> shape-based!
> (This has been done hackily in the Alfresco fork of Tika at  
> https://github.com/Alfresco/tika/commit/32aca3fd96816ad49b869a82c9ba0f02265f8744
>  but would be good to do properly)





[jira] [Updated] (TIKA-2946) Review how TikaConfig can avoid parsing XML itself

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2946:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Review how TikaConfig can avoid parsing XML itself
> --
>
> Key: TIKA-2946
> URL: https://issues.apache.org/jira/browse/TIKA-2946
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Reporter: Sergey Beryozkin
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> I have some issues right now with initializing the {{TikaConfig}} at 
> Quarkus build time. I'd like to do that to avoid having the XML 
> classes loaded into memory when the application starts. Moving the XML 
> parsing code out (perhaps into a static TikaConfig factory method) and a few 
> other minor tweaks will help.
> I'll try to provide more input as 2.0 gets closer.





[jira] [Updated] (TIKA-2701) Text is not extracted properly from WMF files

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2701:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Text is not extracted properly from WMF files
> -
>
> Key: TIKA-2701
> URL: https://issues.apache.org/jira/browse/TIKA-2701
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Grigoriy Alekseev
>Priority: Major
> Fix For: 2.0.0-BETA
>
> Attachments: thumbnail_1.wmf
>
>
> Text is always extracted assuming it is in cp-1252 encoding. The attached 
> thumbnail_1.wmf has text in Shift JIS and is extracted incorrectly. Should be 
> 普林斯.
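The reported behaviour can be reproduced in isolation. A minimal sketch, assuming the raw WMF text bytes are Shift JIS as the report states; the function name is hypothetical and this is not the Tika parser's code:

```python
def decode_wmf_text(raw, charset="cp1252"):
    """Decode raw text bytes; cp1252 mirrors the hard-coded default reported."""
    return raw.decode(charset, errors="replace")


raw = "普林斯".encode("shift_jis")

wrong = decode_wmf_text(raw)               # decoded with the assumed default
right = decode_wmf_text(raw, "shift_jis")  # decoded with the real charset
```

Only the second call recovers the expected string, which is why the charset needs to come from the file rather than being fixed.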





[jira] [Updated] (TIKA-2711) When parsing a UNIX text file apostrophes are rendered as ?

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2711:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> When parsing a UNIX text file apostrophes are rendered as ?
> ---
>
> Key: TIKA-2711
> URL: https://issues.apache.org/jira/browse/TIKA-2711
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.18
> Environment: Windows 10
>Reporter: Ichbiah
>Priority: Minor
> Fix For: 2.0.0-BETA
>
> Attachments: long_text_dos.txt, long_text_unix.txt, petit_dos.txt, 
> petit_unix.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I have a small text file in two versions:
>  * a dos version of the file
>  * a unix version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using the tika-app.jar, the text is correctly "extracted" 
> from the DOS version of the file. For the UNIX version of the file the 
> apostrophes are falsely rendered as question marks.
>  
>  





[jira] [Updated] (TIKA-2720) A parser to output universal sentence encodings to text

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2720:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> A parser to output universal sentence encodings to text
> ---
>
> Key: TIKA-2720
> URL: https://issues.apache.org/jira/browse/TIKA-2720
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-dl
>Reporter: Thejan Wijesinghe
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> This parser encodes a text into high dimensional vectors that can be used for 
> text classification, semantic similarity, clustering and other natural 
> language tasks. The model is trained and optimized for greater-than-word 
> length text, such as sentences, phrases or short paragraphs. It is trained on 
> a variety of data sources and a variety of tasks with the aim of dynamically 
> accommodating a wide variety of natural language understanding tasks. The 
> input is variable length English text and the output is a 512 dimensional 
> vector.





[jira] [Updated] (TIKA-2492) Remove pdfdebugger from tika

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2492:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Remove pdfdebugger from tika
> 
>
> Key: TIKA-2492
> URL: https://issues.apache.org/jira/browse/TIKA-2492
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA
>
>
> PDFDebugger isn't needed in tika but it is a dependency in pdfbox-tools 
> (because that one contains the command line interface, which calls the PDFBox 
> command line tools).
> Thus I suggest that the tika parser pom be changed like this:
> {code}
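> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>pdfbox-tools</artifactId>
>   <version>${pdfbox.version}</version>
>   <exclusions>
>     <exclusion>
>       <groupId>commons-logging</groupId>
>       <artifactId>commons-logging</artifactId>
>     </exclusion>
> +    <exclusion>
> +      <groupId>org.apache.pdfbox</groupId>
> +      <artifactId>pdfbox-debugger</artifactId>
> +    </exclusion>
>   </exclusions>
> </dependency>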
> {code}
> This saves you 200KB in tika-app. That's not much, but every weight loss 
> counts :-)
> It should also be possible to get it removed from tika-bundle, but I don't 
> know how to remove it properly. Just removing it from "Embed-Dependency" 
> isn't enough.





[jira] [Updated] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2346:
--
Fix Version/s: 2.0.1

> Allow Office format parsers to exclude parsing shapes
> -
>
> Key: TIKA-2346
> URL: https://issues.apache.org/jira/browse/TIKA-2346
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Nick Burch
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> The Office format parsers support including or excluding of deleted text and 
> moved text. It would be good to also support something similar for 
> shape-based text, though probably not for PPT / PPTX as that's almost all 
> shape-based!
> (This has been done hackily in the Alfresco fork of Tika at  
> https://github.com/Alfresco/tika/commit/32aca3fd96816ad49b869a82c9ba0f02265f8744
>  but would be good to do properly)





[jira] [Updated] (TIKA-2596) Make PDF2XHTML and AbstractPDF2XHTML public classes

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2596:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Make PDF2XHTML and AbstractPDF2XHTML public classes
> ---
>
> Key: TIKA-2596
> URL: https://issues.apache.org/jira/browse/TIKA-2596
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.17
>Reporter: Kyle Dent
>Priority: Major
> Fix For: 2.0.0-BETA
>
>
> In tika-parsers/pdf/ would it be possible to make PDF2XHTML and 
> AbstractPDF2XHTML public so they can be subclassed? We would like to capture 
> some additional font and layout information when outputting XHTML. We would 
> like to extend PDF2XHTML and override some of the methods to do what we 
> need.





[jira] [Updated] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2565:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Upgrade edu.ucar dependencies to 4.6.11
> ---
>
> Key: TIKA-2565
> URL: https://issues.apache.org/jira/browse/TIKA-2565
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.17
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.0.0-BETA
>
>
> An [existing PR|https://github.com/apache/tika/pull/212/files] suggests 
> upgrading the netcdf4-java dependency; however, it does not address the issue.
> This PR will add the correct Maven repository configuration and then make the 
> upgrade(s).
> https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/BuildDependencies.html





[jira] [Updated] (TIKA-2312) [Mp3Parser] expose fields form ID3TagsAndAudio

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2312:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> [Mp3Parser] expose fields form ID3TagsAndAudio 
> ---
>
> Key: TIKA-2312
> URL: https://issues.apache.org/jira/browse/TIKA-2312
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Łukasz Ozimek
>Priority: Trivial
>  Labels: beginner, easyfix
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Hi,
> This is my first issue in the ASF Jira, so apologies for any mistakes.
> I am currently working on some custom parsers for MP3 files. I would like 
> access to the fields of this class because the system from which I am 
> transforming data depends on the availability of particular versions of 
> ID3 tags, and this class would easily allow me to check that. 
> Moreover, in the current code base Mp3Parser exposes the method 
> {code}
> protected static ID3TagsAndAudio getAllTagHandlers(InputStream stream,
>         ContentHandler handler)
>         throws IOException, SAXException, TikaException {
> }
> {code}
> which returns an object that has no accessible fields. That seems strange to 
> me. Is there any reason for that?





[jira] [Updated] (TIKA-2558) Add a new pid api to Tika

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2558:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Add a new pid api to Tika
> -
>
> Key: TIKA-2558
> URL: https://issues.apache.org/jira/browse/TIKA-2558
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.17
> Environment: All platforms on which Tika can run.
>Reporter: Stefan Sveen
>Priority: Minor
> Fix For: 2.0.0-BETA
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Using Tika with Mono on Windows, Linux and Mac I miss a simple and 
> OS-independent way to get the Process ID of a running Tika service. 
> Therefore, I suggest that the following API is added  to Tika:
> *GET* _[/pid|http://localhost:16200/version]_
>  Class: org.apache.tika.server.resource.TikaPID (guessing that this would be 
> the class)
>  Method: getPID
>  Produces: text/plain
> The output would be the integer value (as a string) of the PID
>  
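The proposed behaviour is small enough to sketch. The following only illustrates the suggested contract (plain-text PID on GET /pid) with a hypothetical handler function; it is not Tika server code, and the 404 branch is an assumption:

```python
import os


def handle_get(path):
    """Return (status, content_type, body) for a GET request.

    Only the proposed /pid route is handled in this sketch.
    """
    if path == "/pid":
        # The PID is returned as its integer value rendered as a string.
        return 200, "text/plain", str(os.getpid())
    return 404, "text/plain", "not found"
```

A client on any operating system could then read the body and parse it as an integer, without relying on platform-specific process tools.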





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2071:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers 
> from dynamic ServiceLoader Parsers
> ---
>
> Key: TIKA-2071
> URL: https://issues.apache.org/jira/browse/TIKA-2071
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Major
> Fix For: 2.0.0-BETA, 2.0.1
>
>
> The DefaultParser and CompositeParser do not filter dynamic services using 
> the excludedParser List.  The exclude list should be applied here as well.
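The intended behaviour can be sketched independently of Tika's classes. Everything below is hypothetical (the parser classes and function names are stand-ins); the point is only that the statically configured and the dynamically loaded parsers pass through the same exclusion filter:

```python
class PdfParser:
    """Hypothetical stand-in for a statically configured parser."""


class OcrParser:
    """Hypothetical stand-in for a parser discovered dynamically."""


def filter_excluded(parsers, excluded_types):
    """Drop every parser that is an instance of an excluded type."""
    excluded = tuple(excluded_types)
    return [p for p in parsers if not isinstance(p, excluded)]


def all_parsers(static_parsers, dynamic_parsers, excluded_types):
    """Combine both sources first, then apply the exclude list once."""
    combined = list(static_parsers) + list(dynamic_parsers)
    return filter_excluded(combined, excluded_types)
```

Filtering after combining the two lists avoids the situation described in the issue, where only one of the sources honours the exclude list.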





[jira] [Updated] (TIKA-2340) Add explicit deps to tika-parsers which are currently used from transitive scope

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2340:
--
Fix Version/s: 2.0.1

> Add explicit deps to tika-parsers which are currently used from transitive 
> scope
> 
>
> Key: TIKA-2340
> URL: https://issues.apache.org/jira/browse/TIKA-2340
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>






[jira] [Updated] (TIKA-2639) Update freedesktop.org shared-mime-info-spec hyperlink in MimeTypesReader.java

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2639:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Update freedesktop.org shared-mime-info-spec hyperlink in MimeTypesReader.java
> --
>
> Key: TIKA-2639
> URL: https://issues.apache.org/jira/browse/TIKA-2639
> Project: Tika
>  Issue Type: Task
>  Components: core
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.0.0-BETA
>
>
> When attempting to upgrade TIKA over in OODT, I noticed that the hyperlink for 
> https://freedesktop.org/wiki/Specifications/shared-mime-info-spec/ is 
> broken in MimeTypesReader.java.
> This issue will simply update the hyperlink.





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1988:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>Priority: Major
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Author age can be the first feature, and more can be added later.
> --
> Integrating work done on age classification. More details about the 
> classifier are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika.





[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1988:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Age Detection Tika Recogniser
> -
>
> Key: TIKA-1988
> URL: https://issues.apache.org/jira/browse/TIKA-1988
> Project: Tika
>  Issue Type: New Feature
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>Priority: Major
>  Labels: age, machine_learning, memex, nlp, opennlp
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Author age can be the first feature, and more can be added later.
> --
> Integrating work done on age classification. More details about the 
> classifier are in the repo below:
> https://github.com/USCDataScience/Age-Predictor
> The Git repo has a Java client which can be integrated into Tika.





[jira] [Updated] (TIKA-2312) [Mp3Parser] expose fields form ID3TagsAndAudio

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2312:
--
Fix Version/s: 2.0.1

> [Mp3Parser] expose fields form ID3TagsAndAudio 
> ---
>
> Key: TIKA-2312
> URL: https://issues.apache.org/jira/browse/TIKA-2312
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Łukasz Ozimek
>Priority: Trivial
>  Labels: beginner, easyfix
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Hi,
> This is my first issue in the ASF Jira, so apologies for any mistakes.
> I am currently working on some custom parsers for MP3 files. I would like 
> access to the fields of this class because the system from which I am 
> transforming data depends on the availability of particular versions of 
> ID3 tags, and this class would easily allow me to check that. 
> Moreover, in the current code base Mp3Parser exposes the method 
> {code}
> protected static ID3TagsAndAudio getAllTagHandlers(InputStream stream,
>         ContentHandler handler)
>         throws IOException, SAXException, TikaException {
> }
> {code}
> which returns an object that has no accessible fields. That seems strange to 
> me. Is there any reason for that?





[jira] [Updated] (TIKA-2542) Support in tika-server for getting plain text and metadata at the same time

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2542:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Support in tika-server for getting plain text and metadata at the same time
> ---
>
> Key: TIKA-2542
> URL: https://issues.apache.org/jira/browse/TIKA-2542
> Project: Tika
>  Issue Type: Improvement
>  Components: core, server
>Affects Versions: 1.17
>Reporter: Manolo Caracuel
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0-BETA
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good to have a way to get a file's extracted plain text and also 
> get the detected metadata. Currently you can only get the metadata if the 
> request has an Accept header of text/xml or text/html, but then the text in 
> the body is not plain text, as it contains HTML elements as well.
> I propose that when requesting /tika/plain with an Accept header of text/xml, 
> an XHTML document is returned with the metadata in the head's meta elements 
> and the plain text in the body.
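The proposed response shape can be sketched as follows. This is only an illustration of the document layout described above, using a hypothetical function name and example metadata keys; it is not the tika-server implementation:

```python
from xml.etree import ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def to_xhtml(metadata, plain_text):
    """Build an XHTML string with metadata in <head> and text in <body>."""
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    for name, value in metadata.items():
        # One <meta> element per metadata key/value pair.
        ET.SubElement(head, "meta", {"name": name, "content": value})
    body = ET.SubElement(html, "body")
    body.text = plain_text  # markup characters are escaped on serialisation
    return ET.tostring(html, encoding="unicode")


doc = to_xhtml({"Content-Type": "application/pdf"}, "Hello <world>")
```

A client would then need a single request to obtain both the metadata and the unformatted text.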





[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1829:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  NPE 
> 
>
> Key: TIKA-1829
> URL: https://issues.apache.org/jira/browse/TIKA-1829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: OSX 10.11
>Reporter: frank
>Assignee: Tim Allison
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TesseractOCRParser.java
>
>
> Just need to add a null check on the context parameter.
> 2016-01-11 12:36:52.328 [http-nio-8080-exec-9] WARN  
> o.a.j.core.query.lucene.NodeIndexer - Exception while indexing binary property
> java.lang.NullPointerException: null
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  ~[tika-parsers-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:87) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.isSupportedMediaType(NodeIndexer.java:934)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:448)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:338)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:270)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1246)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.mergeAggregatedNodeIndexes(SearchIndex.java:1539)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1247)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.updateNodes(SearchIndex.java:667)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.SearchManager.onEvent(SearchManager.java:408) 
> [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventConsumer.consumeEvents(EventConsumer.java:249)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:225)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventStateCollection.dispatch(EventStateCollection.java:475)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager$Update.end(SharedItemStateManager.java:856)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase$WriteOperation.save(VersionManagerImplBase.java:470)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase.checkoutCheckin(VersionManagerImplBase.java:215)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.VersionManagerImpl.access$400(VersionManagerImpl.java:7
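
The null check requested in TIKA-1829 could look roughly like the sketch below. It is not Tika's actual code: SketchContext, SketchOcrConfig and SketchOcrParser are hypothetical stand-ins for the real classes, and the listed media types are only illustrative. The assumption is that the NPE comes from dereferencing a null context, as the report suggests, and that falling back to a default configuration is acceptable.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parse context: a typed lookup table.
class SketchContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    // Returns the stored value, or defaultValue when nothing was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = values.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical OCR settings; only the flag needed for the sketch.
class SketchOcrConfig {
    final boolean ocrAvailable;

    SketchOcrConfig(boolean ocrAvailable) {
        this.ocrAvailable = ocrAvailable;
    }
}

class SketchOcrParser {
    private static final SketchOcrConfig DEFAULT_CONFIG = new SketchOcrConfig(false);

    // Illustrative media types, not the real parser's list.
    private static final Set<String> SUPPORTED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("image/png", "image/jpeg", "image/tiff")));

    // A caller (an indexer, for example) may pass null. Fall back to the
    // default configuration instead of dereferencing the null context.
    Set<String> getSupportedTypes(SketchContext context) {
        SketchOcrConfig config = (context == null)
                ? DEFAULT_CONFIG
                : context.get(SketchOcrConfig.class, DEFAULT_CONFIG);
        return config.ocrAvailable ? SUPPORTED : Collections.<String>emptySet();
    }
}
```

With this guard, a caller that passes null gets an empty set of supported types rather than an exception.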

[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1697:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Parser Implementation for AkomaNtoso Legal XML Documents
> 
>
> Key: TIKA-1697
> URL: https://issues.apache.org/jira/browse/TIKA-1697
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal 
> Document XML standard and used pervasively within parliaments and other 
> legislative arenas.
> This issue should utilize the 
> [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and 
> populate Metadata for AkomaNtoso .xml and .akn documents.
> I'll send a PR for this soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1953:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> tika-server NullPointerException while processing rtfs
> --
>
> Key: TIKA-1953
> URL: https://issues.apache.org/jira/browse/TIKA-1953
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
> Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>Reporter: Ravi
>Assignee: Tim Allison
>Priority: Major
>  Labels: newbie, rtf, tika-python, tika-server, xmlContent,
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: officeinstallations3.rtf
>
>
> Looks like the xmlContent=True flag causes tika.py: Warn: Tika server 
> returned status: 422 error
> I start the tika server and then run the following code in the python kernel 
> at bash
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get.. tika.py: Warn: Tika server returned status: 422
> Looking at the tika-server log I get the following dump:
> Note: The parser seems to work fine without the xmlContent=True flag set. I 
> get the right output but setting this flag creates the NullPointerException 
> below
> --
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@21f0dbb9
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.

[jira] [Updated] (TIKA-2369) Define a clean Recogniser interface: for objects from binary data; and for text classification

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2369:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Define a clean Recogniser interface: for objects from binary data; and for 
> text classification
> --
>
> Key: TIKA-2369
> URL: https://issues.apache.org/jira/browse/TIKA-2369
> Project: Tika
>  Issue Type: Bug
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> As described in TIKA-2360 we should refactor the ObjectRecogniser interface. 
> I propose creating:
> 1. TextRecogniser (per [~thammegowda] it takes INPUT:text input and 
> OUTPUT:set of metadata key values)
> 2. ObjectRecogniser (also per Thamme ObjectRecogniser, VideoLabeller, OCR, 
> Caption - INPUT:raw bytes and OUTPUT:set of metadata key values.)
> We should of course rectify this with Tika-DL and how that folds in. 
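
The two interfaces proposed in TIKA-2369 could be sketched as follows. This is only an illustration of the split described above (text in, metadata out; raw bytes in, metadata out). The interface names carry a Sketch suffix because the method name, the use of a plain Map for metadata, and both toy implementations are assumptions, not the design that was adopted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Text in, metadata key/values out.
interface TextRecogniserSketch {
    Map<String, String> recognise(String text);
}

// Raw bytes in, metadata key/values out.
interface ObjectRecogniserSketch {
    Map<String, String> recognise(byte[] data);
}

// Toy text recogniser: reports the number of whitespace-separated words.
class WordCountRecogniser implements TextRecogniserSketch {
    @Override
    public Map<String, String> recognise(String text) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String trimmed = text.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        metadata.put("sketch:wordCount", Integer.toString(words));
        return metadata;
    }
}

// Toy object recogniser: reports the size of the raw input.
class ByteLengthRecogniser implements ObjectRecogniserSketch {
    @Override
    public Map<String, String> recognise(byte[] data) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("sketch:byteLength", Integer.toString(data.length));
        return metadata;
    }
}
```

Keeping the two input types in separate interfaces means a caller never has to guess whether a recogniser expects decoded text or raw bytes.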





[jira] [Updated] (TIKA-3104) Detection of memgraph files exported from Xcode

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3104:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Detection of memgraph files exported from Xcode
> ---
>
> Key: TIKA-3104
> URL: https://issues.apache.org/jira/browse/TIKA-3104
> Project: Tika
>  Issue Type: Wish
>  Components: core
>Affects Versions: 1.24
>Reporter: Parth
>Assignee: Tim Allison
>Priority: Major
>  Labels: detection, features, new-parser
> Fix For: 2.0.0-BETA
>
> Attachments: DeepScroll_Example[4988].memgraph, 
> DeepScroll_Example[6314]_bplist.memgraph, 
> DeepScroll_Example[6314]_xml.memgraph, memgraph.xml, out.memgraph.json, 
> out.memgraph.xhtml
>
>
> I wanted to detect a memgraph file linked by a URL, but detection of memgraph 
> files is currently not supported. I tried adding an entry to custom-mimetypes, 
> but that did not help.





[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1724:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Create parser for .obo file format.
> ---
>
> Key: TIKA-1724
> URL: https://issues.apache.org/jira/browse/TIKA-1724
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1724.patch, TIKA-1724.patch
>
>
> This parser implementation caters for files of the [OBO Flat File Format 
> Guide, version 1.4|http://purl.obolibrary.org/obo/oboformat/spec.html] 
> MimeType.
> The OBO format is the text file format used by OBO-Edit, the open source, 
> platform-independent application for viewing and editing ontologies. This 
> file format is used heavily within the clinical and biomedical fields as a 
> particular flat file serialization for ontologies. .obo files are 'typically' 
> accompanied by corresponding .owl serializations as this is also another file 
> format used pervasively within the clinical and biomedical fields.
> I would sincerely appreciate code review. Thanks.





[jira] [Updated] (TIKA-2369) Define a clean Recogniser interface: for objects from binary data; and for text classification

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2369:
--
Fix Version/s: 2.0.1

> Define a clean Recogniser interface: for objects from binary data; and for 
> text classification
> --
>
> Key: TIKA-2369
> URL: https://issues.apache.org/jira/browse/TIKA-2369
> Project: Tika
>  Issue Type: Bug
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> As described in TIKA-2360 we should refactor the ObjectRecogniser interface. 
> I propose creating:
> 1. TextRecogniser (per [~thammegowda] it takes INPUT:text input and 
> OUTPUT:set of metadata key values)
> 2. ObjectRecogniser (also per Thamme ObjectRecogniser, VideoLabeller, OCR, 
> Caption - INPUT:raw bytes and OUTPUT:set of metadata key values.)
> We should of course rectify this with Tika-DL and how that folds in. 





[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1688:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Tika Version in Metadata
> 
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
>  Issue Type: Improvement
>Reporter: Paul Ramirez
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Could this be added as X-Tika:version? That way there would be downstream 
> traceability from an extraction to the Tika version that produced it.





[jira] [Updated] (TIKA-1808) Head section closed too eager

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1808:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Head section closed too eager
> -
>
> Key: TIKA-1808
> URL: https://issues.apache.org/jira/browse/TIKA-1808
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or 
> this is a problem in TagSoup. In this [1] case a <div> element appears in the 
> head, causing the head to be closed. Subsequent <meta> elements do not appear 
> in custom ContentHandlers, so I cannot read the document's title or any other 
> meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. 
> schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't 
> really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html





[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1709:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika Server doesn't handle multi-part attachments or form-encoded inputs
> 
>
> Key: TIKA-1709
> URL: https://issues.apache.org/jira/browse/TIKA-1709
> Project: Tika
>  Issue Type: Bug
>  Components: server
> Environment: http://github.com/chrismattmann/tika-python/ Windows 7 
> Ultimate
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Downstream in the Tika Python library, I noticed that Tika Server doesn't 
> handle multi-part attachments (e.g., in /rmeta) on Windows 7 Ultimate, such as 
> those encoded using curl -T. Tika-Server returns a 415, indicating that it 
> can't properly determine what the mime type is.
> See: 
> https://github.com/kennethreitz/requests/issues/2725
> https://github.com/chrismattmann/tika-python/issues/58
> For more info.





[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1840:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed when using PPT format, the slide notes are all aggregated at the 
> end of the XML output, and there is no way to identify which note belongs to 
> which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140 
> I would like to implement this part and contribute
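
The behaviour requested in TIKA-1840 can be illustrated with a small sketch: each slide's notes are written inside that slide's own element rather than collected at the end of the output. Nothing here uses POI or Tika. SketchSlide, SlideNotesSketch and the "slide" / "slide-notes" class names are hypothetical and chosen only for the example.

```java
import java.util.List;

// Hypothetical slide model: body text plus optional speaker notes.
class SketchSlide {
    final String text;
    final String notes; // null when the slide has no notes

    SketchSlide(String text, String notes) {
        this.text = text;
        this.notes = notes;
    }
}

class SlideNotesSketch {
    // Emits each slide's notes inside that slide's own <div>, so the
    // relationship between a slide and its notes survives in the output.
    static String render(List<SketchSlide> slides) {
        StringBuilder out = new StringBuilder();
        int number = 1;
        for (SketchSlide slide : slides) {
            out.append("<div class=\"slide\" id=\"slide-").append(number).append("\">");
            out.append("<p>").append(escape(slide.text)).append("</p>");
            if (slide.notes != null) {
                out.append("<div class=\"slide-notes\">")
                   .append(escape(slide.notes))
                   .append("</div>");
            }
            out.append("</div>\n");
            number++;
        }
        return out.toString();
    }

    // Minimal escaping so the text is safe inside element content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A consumer can then pair notes with slides by walking the nested elements, without relying on ordering at the end of the document.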





[jira] [Updated] (TIKA-1808) Head section closed too eager

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1808:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Head section closed too eager
> -
>
> Key: TIKA-1808
> URL: https://issues.apache.org/jira/browse/TIKA-1808
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or 
> this is a problem in TagSoup. In this [1] case a <div> element appears in the 
> head, causing the head to be closed. Subsequent <meta> elements do not appear 
> in custom ContentHandlers, so I cannot read the document's title or any other 
> meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. 
> schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't 
> really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html





[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1829:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  NPE 
> 
>
> Key: TIKA-1829
> URL: https://issues.apache.org/jira/browse/TIKA-1829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: OSX 10.11
>Reporter: frank
>Assignee: Tim Allison
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TesseractOCRParser.java
>
>
> Just need to add a null check on the context parameter.
> 2016-01-11 12:36:52.328 [http-nio-8080-exec-9] WARN  
> o.a.j.core.query.lucene.NodeIndexer - Exception while indexing binary property
> java.lang.NullPointerException: null
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  ~[tika-parsers-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:87) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.isSupportedMediaType(NodeIndexer.java:934)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:448)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:338)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:270)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1246)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.mergeAggregatedNodeIndexes(SearchIndex.java:1539)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1247)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.updateNodes(SearchIndex.java:667)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.SearchManager.onEvent(SearchManager.java:408) 
> [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventConsumer.consumeEvents(EventConsumer.java:249)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:225)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventStateCollection.dispatch(EventStateCollection.java:475)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager$Update.end(SharedItemStateManager.java:856)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase$WriteOperation.save(VersionManagerImplBase.java:470)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase.checkoutCheckin(VersionManagerImplBase.java:215)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.VersionManagerImpl.access$400(VersionManagerImpl.j

[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1709:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Tika Server doesn't handle multi-part attachments or form-encoded inputs
> 
>
> Key: TIKA-1709
> URL: https://issues.apache.org/jira/browse/TIKA-1709
> Project: Tika
>  Issue Type: Bug
>  Components: server
> Environment: http://github.com/chrismattmann/tika-python/ Windows 7 
> Ultimate
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Downstream in the Tika Python library, I noticed that Tika Server doesn't 
> handle multi-part attachments (e.g., in /rmeta) on Windows 7 Ultimate, such as 
> those encoded using curl -T. Tika-Server returns a 415, indicating that it 
> can't properly determine what the mime type is.
> See: 
> https://github.com/kennethreitz/requests/issues/2725
> https://github.com/chrismattmann/tika-python/issues/58
> For more info.





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2071:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers 
> from dynamic ServiceLoader Parsers
> ---
>
> Key: TIKA-2071
> URL: https://issues.apache.org/jira/browse/TIKA-2071
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Major
> Fix For: 2.0.0-BETA, 2.0.1
>
>
> The DefaultParser and CompositeParser do not filter dynamic services using 
> the excludedParser List.  The exclude list should be applied here as well.





[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1724:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Create parser for .obo file format.
> ---
>
> Key: TIKA-1724
> URL: https://issues.apache.org/jira/browse/TIKA-1724
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1724.patch, TIKA-1724.patch
>
>
> This parser implementation caters for files of the [OBO Flat File Format 
> Guide, version 1.4|http://purl.obolibrary.org/obo/oboformat/spec.html] 
> MimeType.
> The OBO format is the text file format used by OBO-Edit, the open source, 
> platform-independent application for viewing and editing ontologies. This 
> file format is used heavily within the clinical and biomedical fields as a 
> particular flat file serialization for ontologies. .obo files are 'typically' 
> accompanied by corresponding .owl serializations as this is also another file 
> format used pervasively within the clinical and biomedical fields.
> I would sincerely appreciate code review. Thanks.





[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1953:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> tika-server NullPointerException while processing rtfs
> --
>
> Key: TIKA-1953
> URL: https://issues.apache.org/jira/browse/TIKA-1953
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
> Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>Reporter: Ravi
>Assignee: Tim Allison
>Priority: Major
>  Labels: newbie, rtf, tika-python, tika-server, xmlContent,
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: officeinstallations3.rtf
>
>
> Looks like the xmlContent=True flag causes tika.py: Warn: Tika server 
> returned status: 422 error
> I start the tika server and then run the following code in the python kernel 
> at bash
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get.. tika.py: Warn: Tika server returned status: 422
> Looking at the tika-server log I get the following dump:
> Note: The parser seems to work fine without the xmlContent=True flag set. I 
> get the right output but setting this flag creates the NullPointerException 
> below
> --
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@21f0dbb9
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnec

[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1840:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed that with the PPT format, the slide notes are all aggregated at 
> the end of the XML output, and there is no way to identify which note belongs 
> to which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140 
> I would like to implement this part and contribute it.





[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1705:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Update ASM dependency to 5.0.4
> --
>
> Key: TIKA-1705
> URL: https://issues.apache.org/jira/browse/TIKA-1705
> Project: Tika
>  Issue Type: Task
>Affects Versions: 1.7
>Reporter: Uwe Schindler
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1705-2.patch, TIKA-1705.patch
>
>
> Currently the Class file parser uses ASM 4.1. This older version cannot read 
> Java 8 / Java 9 class files (it fails with an exception).
> The upgrade to ASM 5.0.4 is very simple: it is just a Maven dependency 
> change. The only code change is to update the visitor version so that new 
> Java 8 features like lambdas are reported. This is not strictly required, but 
> it should be done for full support.
> FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
> 5, too.
> You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
> problem with Lucene using a newer version). Since ASM 4.x, updates are 
> easier (no visitor interfaces anymore, abstract classes instead), so nothing 
> breaks if you just replace the JAR file. So just see this as a 
> recommendation, not urgent! Solr/Lucene will also work without this patch 
> (it just replaces the shipped ASM with a newer version in our packaging).
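
Editor's sketch: whether a given class file falls into the "Java 8 / Java 9" category can be checked from its header, since class files written for Java 7, 8 and 9 carry major versions 51, 52 and 53. The helper below is a minimal illustration using only the JDK; the class and method names are hypothetical and are not part of Tika or ASM.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not Tika or ASM code.
final class ClassFileVersion {

    /** Reads the major version from a class-file header (magic, minor, major). */
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();          // first four bytes must be 0xCAFEBABE
        if (magic != 0xCAFEBABE) {
            throw new IOException("Not a class file");
        }
        data.readUnsignedShort();            // minor version, ignored here
        return data.readUnsignedShort();     // 51 = Java 7, 52 = Java 8, 53 = Java 9
    }
}
```

Calling `ClassFileVersion.majorVersion(...)` on a stream opened over a `.class` file returns the number a bytecode reader has to be prepared for.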





[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1395:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Create embedded image extraction example
> 
>
> Key: TIKA-1395
> URL: https://issues.apache.org/jira/browse/TIKA-1395
> Project: Tika
>  Issue Type: Sub-task
>  Components: example
>Reporter: Tyler Bui-Palsulich
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Create an example of how to do embedded image extraction and parsing.





[jira] [Updated] (TIKA-2340) Add explicit deps to tika-parsers which are currently used from transitive scope

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2340:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Add explicit deps to tika-parsers which are currently used from transitive 
> scope
> 
>
> Key: TIKA-2340
> URL: https://issues.apache.org/jira/browse/TIKA-2340
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>






[jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1454:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> ---
>
> Key: TIKA-1454
> URL: https://issues.apache.org/jira/browse/TIKA-1454
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
> Environment: RedHat EL5, EL6, EL7
>Reporter: Chris Bryant
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
> I am running tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.
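
Editor's sketch: the check described above (looking through the HTML output for anchor tags) can be reproduced with the JDK's SAX parser, assuming the output is well-formed XHTML. The class name `AnchorCollector` is hypothetical and this is not Tika code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; assumes well-formed XHTML input.
final class AnchorCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The factory below is not namespace-aware, so the tag name arrives in qName.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    /** Returns the href of every anchor element, in document order. */
    static List<String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        AnchorCollector handler = new AnchorCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.links;
    }
}
```

An empty list for a .pptx or .xlsx conversion, next to a non-empty one for the matching .odp or .ods file, would reproduce the reported difference.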





[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1640:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Make ExternalParser support aliases for key names in extracted metadata
> ---
>
> Key: TIKA-1640
> URL: https://issues.apache.org/jira/browse/TIKA-1640
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
> did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
> for this, but one thing Ray's code-based work did that my config-oriented 
> work didn't is allow renaming extracted metadata key names to better 
> support consistent metadata across parsers.
> Here's one way to do it:
> ExternalParser could have a config section like so:
> {code:xml}
> 
>   
>   
> 
> {code}
> Then this could be used to rename metadata keys.
> I'll implement that in this issue.
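
Editor's sketch: the renaming step itself amounts to a lookup in an alias table. The example below uses plain maps rather than Tika's Metadata class; the class name and any key names passed to it are hypothetical, and it does not reflect the configuration format proposed in the issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; plain maps stand in for Tika Metadata.
final class MetadataKeyAliases {

    /** Returns a copy of metadata with keys renamed per aliases; unlisted keys pass through. */
    static Map<String, String> rename(Map<String, String> metadata, Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = aliases.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(key, entry.getValue());
        }
        return renamed;
    }
}
```

With an alias table mapping an external tool's key (say "Image Width") to a preferred name, `rename` yields the same values under the preferred keys and leaves everything else untouched.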





[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1609:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Leverage Google's LibPhonenumber for enhanced phone number extraction and 
> metadata modeling
> ---
>
> Key: TIKA-1609
> URL: https://issues.apache.org/jira/browse/TIKA-1609
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Google's Libphonenumber can provide us with comprehensive support for 
> modeling Phone number metadata properly in Tika.
> During the development of this patch I realized three things, namely
>  * This is not a parser as such, as phone numbers are not mapped to any 
> particular MIME type
>  * In addition, there can be many phone numbers per document, so this is most 
> likely a Content Handler of sorts
>  * Tika's Metadata support is currently too restrictive to allow us to 
> persist many complex objects e.g. String, Object. We need to expand Metadata 
> support over and above String, String[].
> https://github.com/googlei18n/libphonenumber/





[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1688:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika Version in Metadata
> 
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
>  Issue Type: Improvement
>Reporter: Paul Ramirez
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Could this be added as X-Tika:version? That way there would be downstream 
> traceability of each extraction to the version that produced it.





[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancing the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
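
Editor's sketch: the trade-off in the proposal can be seen in a toy model where values are stored as Object and a legacy-style accessor still returns String[]. This is not Tika's Metadata class; `ObjectMetadata` and its methods are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model for illustration only; not Tika's Metadata API.
final class ObjectMetadata {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    /** Legacy-style accessor: only plain string values are visible through it. */
    String[] getValues(String key) {
        Object value = values.get(key);
        if (value instanceof String[]) {
            return (String[]) value;
        }
        if (value instanceof String) {
            return new String[] {(String) value};
        }
        return new String[0];   // structured values are invisible to old callers
    }
}
```

Storing a map of per-number attributes under "phonenumbers" works through `get`, while `getValues` returns an empty array for it, which is one concrete form of the backwards compatibility issue mentioned above.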





[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1505:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> chmparser breaks down when extracting from file of CHM format v3
> 
>
> Key: TIKA-1505
> URL: https://issues.apache.org/jira/browse/TIKA-1505
> Project: Tika
>  Issue Type: Bug
>Reporter: Bin Hawking
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> chmparser throws an exception or returns faulty text when:
> 1. extracting from a file of CHM format version 3
> 2. the CHM file has an LZX reset interval > 2
> 3. the CHM file has more than 5000 objects
> I am making the fix now.





[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1738:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> This is possibly Windows-specific behavior: the OS seems to hold a handle to 
> the file for a short while after the process is destroyed, causing the 
> delete() method to do nothing.
> This can be reproduced by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many 
> bootstrap jars that will never be deleted.
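
Editor's sketch: one general way to cope with a file handle that is released slightly late is to retry the delete a few times and fall back to `deleteOnExit()`. This is not the patch attached to the issue and not ForkClient code; the class name, attempt count and pause are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not the TIKA-1738 patch.
final class TempFileCleanup {

    /**
     * Tries to delete the file up to 'attempts' times, pausing between tries.
     * Returns true once the file is gone; otherwise schedules deletion at JVM
     * exit and returns false.
     */
    static boolean deleteWithRetry(Path path, int attempts, long pauseMillis) {
        for (int i = 0; i < attempts; i++) {
            try {
                Files.deleteIfExists(path);   // throws instead of silently returning false
                return true;
            } catch (IOException e) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        path.toFile().deleteOnExit();
        return false;
    }
}
```

Unlike `File.delete()`, `Files.deleteIfExists` reports the failure as an exception, so a caller can at least log why the bootstrap jar was left behind.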





[jira] [Updated] (TIKA-1390) Create tika-example module

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1390:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Create tika-example module
> --
>
> Key: TIKA-1390
> URL: https://issues.apache.org/jira/browse/TIKA-1390
> Project: Tika
>  Issue Type: Bug
>  Components: example
>Reporter: Tyler Bui-Palsulich
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> This issue will track the initial creation of the tika-example module. 
> Subtasks will be used for the first few examples.





[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1456:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Visual Sentiment API parser
> ---
>
> Key: TIKA-1456
> URL: https://issues.apache.org/jira/browse/TIKA-1456
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
>  Labels: gsoc, gsoc2016
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Integrate the Visual Sentibank API as a parser for images. We can use 
> Aperture from CMU, it's released under the MIT license:
> https://github.com/d8w/aperture





[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1598:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Parser Implementation for Streaming Video
> -
>
> Key: TIKA-1598
> URL: https://issues.apache.org/jira/browse/TIKA-1598
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: memex
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> A number of us have been discussing a Tika implementation which could, for 
> example, bind to a live multimedia stream and parse content from the stream 
> until it finished.
> An excellent example would be watching Bonnie Scotland beating R. of Ireland 
> in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
> 17:00 GMT :)
> I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
> http://sourceforge.net/projects/jffmpeg/
> I am not sure... plus it is not licensed liberally enough for us to include 
> so if there are other implementations then please post them here.
> I 'may' be able to have a crack at implementing this next week.





[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1674:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Add example to show how to extract embedded files
> -
>
> Key: TIKA-1674
> URL: https://issues.apache.org/jira/browse/TIKA-1674
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> On tika-user, we received a question on how to extract embedded files.  Let's 
> add an example.





[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1417:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Create Extract Embedded Images from PDFs Example
> 
>
> Key: TIKA-1417
> URL: https://issues.apache.org/jira/browse/TIKA-1417
> Project: Tika
>  Issue Type: Improvement
>  Components: example
>Reporter: Tyler Bui-Palsulich
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Users commonly want to "turn on" extraction of images embedded in PDFs (e.g. 
> TIKA-1414). Tika has the capability, but it's not clear how to use it.





[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1697:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Parser Implementation for AkomaNtoso Legal XML Documents
> 
>
> Key: TIKA-1697
> URL: https://issues.apache.org/jira/browse/TIKA-1697
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal 
> Document XML standard and used pervasively within parliaments and other 
> legislative arenas.
> This issue should utilize the 
> [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and 
> populate Metadata for AkomaNtoso .xml and .akn documents.
> I'll send a PR for this soon.





[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1465:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Implement extraction of non-global variables from netCDF3 and netCDF4
> -
>
> Key: TIKA-1465
> URL: https://issues.apache.org/jira/browse/TIKA-1465
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Speaking to Eric Nienhouse at the ongoing NSF funded Polar 
> Cyberinfrastructure hackathon in NYC, we became aware that variable 
> parameters contained within netCDF3 and netCDF4 are just as valuable as (if 
> not more valuable than) global attribute values. 
> AFAIK, right now we only extract global attributes; however, we could extend 
> the support to cater for the above observations.  





[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1609:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Leverage Google's LibPhonenumber for enhanced phone number extraction and 
> metadata modeling
> ---
>
> Key: TIKA-1609
> URL: https://issues.apache.org/jira/browse/TIKA-1609
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Google's Libphonenumber can provide us with comprehensive support for 
> modeling Phone number metadata properly in Tika.
> During the development of this patch I realized three things, namely
>  * This is not a parser as such, as phone numbers are not mapped to any 
> particular MIME type
>  * In addition, there can be many phone numbers per document, so this is most 
> likely a Content Handler of sorts
>  * Tika's Metadata support is currently too restrictive to allow us to 
> persist many complex objects e.g. String, Object. We need to expand Metadata 
> support over and above String, String[].
> https://github.com/googlei18n/libphonenumber/





[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1276:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1276_20140423_rwesten.diff, 
> TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
> TIKA-1276_20140428_rwesten.diff
>
>
> While updating from Tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one gets the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one does get the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle was started successfully. 
> However, when extracting EXIF metadata from a JPEG image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpe

[jira] [Updated] (TIKA-1952) Access Date is getting modified while capturing the MetaData information using AutoDetectParser

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1952:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Access Date is getting modified while capturing the MetaData information 
> using AutoDetectParser
> ---
>
> Key: TIKA-1952
> URL: https://issues.apache.org/jira/browse/TIKA-1952
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.12
> Environment: Windows
>Reporter: RameshKalidindi
>Priority: Major
>  Labels: features
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> I have been developing a project in which I capture the metadata 
> (such as File name, Author, File Extension, Last Modified Date and 
> Access Date) of each file in a folder using Tika's AutoDetectParser. I am 
> able to get the metadata for all files in a given folder, but my 
> issue is that the value of Access Date (a metadata attribute) is changed 
> to the current date and time, because the program accesses every 
> file while extracting the metadata.
> My issue: is there any way that I can get the last Access Date of the file, 
> or can we stop the Access Date from being changed by 
> Tika's AutoDetectParser? Please help me in this regard. 
> Note: This Access Date information is very important for my project; we 
> need to build reports based on it.
> Thanks,
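
Editor's sketch: independent of Tika, a caller can record a file's last access time before reading it and write the recorded value back afterwards using `java.nio.file`. The class name below is hypothetical, and whether the restored value sticks depends on the file system and platform.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

// Hypothetical helper for illustration only; not part of Tika.
final class AccessTimePreserver {

    /** Reads the last access time currently recorded for the file. */
    static FileTime readAccessTime(Path file) throws IOException {
        return Files.readAttributes(file, BasicFileAttributes.class).lastAccessTime();
    }

    /** Writes back a previously recorded access time; other timestamps stay unchanged. */
    static void restoreAccessTime(Path file, FileTime accessTime) throws IOException {
        // Passing null for last-modified and creation time leaves them as they are.
        Files.getFileAttributeView(file, BasicFileAttributeView.class)
             .setTimes(null, accessTime, null);
    }
}
```

The intended order is: call `readAccessTime` and keep the result for the report, run the parser, then call `restoreAccessTime` with the saved value.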





[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1738:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> This is possibly Windows-specific behavior: the OS seems to hold a handle to 
> the file for a short while after the process is destroyed, causing the 
> delete() method to do nothing.
> This can be reproduced by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many 
> bootstrap jars that will never be deleted.





[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1366:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
> 
>
> Key: TIKA-1366
> URL: https://issues.apache.org/jira/browse/TIKA-1366
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Some of Tika Server services will benefit from optionally supporting JAX-RS 
> 2.0 AsyncResponse





[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1674:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Add example to show how to extract embedded files
> -
>
> Key: TIKA-1674
> URL: https://issues.apache.org/jira/browse/TIKA-1674
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> On tika-user, we received a question on how to extract embedded files.  Let's 
> add an example.





[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancing the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection of HashMaps, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach. 
> Additionally, it is a fundamental change to the core Metadata API. I hope that 
> the String-to-Object mapping, however, is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
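
The proposal above can be sketched with plain Java collections. This is only an illustration under stated assumptions: ObjectMetadataSketch and its methods are hypothetical names, not Tika's Metadata API, and the String[] view is one possible way to soften the backwards compatibility concern raised in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this is not Tika's Metadata class.
class ObjectMetadataSketch {
    // Each key maps to an arbitrary value instead of a String[].
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Backwards-compatible view: flatten whatever is stored to a String[].
    String[] getValues(String key) {
        Object value = values.get(key);
        if (value == null) {
            return new String[0];
        }
        if (value instanceof String[]) {
            return ((String[]) value).clone();
        }
        if (value instanceof Map) {
            // For structured values, expose only the top-level keys
            // (here: the phone numbers themselves).
            Map<?, ?> map = (Map<?, ?>) value;
            String[] keys = new String[map.size()];
            int i = 0;
            for (Object k : map.keySet()) {
                keys[i++] = String.valueOf(k);
            }
            return keys;
        }
        return new String[] {String.valueOf(value)};
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", "US");
        attributes.put("LibPN-NumberType", "International");

        Map<String, Map<String, String>> numbers = new LinkedHashMap<>();
        numbers.put("+162648743476", attributes);

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        System.out.println(String.join(", ", metadata.getValues("phonenumbers")));
    }
}
```

Callers that only understand String[] would still see the phone numbers, while newer callers could use get() to reach the per-number attributes.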





[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1705:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Update ASM dependency to 5.0.4
> --
>
> Key: TIKA-1705
> URL: https://issues.apache.org/jira/browse/TIKA-1705
> Project: Tika
>  Issue Type: Task
>Affects Versions: 1.7
>Reporter: Uwe Schindler
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-1705-2.patch, TIKA-1705.patch
>
>
> Currently the Class file parser uses ASM 4.1. This older version cannot read 
> Java 8 / Java 9 class files (fails with Exception).
> The upgrade to ASM 5.0.4 is very simple: it is just a Maven dependency change. 
> The only code change is to update the visitor version so that new Java 8 
> features such as lambdas are reported; this is not strictly required, but 
> should be done for full support.
> FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
> 5, too.
> You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
> problem with Lucene using a newer version). Since ASM 4.x, updates are 
> easier (abstract classes instead of visitor interfaces), so nothing 
> breaks if you just replace the JAR file. So just see this as a 
> recommendation, not urgent! Solr/Lucene will also work without this patch 
> (it just replaces the shipped ASM with a newer version in our packaging).
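
To see which class files are affected, one can read the version number from the class file header. The following is a minimal sketch using only the JDK; ClassFileVersionSketch and its methods are hypothetical helper names and are not part of Tika or ASM. It relies on the class file layout (magic number 0xCAFEBABE, then minor and major version) and on Java 8 class files having major version 52.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; not part of Tika or ASM.
class ClassFileVersionSketch {
    // Returns the major version stored in bytes 6-7 of a class file.
    static int majorVersion(byte[] classBytes) {
        if (classBytes == null || classBytes.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer buffer = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buffer.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Missing class file magic number");
        }
        buffer.getShort();                 // minor version, not needed here
        return buffer.getShort() & 0xFFFF; // major version as unsigned value
    }

    // Java 8 class files use major version 52; older readers may reject them.
    static boolean isJava8OrNewer(byte[] classBytes) {
        return majorVersion(classBytes) >= 52;
    }

    public static void main(String[] args) {
        byte[] java8Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(majorVersion(java8Header));
    }
}
```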





[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1640:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Make ExternalParser support aliases for key names in extracted metadata
> ---
>
> Key: TIKA-1640
> URL: https://issues.apache.org/jira/browse/TIKA-1640
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
> did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
> for this, but one thing Ray's code-based work did that my config oriented 
> work didn't is allow for renaming extracted metadata key names to better 
> support having consistent metadata across parsers.
> Here's one way to do it:
> ExternalParser could have a config section like so:
> {code:xml}
> 
>   
>   
> 
> {code}
> Then this could be used to rename metadata keys.
> I'll implement that in this issue.
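
The renaming step described above can be sketched independently of the config format, whose XML did not survive in this archive. The class MetadataAliasSketch, the method applyAliases, and the key names in main are all assumptions made for illustration; this is not ExternalParser's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of alias-based key renaming; not ExternalParser code.
class MetadataAliasSketch {
    // Returns a copy of the extracted metadata with aliased keys renamed.
    // Keys without an alias are kept unchanged.
    static Map<String, String> applyAliases(Map<String, String> extracted,
                                            Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = aliases.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(key, entry.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Image Width", "640");   // illustrative tool output key
        extracted.put("File Name", "a.jpg");

        Map<String, String> aliases = new LinkedHashMap<>();
        aliases.put("Image Width", "tiff:ImageWidth"); // illustrative target key

        System.out.println(applyAliases(extracted, aliases));
    }
}
```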





[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1328:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Translate Metadata and Content
> --
>
> Key: TIKA-1328
> URL: https://issues.apache.org/jira/browse/TIKA-1328
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Tyler Bui-Palsulich
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Right now, Translation is only done on Strings. Ideally, users would be able 
> to "turn on" translation while parsing. I can think of a couple options:
> - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
> it, then translate the content.
> - Make a Context switch. When true, translate the content regardless of the 
> parser used. I'm not sure of the best way to implement this, but I prefer 
> it over another Parser.
> Regardless, we need a black or white list for translation. I think black list 
> would be the way to go -- which fields should not be translated (dates, 
> versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
> other open source translation libraries? If we were really lucky, it wouldn't 
> depend on an online service.
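
The black list idea can be sketched as follows. Everything here is an assumption for illustration: TranslateMetadataSketch is a hypothetical class, the translator is passed in as a plain function rather than Tika's translation interface, and the field names in main are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch of black-list based metadata translation.
class TranslateMetadataSketch {
    // Translates every value except those whose key is on the black list.
    static Map<String, String> translate(Map<String, String> metadata,
                                         Set<String> doNotTranslate,
                                         UnaryOperator<String> translator) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            boolean skip = doNotTranslate.contains(entry.getKey());
            String value = skip ? entry.getValue() : translator.apply(entry.getValue());
            result.put(entry.getKey(), value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "hola");
        metadata.put("created", "2014-05-01");

        Set<String> blackList = new HashSet<>(Arrays.asList("created", "version"));

        // Stand-in translator; a real one would call a translation library.
        UnaryOperator<String> fakeTranslator = text -> "[en] " + text;

        System.out.println(translate(metadata, blackList, fakeTranslator));
    }
}
```

Passing the translator as a function keeps the filtering logic independent of whichever translation backend is chosen.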





[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1616:
--
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Tika Parser for GIBS Metadata
> -
>
> Key: TIKA-1616
> URL: https://issues.apache.org/jira/browse/TIKA-1616
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
>  metadata currently consists of simple stuff in the WMTS GetCapabilities 
> request (e.g. 
> http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
> which includes available layers, extents, time ranges, map projections, color 
> maps, etc. We will eventually have more detailed visualization metadata 
> available in ECHO/CMR which will include linkages to data products, 
> provenance, etc. 
> Some investigation and a Tika parser would be excellent to extract and 
> assimilate GIBS Metadata.





[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1417:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> Create Extract Embedded Images from PDFs Example
> 
>
> Key: TIKA-1417
> URL: https://issues.apache.org/jira/browse/TIKA-1417
> Project: Tika
>  Issue Type: Improvement
>  Components: example
>Reporter: Tyler Bui-Palsulich
>Priority: Minor
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Users commonly want to "turn on" extraction of images embedded in PDFs (e.g. 
> TIKA-1414). Tika has the capability, but it's not clear how to use it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-1800) MediaType#parse does not decode escaped special characters

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1800:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> MediaType#parse does not decode escaped special characters
> --
>
> Key: TIKA-1800
> URL: https://issues.apache.org/jira/browse/TIKA-1800
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.11
>Reporter: Roberto Benedetti
>Priority: Major
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>
> Special characters in a parameter value are escaped in the canonical string 
> representation, but they are not unescaped when the canonical string 
> representation is parsed.
> {code:java}
> MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", 
> "#report@");
> String cType = mType.toString(); // application/xml; x-report="#report\@"
> assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success
> mType = MediaType.parse(cType);
> String report = mType.getParameters().get("x-report"); // #report\@
> assertEquals("#report@", report); // failure
> {code}
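
The missing step is the inverse of the escaping: when a quoted parameter value is parsed, each backslash should be dropped and the character after it kept. A minimal sketch of that unescaping, assuming backslash is the only escape character, is shown here; MediaTypeParamSketch and unescape are hypothetical names and not MediaType's actual code.

```java
// Hypothetical sketch of unescaping a quoted parameter value.
class MediaTypeParamSketch {
    // Removes one level of backslash escaping: "\x" becomes "x".
    static String unescape(String value) {
        StringBuilder result = new StringBuilder(value.length());
        boolean escaped = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (!escaped && c == '\\') {
                escaped = true; // drop the backslash, keep the next character
                continue;
            }
            result.append(c);
            escaped = false;
        }
        if (escaped) {
            result.append('\\'); // a trailing lone backslash is kept as-is
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The value from the report above, as it appears after parsing today.
        System.out.println(unescape("#report\\@"));
    }
}
```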



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1577:
--
Fix Version/s: (was: 2.0.0)
   2.0.1

> NetCDF Data Extraction
> --
>
> Key: TIKA-1577
> URL: https://issues.apache.org/jira/browse/TIKA-1577
> Project: Tika
>  Issue Type: Improvement
>  Components: handler, parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>Priority: Major
>  Labels: features, handler
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFParser currently extracts the "header part".  
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
>  
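
The two file variants named above can be told apart from the first four bytes of the header: the ASCII letters "CDF" followed by a version byte, 1 for classic and 2 for 64-bit offset. The sketch below checks only that; NetcdfMagicSketch is a hypothetical helper, not Tika's NetCDFParser, and it does not read dimensions, attributes or variable data.

```java
// Hypothetical helper that inspects the netCDF classic magic number.
class NetcdfMagicSketch {
    // Classifies a file by its first four bytes.
    static String classicVariant(byte[] header) {
        if (header == null || header.length < 4
                || header[0] != 'C' || header[1] != 'D' || header[2] != 'F') {
            return "not netCDF classic";
        }
        switch (header[3]) {
            case 1:
                return "classic";
            case 2:
                return "64-bit offset";
            default:
                return "unknown version";
        }
    }

    public static void main(String[] args) {
        byte[] header = {'C', 'D', 'F', 1};
        System.out.println(classicVariant(header));
    }
}
```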



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

