[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504150#comment-17504150
 ] 

Nick Burch commented on TIKA-3684:
--

Same as Tika 2.x - pass a {{--config}} flag when you start the server

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are running the Tika Docker container as a Linux service. When we extract 
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> the text is returned 3 times.
> Note: we also run Tika server v1.14, and that version returns the text 
> as expected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [tika] tballison merged pull request #527: Bump dl4j.version from 1.0.0-M1.1 to 1.0.0-M2

2022-03-10 Thread GitBox


tballison merged pull request #527:
URL: https://github.com/apache/tika/pull/527


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504214#comment-17504214
 ] 

Tim Allison commented on TIKA-3696:
---

Unless we hear otherwise from 
https://github.com/webrecorder/wacz-spec/issues/41, let's go with 
{{application/wacz}} a subclass of {{application/vnd.datapackage}}, which in 
turn is a subclass of {{application/zip}}
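
A minimal sketch of the proposed hierarchy, assuming a plain parent map rather than Tika's actual media type registry. The three type names come from the comment above; the `SUPERTYPES` table and both helper functions are hypothetical and only illustrate what "subclass of" means for lookups.

```python
# Hypothetical parent map mirroring the hierarchy proposed in the comment.
# Illustration only: this is not Tika's media type registry or its API.
SUPERTYPES = {
    "application/wacz": "application/vnd.datapackage",
    "application/vnd.datapackage": "application/zip",
}


def ancestors(media_type):
    """Return the chain of supertypes for media_type, nearest first."""
    chain = []
    current = SUPERTYPES.get(media_type)
    while current is not None:
        chain.append(current)
        current = SUPERTYPES.get(current)
    return chain


def is_specialization_of(media_type, candidate):
    """True if candidate appears anywhere above media_type in the chain."""
    return candidate in ancestors(media_type)
```

With such a chain, anything that handles {{application/zip}} could in principle also be offered a wacz file, which is the practical effect of declaring the subclass relationship.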

> Add detection for wacz files
> 
>
> Key: TIKA-3696
> URL: https://issues.apache.org/jira/browse/TIKA-3696
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> https://webrecorder.github.io/wacz-spec/1.2.0/
> Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 
> 'pages'.
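
The description above suggests detection by looking for the standard entries inside the zip. The following is a rough sketch of that idea using only the Python standard library. The entry names come from the issue text; the matching rule (exact name or directory prefix) and the function name `looks_like_wacz` are assumptions, and this is not the detector that was added to Tika.

```python
import io
import zipfile

# Entry names as listed in the issue description.
WACZ_ENTRIES = ("archive", "datapackage.json", "indexes", "pages")


def looks_like_wacz(data: bytes) -> bool:
    """Heuristic check: a zip that contains all four standard entries.

    An entry counts as present if a member has that exact name or lives
    under a directory of that name (assumed rule, not taken from the spec).
    """
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()

    def present(entry):
        return any(n == entry or n.startswith(entry + "/") for n in names)

    return all(present(entry) for entry in WACZ_ENTRIES)


def make_zip(names):
    """Build an in-memory zip with one small member per name (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "x")
    return buf.getvalue()
```

A real detector would likely avoid reading the whole archive and stop as soon as the required entries are seen; the sketch favours clarity over efficiency.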





[jira] [Comment Edited] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504214#comment-17504214
 ] 

Tim Allison edited comment on TIKA-3696 at 3/10/22, 12:09 PM:
--

Unless we hear otherwise from 
https://github.com/webrecorder/wacz-spec/issues/41, let's go with 
{{application/wacz}} a subclass of {{application/vnd.datapackage}}, which in 
turn is a subclass of {{application/zip}}?


was (Author: talli...@mitre.org):
Unless we hear otherwise from 
https://github.com/webrecorder/wacz-spec/issues/41, let's go with 
{{application/wacz}} a subclass of {{application/vnd.datapackage}}, which in 
turn is a subclass of {{application/zip}}



[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504378#comment-17504378
 ] 

Nick Burch commented on TIKA-3696:
--

Shouldn't it be more like {{application/x-wacz}}, since it isn't a standard / 
official one?



[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504530#comment-17504530
 ] 

Tim Allison commented on TIKA-3696:
---

+1. Thank you!



[jira] [Updated] (TIKA-3696) Add detection for wacz files and frictionless data packages

2022-03-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3696:
--
Summary: Add detection for wacz files and frictionless data packages  (was: 
Add detection for wacz files)



[jira] [Resolved] (TIKA-3696) Add detection for wacz files and frictionless data packages

2022-03-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3696.
---
Fix Version/s: 2.3.1
Resolution: Fixed

> Add detection for wacz files and frictionless data packages
> ---
>
> Key: TIKA-3696
> URL: https://issues.apache.org/jira/browse/TIKA-3696
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.3.1
>
>
> https://webrecorder.github.io/wacz-spec/1.2.0/
> Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 
> 'pages'.





[jira] [Created] (TIKA-3698) Duplicate subject/description for Outlook msgs

2022-03-10 Thread Tim Allison (Jira)
Tim Allison created TIKA-3698:
-

Summary: Duplicate subject/description for Outlook msgs
Key: TIKA-3698
URL: https://issues.apache.org/jira/browse/TIKA-3698
Project: Tika
Issue Type: Task
Reporter: Tim Allison


On TIKA-3629, despite our best efforts to simplify and streamline metadata 
keys, we backed off and kept (or added back) both keywords _and_ subject.

Another area where we should probably include both is msg files.

POI's msg.getSubject() goes to "dc:title", and msg.getConversationTopic() 
goes to "dc:description".  Along the lines of what we did on TIKA-3629, I 
propose also adding msg.getConversationTopic() under the key "dc:subject".
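
To make the proposed mapping concrete, here is a small sketch under stated assumptions. The `Msg` class is a hypothetical stand-in for POI's message object and models only the two values mentioned above; the metadata is a plain dict rather than Tika's Metadata class. Only the three key names are taken from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for the parsed Outlook message."""
    subject: str
    conversation_topic: str


def msg_metadata(msg: Msg) -> dict:
    """Map message fields to metadata keys as described in the proposal."""
    metadata = {}
    metadata["dc:title"] = msg.subject
    metadata["dc:description"] = msg.conversation_topic
    # Proposed addition: the conversation topic is also exposed as dc:subject.
    metadata["dc:subject"] = msg.conversation_topic
    return metadata
```

The conversation topic is deliberately stored twice, so that consumers looking for either "dc:description" or "dc:subject" find it.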





[jira] [Resolved] (TIKA-3698) Duplicate subject/description for Outlook msgs

2022-03-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3698.
---
Fix Version/s: 2.3.1
Resolution: Fixed

> Duplicate subject/description for Outlook msgs
> --
>
> Key: TIKA-3698
> URL: https://issues.apache.org/jira/browse/TIKA-3698
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 2.3.1
>





[jira] [Commented] (TIKA-3697) Add parser for warc files

2022-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504549#comment-17504549
 ] 

Tim Allison commented on TIKA-3697:
---

I haven't heard any objections.  I'll plan to put this into 
"tika-parsers-standard."

> Add parser for warc files
> -
>
> Key: TIKA-3697
> URL: https://issues.apache.org/jira/browse/TIKA-3697
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> netpreserve's jwarc is ASL 2.0, fairly small, and has no dependencies.
> Should we add this to tika-parsers-standard or create a separate package 
> for it in tika-parsers-extended?


