[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552078#comment-17552078
 ] 

Nick Burch commented on TIKA-3768:
----------------------------------

If we can put something into a properly typed + structured metadata field, we 
will!

The full list of metadata property definitions is spread across the interfaces 
in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/package-summary.html],
 grouped by type. Wherever possible, we reuse existing well-known definitions.

While we always store the metadata values as strings, the property definitions 
will help you turn them back into the underlying Java types, e.g. get the date 
back as a Java Date.
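
A minimal sketch of the idea in the comment above, using only the Java standard library. The classes TypedProperty and StringMetadata and the key "example:page-count" are hypothetical names made up for this illustration; they are not Tika's API. The sketch only shows how a property definition can carry the conversion from the stored string back to a Java type.

```java
import java.time.Instant;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: these classes are NOT Tika's API.
public class TypedMetadataSketch {

    /** A property definition: a key name plus a converter from the stored string. */
    static final class TypedProperty<T> {
        final String name;
        final Function<String, T> fromString;

        TypedProperty(String name, Function<String, T> fromString) {
            this.name = name;
            this.fromString = fromString;
        }
    }

    /** Values are always held as strings, as the comment describes. */
    static final class StringMetadata {
        private final Map<String, String> values = new HashMap<>();

        void set(String name, String value) {
            values.put(name, value);
        }

        /** Untyped access: returns the raw stored string, or null. */
        String get(String name) {
            return values.get(name);
        }

        /** Typed access: converts the stored string, or returns null if absent. */
        <T> T get(TypedProperty<T> property) {
            String raw = values.get(property.name);
            return raw == null ? null : property.fromString.apply(raw);
        }
    }

    // Key names assumed for the example.
    static final TypedProperty<Date> CREATED =
            new TypedProperty<>("dcterms:created", s -> Date.from(Instant.parse(s)));
    static final TypedProperty<Integer> PAGE_COUNT =
            new TypedProperty<>("example:page-count", Integer::valueOf);

    public static void main(String[] args) {
        StringMetadata metadata = new StringMetadata();
        metadata.set(CREATED.name, "2022-06-09T10:15:30Z");
        metadata.set(PAGE_COUNT.name, "3");

        String raw = metadata.get(CREATED.name); // still a plain string
        Date created = metadata.get(CREATED);    // converted via the definition
        int pages = metadata.get(PAGE_COUNT);

        System.out.println(raw + " -> " + created.toInstant() + ", pages=" + pages);
    }
}
```

In Tika itself the equivalent lookup would go through the Metadata class and the Property definitions listed in the package linked above, for example Metadata.getDate(Property); check the javadoc for the exact methods available in your version.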

> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>              Key: TIKA-3768
>              URL: https://issues.apache.org/jira/browse/TIKA-3768
>          Project: Tika
>       Issue Type: Bug
>       Components: parser
> Affects Versions: 2.4.0
>         Reporter: Sam Stephens
>         Priority: Major
>      Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However, these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there, based on the include-everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (TIKA-3789) Allow parsers to pass embedded metadata to container file's metadata

2022-06-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-3789:
------------------------------

         Summary: Allow parsers to pass embedded metadata to container file's metadata
             Key: TIKA-3789
             URL: https://issues.apache.org/jira/browse/TIKA-3789
         Project: Tika
      Issue Type: Improvement
        Reporter: Tim Allison


There are some use cases where custom parsers might want to pass metadata from 
embedded files to the parent's metadata in the /tika (json) output or 
programmatically.

We can follow the pattern in TIKA-3788.





[jira] [Updated] (TIKA-3789) Allow parsers to pass embedded metadata to container file's metadata

2022-06-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3789:
------------------------------
Description: 
There are some use cases where custom parsers might want to pass metadata from 
embedded files to the parent's metadata in the /tika (json) output or 
programmatically.

We can follow the pattern in TIKA-3788.

As with TIKA-3788, this metadata will be written after the parse so it will not 
show up in standard xhtml output (e.g. /tika (html/xhtml) or programmatically 
in the XHTMLContentHandler).  However, it will appear in the json output option 
from /tika and in the Metadata object programmatically.

  was:
There are some use cases where custom parsers might want to pass metadata from 
embedded files to the parent's metadata in the /tika (json) output or 
programmatically.

We can follow the pattern in TIKA-3788.


> Allow parsers to pass embedded metadata to container file's metadata
> --------------------------------------------------------------------
>
>              Key: TIKA-3789
>              URL: https://issues.apache.org/jira/browse/TIKA-3789
>          Project: Tika
>       Issue Type: Improvement
>         Reporter: Tim Allison
>         Priority: Minor
>
> There are some use cases where custom parsers might want to pass metadata 
> from embedded files to the parent's metadata in the /tika (json) output or 
> programmatically.
> We can follow the pattern in TIKA-3788.
> As with TIKA-3788, this metadata will be written after the parse so it will 
> not show up in standard xhtml output (e.g. /tika (html/xhtml) or 
> programmatically in the XHTMLContentHandler).  However, it will appear in the 
> json output option from /tika and in the Metadata object programmatically.
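
A toy model of the timing described above, assuming nothing about Tika's internals: RecordingHandler, parse and the key "embedded:author" are hypothetical names invented for this sketch. It only illustrates why a value copied to the container after the parse never reaches the handler's output, while it is still visible on the metadata object returned to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the timing; none of these types are Tika classes.
public class PostParseMetadataSketch {

    /** Stand-in for a content handler: it records only what is written during the parse. */
    static final class RecordingHandler {
        final List<String> written = new ArrayList<>();

        void writeMetadataHeader(Map<String, String> metadata) {
            metadata.forEach((k, v) -> written.add("meta " + k + "=" + v));
        }

        void writeText(String text) {
            written.add("text " + text);
        }
    }

    /** Parses a container with one embedded file and returns the container metadata. */
    static Map<String, String> parse(RecordingHandler handler) {
        Map<String, String> container = new LinkedHashMap<>();
        container.put("Content-Type", "application/zip");

        // 1. Metadata known up front is emitted at the start of the output.
        handler.writeMetadataHeader(container);

        // 2. The embedded file is parsed; it reports a value meant for the parent.
        Map<String, String> forParent = new LinkedHashMap<>();
        forParent.put("embedded:author", "example author");
        handler.writeText("embedded body");

        // 3. Only after the parse are the embedded values copied to the container.
        container.putAll(forParent);
        return container;
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        Map<String, String> metadata = parse(handler);
        System.out.println("handler output:  " + handler.written);
        System.out.println("metadata object: " + metadata);
    }
}
```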





[jira] [Updated] (TIKA-3789) Allow parsers to pass embedded metadata to container file's metadata

2022-06-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3789:
------------------------------
Description: 
There are some use cases where custom parsers might want to pass metadata from 
embedded files to the parent's metadata in the /tika (json) output or 
programmatically.

We can follow the pattern in TIKA-3788.

As with TIKA-3788, this metadata will be written after the parse so it will not 
show up in standard xhtml output (e.g. /tika (html/xhtml) or programmatically 
in the XHTMLContentHandler).  However, it will appear in the json output option 
from /tika and in the Metadata object programmatically.

As with TIKA-3788, we encourage using the /rmeta endpoint, -J in tika-app or 
the RecursiveParserWrapper instead of this option.  However, for those who need 
to work with a flattened view of a document, this can be invaluable.

  was:
There are some use cases where custom parsers might want to pass metadata from 
embedded files to the parent's metadata in the /tika (json) output or 
programmatically.

We can follow the pattern in TIKA-3788.

As with TIKA-3788, this metadata will be written after the parse so it will not 
show up in standard xhtml output (e.g. /tika (html/xhtml) or programmatically 
in the XHTMLContentHandler).  However, it will appear in the json output option 
from /tika and in the Metadata object programmatically.


> Allow parsers to pass embedded metadata to container file's metadata
> --------------------------------------------------------------------
>
>              Key: TIKA-3789
>              URL: https://issues.apache.org/jira/browse/TIKA-3789
>          Project: Tika
>       Issue Type: Improvement
>         Reporter: Tim Allison
>         Priority: Minor
>
> There are some use cases where custom parsers might want to pass metadata 
> from embedded files to the parent's metadata in the /tika (json) output or 
> programmatically.
> We can follow the pattern in TIKA-3788.
> As with TIKA-3788, this metadata will be written after the parse so it will 
> not show up in standard xhtml output (e.g. /tika (html/xhtml) or 
> programmatically in the XHTMLContentHandler).  However, it will appear in the 
> json output option from /tika and in the Metadata object programmatically.
> As with TIKA-3788, we encourage using the /rmeta endpoint, -J in tika-app or 
> the RecursiveParserWrapper instead of this option.  However, for those who 
> need to work with a flattened view of a document, this can be invaluable.
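
To make the trade-off between the recursive and flattened views concrete, here is a small sketch. It assumes a recursive result can be modelled as a list of per-document maps with the container first; the class, method names and keys ("path", "author") are hypothetical and are not Tika's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model; plain maps stand in for metadata objects, not Tika classes.
public class FlattenedViewSketch {

    /** Builds one document's metadata from alternating key/value arguments. */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put(keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    /** Collapses the per-document list into a single multi-valued map. */
    static Map<String, List<String>> flatten(List<Map<String, String>> documents) {
        Map<String, List<String>> flat = new LinkedHashMap<>();
        for (Map<String, String> document : documents) {
            for (Map.Entry<String, String> e : document.entrySet()) {
                flat.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        // Recursive view: container first, then each embedded file.
        List<Map<String, String>> recursive = Arrays.asList(
                doc("path", "/", "author", "alice"),
                doc("path", "/attachment-1", "author", "bob"));

        Map<String, List<String>> flat = flatten(recursive);

        System.out.println("recursive: " + recursive);
        System.out.println("flattened: " + flat);
    }
}
```

In the recursive list each value stays attached to the document it came from; the flattened map is simpler to consume but no longer records which file contributed which value.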





[jira] [Commented] (TIKA-3789) Allow parsers to pass embedded metadata to container file's metadata

2022-06-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552262#comment-17552262
 ] 

Hudson commented on TIKA-3789:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #635 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/635/])
TIKA-3789: Allow custom embedded parsers and EmbeddedDocumentHandlers to add 
metadata to the container file's metadata (tallison: 
[https://github.com/apache/tika/commit/3778ecb131a379a8445b5cf5ce5cc9d37069f7f2])
* (edit) tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/ParseRecord.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock/embedded_to_parent_metadata.xml.gz
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java


> Allow parsers to pass embedded metadata to container file's metadata
> --------------------------------------------------------------------
>
>              Key: TIKA-3789
>              URL: https://issues.apache.org/jira/browse/TIKA-3789
>          Project: Tika
>       Issue Type: Improvement
>         Reporter: Tim Allison
>         Priority: Minor
>
> There are some use cases where custom parsers might want to pass metadata 
> from embedded files to the parent's metadata in the /tika (json) output or 
> programmatically.
> We can follow the pattern in TIKA-3788.
> As with TIKA-3788, this metadata will be written after the parse so it will 
> not show up in standard xhtml output (e.g. /tika (html/xhtml) or 
> programmatically in the XHTMLContentHandler).  However, it will appear in the 
> json output option from /tika and in the Metadata object programmatically.
> As with TIKA-3788, we encourage using the /rmeta endpoint, -J in tika-app or 
> the RecursiveParserWrapper instead of this option.  However, for those who 
> need to work with a flattened view of a document, this can be invaluable.





[jira] [Commented] (TIKA-3780) General upgrades for 1.28.4

2022-06-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552533#comment-17552533
 ] 

Hudson commented on TIKA-3780:
------------------------------

FAILURE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #225 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/225/])
TIKA-3780: Update uimafit-core (tilman: 
[https://github.com/apache/tika/commit/5bba052fa6338d8c8182a858d65152cb8940ae3b])
* (edit) tika-parsers/pom.xml


> General upgrades for 1.28.4
> ---------------------------
>
>              Key: TIKA-3780
>              URL: https://issues.apache.org/jira/browse/TIKA-3780
>          Project: Tika
>       Issue Type: Improvement
>         Reporter: Tilman Hausherr
>         Priority: Minor
>          Fix For: 1.28.4
>
>






[jira] [Commented] (TIKA-3751) General upgrades for 2.4.1

2022-06-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552553#comment-17552553
 ] 

Hudson commented on TIKA-3751:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #636 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/636/])
TIKA-3751: Update google cloud and aws (tilman: 
[https://github.com/apache/tika/commit/4131328ee2672e3e62688ce9c46bcd9cfc0d4ad8])
* (edit) tika-parent/pom.xml
TIKA-3751: Update azure-storage-blob (tilman: 
[https://github.com/apache/tika/commit/6d096d6aea059ba7e1d315cea72c772615d59046])
* (edit) tika-pipes/pom.xml


> General upgrades for 2.4.1
> --------------------------
>
>              Key: TIKA-3751
>              URL: https://issues.apache.org/jira/browse/TIKA-3751
>          Project: Tika
>       Issue Type: Task
>         Reporter: Tim Allison
>         Priority: Minor
>






[jira] [Commented] (TIKA-3751) General upgrades for 2.4.1

2022-06-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552575#comment-17552575
 ] 

Hudson commented on TIKA-3751:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #637 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/637/])
TIKA-3751: Update netty (tilman: 
[https://github.com/apache/tika/commit/fae27ea63db4866f0b27232caf38aa3f151044d5])
* (edit) tika-parsers/tika-parsers-ml/tika-age-recogniser/pom.xml


> General upgrades for 2.4.1
> --------------------------
>
>              Key: TIKA-3751
>              URL: https://issues.apache.org/jira/browse/TIKA-3751
>          Project: Tika
>       Issue Type: Task
>         Reporter: Tim Allison
>         Priority: Minor
>



