[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-08 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871
 ] 

Sam Stephens commented on TIKA-3768:


Ah, interesting, this is a case of me misunderstanding the product then.

This means that in order to actually get all the text possible out of a file, I 
need to examine both the actual text and the metadata (I'm using this for 
building a search over documents of many types).

The challenge then is that some fields in the metadata object are sourced from 
text in the document (such as {{dc:subject}} and {{Message-From}}) and should 
be searchable, while others (such as {{Content-Type}} and 
{{X-TIKA:Parsed-By}}) are not and should not be searchable.

Is there any documentation of the set of possible metadata fields? The 
constants inherited by 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] 
don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.

It looks to me like I could strip out fields like {{Content-Type}} as listed in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and 
any fields with names prefixed by {{X-TIKA:}}, and all remaining fields 
would be sourced from document text.
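
To make the filtering idea above concrete, here is a minimal sketch. It works on a plain map of field names to values rather than on Tika's Metadata class, so it runs without Tika on the classpath; with Tika, the map could presumably be filled from Metadata.names() and Metadata.getValues(name). The class name, the method name and the contents of the deny-list are my own assumptions for illustration, and whether every remaining field really comes from document text is exactly the open question in this comment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SearchableMetadataFilter {

    // Prefix used by Tika-internal fields such as X-TIKA:Parsed-By.
    private static final String TIKA_PREFIX = "X-TIKA:";

    // Assumption: an illustrative deny-list, not a complete or official one.
    private static final Set<String> NON_TEXT_FIELDS =
            Set.of("Content-Type", "Content-Length", "Content-Encoding");

    // Returns only the fields that are neither Tika-internal nor deny-listed.
    static Map<String, List<String>> searchableFields(Map<String, List<String>> metadata) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(TIKA_PREFIX) || NON_TEXT_FIELDS.contains(name)) {
                continue; // not sourced from document text under this heuristic
            }
            result.put(name, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("dc:subject", List.of("Quarterly report"));
        metadata.put("Message-From", List.of("sender@example.com"));
        metadata.put("Content-Type", List.of("message/rfc822"));
        metadata.put("X-TIKA:Parsed-By", List.of("org.apache.tika.parser.DefaultParser"));

        // Prints the names of the fields that would be indexed for search.
        System.out.println(searchableFields(metadata).keySet());
    }
}
```

A deny-list plus prefix check is the simplest form of this; an allow-list of known text-bearing fields would be safer if the set of possible fields turns out to be open-ended.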

> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However, these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there, given the "include everything" bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3788) Allow embedded exceptions and warnings to percolate to the parent's metadata

2022-06-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551812#comment-17551812
 ] 

Hudson commented on TIKA-3788:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #634 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/634/])
TIKA-3788 -- Record embedded file exceptions in the container file's metadata. 
(tallison: 
[https://github.com/apache/tika/commit/6f2ef64a582328fb13198c97d51205b4d469424e])
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) CHANGES.txt
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock/null_pointer.xml.gz
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/ParseRecord.java


> Allow embedded exceptions and warnings to percolate to the parent's metadata
> 
>
> Key: TIKA-3788
> URL: https://issues.apache.org/jira/browse/TIKA-3788
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.4.1
>
>
> As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  
> This can be used by parsers that parse embedded files to record caught 
> exceptions and warning messages.  The CompositeParser keeps track of depth of 
> its parse and when the depth returns to 0, it will write these exceptions and 
> warnings to the Metadata object.
> I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but 
> this new capability adds some functionality to the standard /tika (with json 
> output), and programmatically to the AutoDetectParser.
> Because this information is added to the metadata object _after_ the parse, 
> it will not come through in streaming contexts where the metadata object is 
> written to the xhtml before the content of the file is parsed.  So, this will 
> not add any benefit to /tika (text/html).





[jira] [Commented] (TIKA-3787) Keep processing on write limit reached

2022-06-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551811#comment-17551811
 ] 

Hudson commented on TIKA-3787:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #634 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/634/])
TIKA-3787 -- allow parse to continue after writelimit has been reached 
(tallison: 
[https://github.com/apache/tika/commit/7c93ddf7e3183fcbd811e04c1621455d961b1bb5])
* (edit) tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaResourceTest.java
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaPipesTest.java
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java
* (edit) tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaPipesTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
* (edit) 
tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonFetchEmitTupleTest.java
* (edit) CHANGES.txt
* (edit) 
tika-core/src/main/java/org/apache/tika/pipes/pipesiterator/PipesIterator.java
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaResourceTest.java
* (edit) 
tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java
* (edit) 
tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) tika-core/src/main/java/org/apache/tika/parser/ParseRecord.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/HandlerConfig.java
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (add) tika-core/src/main/java/org/apache/tika/sax/WriteLimiter.java
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/RecursiveMetadataResource.java


> Keep processing on write limit reached
> --
>
> Key: TIKA-3787
> URL: https://issues.apache.org/jira/browse/TIKA-3787
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.4.1
>
>
> In some use cases, users may want to keep parsing files even after the write 
> limit has been reached.  For example, in a container file, a user may want to 
> extract all the metadata from the embedded files even after the write limit 
> on the stream is reached.





[jira] [Commented] (TIKA-3788) Allow embedded exceptions and warnings to percolate to the parent's metadata

2022-06-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551786#comment-17551786
 ] 

Tim Allison commented on TIKA-3788:
---

Updated: 
https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared

> Allow embedded exceptions and warnings to percolate to the parent's metadata
> 
>
> Key: TIKA-3788
> URL: https://issues.apache.org/jira/browse/TIKA-3788
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.4.1
>
>
> As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  
> This can be used by parsers that parse embedded files to record caught 
> exceptions and warning messages.  The CompositeParser keeps track of depth of 
> its parse and when the depth returns to 0, it will write these exceptions and 
> warnings to the Metadata object.
> I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but 
> this new capability adds some functionality to the standard /tika (with json 
> output), and programmatically to the AutoDetectParser.
> Because this information is added to the metadata object _after_ the parse, 
> it will not come through in streaming contexts where the metadata object is 
> written to the xhtml before the content of the file is parsed.  So, this will 
> not add any benefit to /tika (text/html).





[jira] [Updated] (TIKA-3788) Allow embedded exceptions and warnings to percolate to the parent's metadata

2022-06-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3788:
--
Description: 
As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  This 
can be used by parsers that parse embedded files to record caught exceptions 
and warning messages.  The CompositeParser keeps track of depth of its parse 
and when the depth returns to 0, it will write these exceptions and warnings to 
the Metadata object.

I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but this 
new capability adds some functionality to the standard /tika (with json 
output), and programmatically to the AutoDetectParser.

Because this information is added to the metadata object _after_ the parse, it 
will not come through in streaming contexts where the metadata object is 
written to the xhtml before the content of the file is parsed.  So, this will 
not add any benefit to /tika (text/html).

  was:
As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  This 
can be used by parsers that parse embedded files to record caught exceptions 
and warning messages.  The CompositeParser keeps track of depth of its parse 
and when the depth returns to 0, it will write these exceptions and warnings to 
the Metadata object.

I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but this 
new capability adds some functionality to the standard /tika (with json 
output), and programmatically to the AutoDetectParser.

Because this information is added to the metadata object _after_ the parse, it 
will not come through in streaming contexts where the metadata object has is 
written to the xhtml before the content of the file is parsed.  So, this will 
not add any benefit to /tika (text/html).


> Allow embedded exceptions and warnings to percolate to the parent's metadata
> 
>
> Key: TIKA-3788
> URL: https://issues.apache.org/jira/browse/TIKA-3788
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.4.1
>
>
> As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  
> This can be used by parsers that parse embedded files to record caught 
> exceptions and warning messages.  The CompositeParser keeps track of depth of 
> its parse and when the depth returns to 0, it will write these exceptions and 
> warnings to the Metadata object.
> I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but 
> this new capability adds some functionality to the standard /tika (with json 
> output), and programmatically to the AutoDetectParser.
> Because this information is added to the metadata object _after_ the parse, 
> it will not come through in streaming contexts where the metadata object is 
> written to the xhtml before the content of the file is parsed.  So, this will 
> not add any benefit to /tika (text/html).





[jira] [Resolved] (TIKA-3787) Keep processing on write limit reached

2022-06-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3787.
---
Fix Version/s: 2.4.1
   Resolution: Fixed

> Keep processing on write limit reached
> --
>
> Key: TIKA-3787
> URL: https://issues.apache.org/jira/browse/TIKA-3787
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.4.1
>
>
> In some use cases, users may want to keep parsing files even after the write 
> limit has been reached.  For example, in a container file, a user may want to 
> extract all the metadata from the embedded files even after the write limit 
> on the stream is reached.
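
The behaviour described in the issue can be illustrated with a small sketch: a writer that stops accepting text once the limit is hit but does not throw, so the caller can keep walking the embedded files and collect their metadata. This is not the code from the commit above; SoftWriteLimiter and its methods are hypothetical names, and the two-element string arrays stand in for embedded documents.

```java
import java.util.ArrayList;
import java.util.List;

class SoftWriteLimiter {

    private final int limit;
    private final StringBuilder out = new StringBuilder();
    private boolean limitReached = false;

    SoftWriteLimiter(int limit) {
        this.limit = limit;
    }

    // Appends as much of the text as still fits; drops the rest without throwing.
    void write(String text) {
        int remaining = limit - out.length();
        if (text.length() <= remaining) {
            out.append(text);
            return;
        }
        if (remaining > 0) {
            out.append(text, 0, remaining);
        }
        limitReached = true;
    }

    boolean isLimitReached() {
        return limitReached;
    }

    String content() {
        return out.toString();
    }

    public static void main(String[] args) {
        // Each entry is {title, body} of a hypothetical embedded file.
        String[][] embedded = {
            {"a.txt", "first body"},
            {"b.txt", "second body"},
            {"c.txt", "third body"}
        };
        SoftWriteLimiter limiter = new SoftWriteLimiter(15);
        List<String> titles = new ArrayList<>();
        for (String[] doc : embedded) {
            titles.add(doc[0]);    // metadata is still collected for every file
            limiter.write(doc[1]); // text is silently truncated at the limit
        }
        System.out.println(titles + " limitReached=" + limiter.isLimitReached());
    }
}
```

The design choice is to signal the limit through a flag rather than an exception, which leaves the decision to stop or continue with the caller.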





[jira] [Resolved] (TIKA-3788) Allow embedded exceptions and warnings to percolate to the parent's metadata

2022-06-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3788.
---
Fix Version/s: 2.4.1
   Resolution: Fixed

> Allow embedded exceptions and warnings to percolate to the parent's metadata
> 
>
> Key: TIKA-3788
> URL: https://issues.apache.org/jira/browse/TIKA-3788
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.4.1
>
>
> As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  
> This can be used by parsers that parse embedded files to record caught 
> exceptions and warning messages.  The CompositeParser keeps track of depth of 
> its parse and when the depth returns to 0, it will write these exceptions and 
> warnings to the Metadata object.
> I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but 
> this new capability adds some functionality to the standard /tika (with json 
> output), and programmatically to the AutoDetectParser.
> Because this information is added to the metadata object _after_ the parse, 
> it will not come through in streaming contexts where the metadata object 
> is written to the xhtml before the content of the file is parsed.  So, this 
> will not add any benefit to /tika (text/html).





[jira] [Created] (TIKA-3788) Allow embedded exceptions and warnings to percolate to the parent's metadata

2022-06-08 Thread Tim Allison (Jira)
Tim Allison created TIKA-3788:
-

Summary: Allow embedded exceptions and warnings to percolate to the parent's metadata
Key: TIKA-3788
URL: https://issues.apache.org/jira/browse/TIKA-3788
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison


As part of work on TIKA-3787, I'll add a ParseRecord to the ParseContext.  This 
can be used by parsers that parse embedded files to record caught exceptions 
and warning messages.  The CompositeParser keeps track of depth of its parse 
and when the depth returns to 0, it will write these exceptions and warnings to 
the Metadata object.

I would still highly recommend /rmeta, -J, the RecursiveParserWrapper, but this 
new capability adds some functionality to the standard /tika (with json 
output), and programmatically to the AutoDetectParser.

Because this information is added to the metadata object _after_ the parse, it 
will not come through in streaming contexts where the metadata object is 
written to the xhtml before the content of the file is parsed.  So, this will 
not add any benefit to /tika (text/html).
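
The depth-tracking idea above can be sketched roughly as follows. This is a stand-alone illustration, not the actual ParseRecord or CompositeParser code: the class name, the method names and the metadata keys "embedded-exceptions" and "embedded-warnings" are all hypothetical, and a plain map stands in for the Metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DepthTrackingRecorder {

    private int depth = 0;
    private final List<String> exceptions = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();

    // Called when a parse (container or embedded) starts.
    void enter() {
        depth++;
    }

    // Called by an embedded-file parser that caught an exception.
    void recordException(Exception e) {
        exceptions.add(e.toString());
    }

    void recordWarning(String warning) {
        warnings.add(warning);
    }

    // Called when a parse ends; flushes to the metadata only at depth 0.
    void exit(Map<String, List<String>> metadata) {
        depth--;
        if (depth == 0) {
            if (!exceptions.isEmpty()) {
                metadata.put("embedded-exceptions", new ArrayList<>(exceptions));
            }
            if (!warnings.isEmpty()) {
                metadata.put("embedded-warnings", new ArrayList<>(warnings));
            }
        }
    }

    public static void main(String[] args) {
        DepthTrackingRecorder record = new DepthTrackingRecorder();
        Map<String, List<String>> metadata = new LinkedHashMap<>();

        record.enter();                       // container file, depth 1
        record.enter();                       // embedded file, depth 2
        record.recordException(new IllegalStateException("boom"));
        record.exit(metadata);                // back to depth 1, nothing written yet
        record.exit(metadata);                // depth 0, exceptions written out

        System.out.println(metadata);
    }
}
```

Because the flush happens only when the outermost parse finishes, anything that serialises the metadata before that point will not see these entries, which matches the streaming caveat above.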





[jira] [Commented] (TIKA-3751) General upgrades for 2.4.1

2022-06-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551519#comment-17551519
 ] 

Hudson commented on TIKA-3751:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #632 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/632/])
TIKA-3751: Update bndlib and aws (tilman: 
[https://github.com/apache/tika/commit/c4e83614ff12116ef83a7be30c0e18ec68dba8c2])
* (edit) tika-parent/pom.xml


> General upgrades for 2.4.1
> --
>
> Key: TIKA-3751
> URL: https://issues.apache.org/jira/browse/TIKA-3751
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>






[jira] [Commented] (TIKA-3780) General upgrades for 1.28.4

2022-06-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551472#comment-17551472
 ] 

Hudson commented on TIKA-3780:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #224 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/224/])
TIKA-3780: Update bndlib (tilman: 
[https://github.com/apache/tika/commit/f10296f251b604dc9ac7cdf037ba30b08d702189])
* (edit) tika-parent/pom.xml


> General upgrades for 1.28.4
> ---
>
> Key: TIKA-3780
> URL: https://issues.apache.org/jira/browse/TIKA-3780
> Project: Tika
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 1.28.4
>
>



