[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842471#comment-17842471
 ] 

Hudson commented on TIKA-4248:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1617 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1617/])
TIKA-4248 -- improve handling of attachments in PST (#1738) (github: 
[https://github.com/apache/tika/commit/de282d2861009895eecdb07784dceb5d777f372a])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/JSoupParser.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Office.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.java
* (add) tika-core/src/main/java/org/apache/tika/metadata/PST.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/PSTMailItemParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser


> Improve PST handling of attachments
> ---
>
> Key: TIKA-4248
> URL: https://issues.apache.org/jira/browse/TIKA-4248
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> The PST parser doesn't handle attachments in quite the same way as other 
> parsers, which hinders analysis of attachments.
> The problem is that the PST parser handles both the text content of an email 
> and its embedded attachments, and it processes attachments before the main 
> body. These two features make the normal patterns for embedded attachments 
> break down in the RecursiveParserWrapper. For example, when the attachments 
> are being processed, the RecursiveParserWrapper can't figure out what the 
> path will be through the "body" because that hasn't been parsed yet.
> We should probably create a PSTMailItemParser that handles the content and 
> the attachments like other parsers so that embedded paths can be maintained.
> This will be a breaking change, and I'm targeting it only to the 3.x branch.
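
The ordering problem described in the issue can be illustrated with a small, 
self-contained Java sketch. It does not use Tika's API: EmbeddedPathSketch, 
PathTracker, and the ids mail.pst, msg-1, and report.docx are hypothetical 
names invented for illustration, and the real RecursiveParserWrapper may track 
embedded paths quite differently. The only assumption is that a child's path 
can be built once its parent's path is known.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedPathSketch {

    /** Hypothetical tracker: a child's path is derived from its parent's path. */
    static final class PathTracker {
        private final Map<String, String> paths = new HashMap<>();

        PathTracker(String rootId) {
            // The container file itself is the root and has an empty path.
            paths.put(rootId, "");
        }

        /** Returns the child's path, or null if the parent is not registered yet. */
        String register(String parentId, String childId) {
            String parentPath = paths.get(parentId);
            if (parentPath == null) {
                return null; // parent has not been seen, so no path can be built
            }
            String path = parentPath + "/" + childId;
            paths.put(childId, path);
            return path;
        }
    }

    /** Attachment is handled before its message: the attachment path is unknown. */
    static List<String> attachmentsFirst() {
        PathTracker tracker = new PathTracker("mail.pst");
        List<String> result = new ArrayList<>();
        result.add(tracker.register("msg-1", "report.docx")); // parent unknown
        result.add(tracker.register("mail.pst", "msg-1"));
        return result;
    }

    /** Message is handled first, then its attachment: both paths resolve. */
    static List<String> messageFirst() {
        PathTracker tracker = new PathTracker("mail.pst");
        List<String> result = new ArrayList<>();
        result.add(tracker.register("mail.pst", "msg-1"));
        result.add(tracker.register("msg-1", "report.docx"));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(attachmentsFirst());
        System.out.println(messageFirst());
    }
}
```

In this toy model, handling the message before its attachments is what lets 
the attachment's path be resolved, which is roughly the behavior the proposed 
PSTMailItemParser is meant to restore.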



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842432#comment-17842432
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison merged PR #1738:
URL: https://github.com/apache/tika/pull/1738






[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842399#comment-17842399
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison opened a new pull request, #1738:
URL: https://github.com/apache/tika/pull/1738

   
   
   



