[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842471#comment-17842471
 ] 

Hudson commented on TIKA-4248:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1617 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1617/])
TIKA-4248 -- improve handling of attachments in PST (#1738) (github: 
[https://github.com/apache/tika/commit/de282d2861009895eecdb07784dceb5d777f372a])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/JSoupParser.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Office.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.java
* (add) tika-core/src/main/java/org/apache/tika/metadata/PST.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/PSTMailItemParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser


> Improve PST handling of attachments
> ---
>
> Key: TIKA-4248
> URL: https://issues.apache.org/jira/browse/TIKA-4248
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> The PST parser doesn't handle attachments in quite the same way as other 
> parsers which hinders analysis of attachments.
> The problem is that the PST parser handles the text content of an email and 
> the embedded attachments. And, the PST parser processes attachments before 
> the main body. These two features make the normal patterns for embedded 
> attachments break down in the RecursiveParserWrapper. For example, when the 
> attachments are being processed, the RecursiveParserWrapper can't figure out 
> what the path will be through the "body" because that hasn't been parsed yet.
> We should probably create a PSTMailItemParser that handles the content and 
> the attachments like other parsers so that embedded paths can be maintained.
> This will be a breaking change, and I'm targeting it only to the 3.x branch.
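
The ordering problem described above can be sketched with a toy model. This is not Tika code: `EmbeddedPathDemo`, `start`, and `end` are hypothetical names, and the real RecursiveParserWrapper is more involved. The sketch only assumes that an embedded path is built from whichever containers are open at the moment an item is reported.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model only; none of these names are Tika classes. */
final class EmbeddedPathDemo {
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    /** Opens an item and records its path through the currently open containers. */
    void start(String name) {
        open.addLast(name);
        StringBuilder sb = new StringBuilder();
        for (String part : open) {
            sb.append('/').append(part);
        }
        paths.add(sb.toString());
    }

    /** Closes the most recently opened item. */
    void end() {
        open.removeLast();
    }

    /** Mail item is opened first, so the attachment nests under it. */
    static List<String> mailItemFirst() {
        EmbeddedPathDemo t = new EmbeddedPathDemo();
        t.start("message-1");
        t.start("report.pdf");
        t.end();
        t.end();
        return t.paths;
    }

    /** Attachment is reported before its mail item has been opened. */
    static List<String> attachmentFirst() {
        EmbeddedPathDemo t = new EmbeddedPathDemo();
        t.start("report.pdf");
        t.end();
        t.start("message-1");
        t.end();
        return t.paths;
    }

    public static void main(String[] args) {
        System.out.println(mailItemFirst());   // [/message-1, /message-1/report.pdf]
        System.out.println(attachmentFirst()); // [/report.pdf, /message-1]
    }
}
```

In the second ordering the attachment's path no longer passes through its message, which is the kind of breakdown the issue describes.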



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tika User (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tika User updated TIKA-4249:

Description: We recently upgraded from 2.9.0 to 2.9.2. After the upgrade, we 
found that the attached file is treated as a text file instead of an email 
file. Please look into this issue.  (was: We recently upgraded from 3.9.0 to 
3.9.2. After the upgrade, we found that the attached file is treated as a text 
file instead of an email file. Please look into this issue.)

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Blocker
>
> We recently upgraded from 2.9.0 to 2.9.2. After the upgrade, we found that 
> the attached file is treated as a text file instead of an email file. Please 
> look into this issue.





[jira] [Updated] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tika User (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tika User updated TIKA-4249:

Attachment: (was: Email_Received.txt)

> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Blocker
>
> We recently upgraded from 3.9.0 to 3.9.2. After the upgrade, we found that 
> the attached file is treated as a text file instead of an email file. Please 
> look into this issue.





[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842432#comment-17842432
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison merged PR #1738:
URL: https://github.com/apache/tika/pull/1738









Re: [PR] TIKA-4248 -- improve handling of attachments in PST [tika]

2024-04-30 Thread via GitHub


tballison merged PR #1738:
URL: https://github.com/apache/tika/pull/1738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842431#comment-17842431
 ] 

ASF GitHub Bot commented on TIKA-4249:
--

tballison opened a new pull request, #1739:
URL: https://github.com/apache/tika/pull/1739

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
   Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
     tracker](https://issues.apache.org/jira/projects/TIKA) which describes 
     the problem or the improvement. We cannot accept pull requests without 
     an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
       brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into 
     the *recent* `main` branch. If there are conflicts, please try to rebase 
     the pull request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it 
     to the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
   are met. If you have any questions about how to fix your problem or about 
   using Tika in general, please sign up for the [Tika mailing 
   list](http://tika.apache.org/mail-lists.html). Thanks!
   




> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
> Issue Type: Bug
> Reporter: Tika User
> Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgraded from 3.9.0 to 3.9.2. After the upgrade, we found that 
> the attached file is treated as a text file instead of an email file. Please 
> look into this issue.





[PR] TIKA-4249 -- allow utf8 bom in rfc822 [tika]

2024-04-30 Thread via GitHub


tballison opened a new pull request, #1739:
URL: https://github.com/apache/tika/pull/1739

   
   
   





[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842405#comment-17842405
 ] 

Tim Allison commented on TIKA-4249:
---

Files never cease to amaze!

Thank you. Onwards!






[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842403#comment-17842403
 ] 

Nick Burch commented on TIKA-4249:
--

I'd probably say we change the 0="From:" match into either 0="From:" or 
0=(UTF-8 BOM)"From:"; that should be a little less likely to produce false 
positives.

First time I've come across a Byte Order Mark at the start of an email file 
though!
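
A minimal sketch of why the stricter alternative suggested above should produce fewer false positives than an offset range such as 0:3. The helper names (`strictMatch`, `looseMatch`) are hypothetical and this is not Tika's detector code; the loose variant assumes the range covers start offsets 0 through 3 inclusive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative comparison only; the helper names are hypothetical, not Tika APIs. */
final class BomAwareMatchDemo {
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Strict: "From:" at offset 0, or a UTF-8 BOM immediately followed by "From:". */
    static boolean strictMatch(byte[] data) {
        byte[] from = "From:".getBytes(StandardCharsets.US_ASCII);
        // U+FEFF encodes to the three BOM bytes EF BB BF in UTF-8.
        byte[] bomFrom = "\uFEFFFrom:".getBytes(StandardCharsets.UTF_8);
        return startsWith(data, from) || startsWith(data, bomFrom);
    }

    /** Loose: "From:" starting at any offset from 0 to 3, whatever precedes it. */
    static boolean looseMatch(byte[] data) {
        byte[] from = "From:".getBytes(StandardCharsets.US_ASCII);
        for (int off = 0; off <= 3 && off + from.length <= data.length; off++) {
            if (startsWith(Arrays.copyOfRange(data, off, data.length), from)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] bommed = "\uFEFFFrom: a@example.org".getBytes(StandardCharsets.UTF_8);
        byte[] notMail = "re From: the office".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strictMatch(bommed));  // true
        System.out.println(looseMatch(bommed));   // true
        System.out.println(strictMatch(notMail)); // false
        System.out.println(looseMatch(notMail));  // true
    }
}
```

Both variants accept the BOM-prefixed message, but only the loose one also accepts ordinary text that happens to contain "From:" within the first few bytes.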






[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842402#comment-17842402
 ] 

Tim Allison commented on TIKA-4249:
---

Modifying the first hit from {{offset="0"}} to {{offset="0:3"}} works.
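
As a rough illustration of what an offset range buys here, the sketch below tries a pattern at every start position in a range. It is a simplified stand-in, not Tika's magic-matching code; `matchesInRange` is a hypothetical helper, and the sketch assumes both ends of the range are inclusive.

```java
import java.nio.charset.StandardCharsets;

/** Simplified stand-in for range-based magic matching; not Tika's implementation. */
final class OffsetRangeDemo {
    /** True if pattern starts at any offset in [start, end] (inclusive, an assumption). */
    static boolean matchesInRange(byte[] data, int start, int end, byte[] pattern) {
        for (int off = start; off <= end; off++) {
            if (off + pattern.length > data.length) {
                break;
            }
            boolean hit = true;
            for (int i = 0; i < pattern.length; i++) {
                if (data[off + i] != pattern[i]) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] from = "From:".getBytes(StandardCharsets.US_ASCII);
        // U+FEFF encodes to the three BOM bytes EF BB BF in UTF-8.
        byte[] bommed = "\uFEFFFrom: a@example.org\r\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchesInRange(bommed, 0, 0, from)); // false
        System.out.println(matchesInRange(bommed, 0, 3, from)); // true
    }
}
```

With a three-byte BOM in front, the header name starts at offset 3, so a range of 0 to 3 is just wide enough to find it.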






[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842401#comment-17842401
 ] 

Tim Allison commented on TIKA-4249:
---

I'm guessing you mean 2.9.0->2.9.2.

The challenge with this file is that there's a UTF-8 BOM at the beginning of 
the file, so our matching on, e.g., "From:" at offset 0 does not work.

[~nick], any recommendations?
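
To make the failure concrete, here is a minimal sketch of a fixed-offset prefix check. It is an illustration only, not Tika's detector; `matchesAt` and `withBom` are hypothetical helpers. The UTF-8 BOM is the byte sequence EF BB BF.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a fixed-offset prefix check, not Tika's detector. */
final class BomPrefixDemo {
    // UTF-8 byte order mark: EF BB BF
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if pattern occurs in data starting exactly at offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of body with the UTF-8 BOM prepended. */
    static byte[] withBom(byte[] body) {
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] from = "From:".getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "From: a@example.org\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bommed = withBom(plain);
        System.out.println(matchesAt(plain, 0, from));  // true
        System.out.println(matchesAt(bommed, 0, from)); // false
        System.out.println(matchesAt(bommed, 3, from)); // true
    }
}
```

The BOM pushes the first header three bytes into the file, so a check pinned to offset 0 sees EF BB BF instead of "From:".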






[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842399#comment-17842399
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison opened a new pull request, #1738:
URL: https://github.com/apache/tika/pull/1738

   
   
   









[PR] TIKA-4248 -- improve handling of attachments in PST [tika]

2024-04-30 Thread via GitHub


tballison opened a new pull request, #1738:
URL: https://github.com/apache/tika/pull/1738

   
   
   





[jira] [Created] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tika User (Jira)
Tika User created TIKA-4249:
---

Summary: EML file is treating it as text file in 3.9.2 version
Key: TIKA-4249
URL: https://issues.apache.org/jira/browse/TIKA-4249
Project: Tika
Issue Type: Bug
Reporter: Tika User
Attachments: Email_Received.txt

We recently upgraded from 3.9.0 to 3.9.2. After the upgrade, we found that the 
attached file is treated as a text file instead of an email file. Please look 
into this issue.


