[GitHub] [tika] dependabot[bot] opened a new pull request, #567: Bump maven-enforcer-plugin from 3.0.0-M3 to 3.0.0

2022-05-19 Thread GitBox


dependabot[bot] opened a new pull request, #567:
URL: https://github.com/apache/tika/pull/567

   Bumps [maven-enforcer-plugin](https://github.com/apache/maven-enforcer) from 
3.0.0-M3 to 3.0.0.
   
   Commits
   
   - [`b1b2282`](https://github.com/apache/maven-enforcer/commit/b1b22822174bc92857a2e674c9a024035ee6d7cd) [maven-release-plugin] prepare release enforcer-3.0.0
   - [`70de3ad`](https://github.com/apache/maven-enforcer/commit/70de3ad6b6cf83505fe049896e37d90ac81e13f3) Lock maven-jxr-plugin
   - [`da3f888`](https://github.com/apache/maven-enforcer/commit/da3f8886d41522450c4b187a5f3562a4f6309610) Fix JavaDoc and lock sisu-maven-plugin
   - [`014253f`](https://github.com/apache/maven-enforcer/commit/014253f19260b04eedccfd00678b2777f93fa4e3) update CI url
   - [`5409be8`](https://github.com/apache/maven-enforcer/commit/5409be83dc3b621121e6222ad3830f8e95cf6614) [MENFORCER-211] wildcard ignore in requireReleaseDeps
   - [`335f26e`](https://github.com/apache/maven-enforcer/commit/335f26e39d1f20e157c46485481e36f858135a14) [MENFORCER-364] requireFilesExist rule should be case sensitive
   - [`faaf5c1`](https://github.com/apache/maven-enforcer/commit/faaf5c118bd9cda06cecca94ab3f9656c1cb7927) [MENFORCER-280] Enforcer dependency convergence stumbles on selenium-java
   - [`ab53fd9`](https://github.com/apache/maven-enforcer/commit/ab53fd99607eb36554f2fd3af41847ad9568a5ed) [MENFORCER-357] RequirePluginVersions not recognizing versions-from-properties
   - [`1b8ca8f`](https://github.com/apache/maven-enforcer/commit/1b8ca8f82815ec721e09abbd2330ce315893f2ed) [MENFORCER-388] Extends RequirePluginVersions with banMavenDefaults
   - [`ca73329`](https://github.com/apache/maven-enforcer/commit/ca73329888b925899f4f57419a1d2ed208b1e0c4) [MENFORCER-359] RequirePluginVersions fails when versions are inherited
   - Additional commits viewable in [compare view](https://github.com/apache/maven-enforcer/compare/enforcer-3.0.0-M3...enforcer-3.0.0)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.maven.plugins:maven-enforcer-plugin&package-manager=maven&previous-version=3.0.0-M3&new-version=3.0.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-3770) General upgrades for 1.28.3

2022-05-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539909#comment-17539909
 ] 

Hudson commented on TIKA-3770:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #210 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/210/])
TIKA-3770: revert update of jakarta.annotation-api, fails on jdk11+ (tilman: 
[https://github.com/apache/tika/commit/8dc66e598bd4d2a481b293a8bd61fe00ecc7a1d0])
* (edit) tika-parent/pom.xml
* (edit) tika-parsers/pom.xml


> General upgrades for 1.28.3
> ---
>
> Key: TIKA-3770
> URL: https://issues.apache.org/jira/browse/TIKA-3770
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated TIKA-3771:
-
Description: 
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I 
found that several hundred samples, out of 1M files of different types, are now 
being detected as EML. This is caused by the <match ... type="string" 
offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
Attached is a sample PNG file that triggers this (it also has another \nDate: 
value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime 
definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
effect, although this used to work in Tika-1.x. Was that change intentional? I 
think user definitions should take precedence over Tika definitions, since they 
can change depending on domain or context (e.g. the same extension may be used 
by different applications). If it wasn't intentional, I'll open another issue.

  was:
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I 
found that several hundred samples of different file types are now being 
detected as EML. This is caused by the <match ... type="string" 
offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
Attached is a sample PNG file that triggers this (it also has another \nDate: 
value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime 
definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
effect, although this used to work in Tika-1.x. Was that change intentional? I 
think user definitions should take precedence over Tika definitions, since they 
can change depending on domain or context (e.g. the same extension may be used 
by different applications). If it wasn't intentional, I'll open another issue.


> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> While running regression tests as part of upgrading from Tika 1.x to 
> Tika-2.4.0, I found that several hundred samples, out of 1M files of different 
> types, are now being detected as EML. This is caused by the <match ... 
> type="string" offset="0:1024"/> rule added in TIKA-3687 in the 
> minShouldMatch="2" clause. Attached is a sample PNG file that triggers this 
> (it also has another \nDate: value in the first 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 mime 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect, although this used to work in Tika-1.x. Was that change intentional? I 
> think user definitions should take precedence over Tika definitions, since 
> they can change depending on domain or context (e.g. the same extension may 
> be used by different applications). If it wasn't intentional, I'll open 
> another issue.





[jira] [Updated] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated TIKA-3771:
-
Description: 
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I 
found that several hundred samples of different file types are now being 
detected as EML. This is caused by the <match ... type="string" 
offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
Attached is a sample PNG file that triggers this (it also has another \nDate: 
value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime 
definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
effect, although this used to work in Tika-1.x. Was that change intentional? I 
think user definitions should take precedence over Tika definitions, since they 
can change depending on domain or context (e.g. the same extension may be used 
by different applications). If it wasn't intentional, I'll open another issue.

  was:
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I 
found that several hundred samples of different file types are now being 
detected as EML. This is caused by the <match ... type="string" 
offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
Attached is a sample PNG file that triggers this (it also has another \nDate: 
value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime 
definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
effect, although this used to work in Tika-1.x. Was that change intentional? I 
think user definitions should take precedence over Tika definitions, since they 
can change depending on domain or context (e.g. the same extension may be used 
by different applications). 


> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> While running regression tests as part of upgrading from Tika 1.x to 
> Tika-2.4.0, I found that several hundred samples of different file types are 
> now being detected as EML. This is caused by the <match ... type="string" 
> offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
> Attached is a sample PNG file that triggers this (it also has another \nDate: 
> value in the first 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 mime 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect, although this used to work in Tika-1.x. Was that change intentional? I 
> think user definitions should take precedence over Tika definitions, since 
> they can change depending on domain or context (e.g. the same extension may 
> be used by different applications). If it wasn't intentional, I'll open 
> another issue.





[jira] [Created] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-19 Thread Jira
Luís Filipe Nassif created TIKA-3771:


 Summary: Regression from TIKA-3687: Files wrongly detected as EML 
 Key: TIKA-3771
 URL: https://issues.apache.org/jira/browse/TIKA-3771
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Luís Filipe Nassif
 Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png

While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I 
found that several hundred samples of different file types are now being 
detected as EML. This is caused by the <match ... type="string" 
offset="0:1024"/> rule added in TIKA-3687 in the minShouldMatch="2" clause. 
Attached is a sample PNG file that triggers this (it also has another \nDate: 
value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime 
definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
effect, although this used to work in Tika-1.x. Was that change intentional? I 
think user definitions should take precedence over Tika definitions, since they 
can change depending on domain or context (e.g. the same extension may be used 
by different applications). 
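
To make the false-positive mechanism in this report concrete, here is a minimal 
Java sketch of a "min should match" check. It is not Tika's detector: the class 
and method names are hypothetical, the header patterns other than \nDate: are 
assumptions chosen for illustration, and the fixed 1024-byte window only 
approximates a 0:1024 offset range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: shows the "at least N of these patterns" idea, not Tika's code. */
final class MinShouldMatchSketch {

    /** Counts how many patterns occur anywhere in the first {@code window} bytes. */
    static int countMatches(byte[] data, List<String> patterns, int window) {
        int limit = Math.min(data.length, window);
        // ISO-8859-1 maps each byte to exactly one char, so binary data stays searchable.
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);
        int hits = 0;
        for (String pattern : patterns) {
            if (head.contains(pattern)) {
                hits++;
            }
        }
        return hits;
    }

    /** True when at least {@code minShouldMatch} patterns are found in the first 1024 bytes. */
    static boolean looksLikeEmail(byte[] data, List<String> patterns, int minShouldMatch) {
        return countMatches(data, patterns, 1024) >= minShouldMatch;
    }

    public static void main(String[] args) {
        // Hypothetical pattern list; the real rules live in tika-mimetypes.xml.
        List<String> patterns = Arrays.asList("\nDate:", "\nFrom:", "\nSubject:");
        // A PNG-like signature followed by text metadata containing two header-like strings.
        String content = (char) 0x89 + "PNG\r\n...tEXt\nDate: 2022-05-19\nFrom: scanner";
        byte[] png = content.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeEmail(png, patterns, 2));
    }
}
```

Because the patterns are searched anywhere in the window rather than anchored at 
the start of the file, any binary format that embeds text metadata early on can 
accumulate enough hits to cross the threshold.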





[jira] [Commented] (TIKA-3770) General upgrades for 1.28.3

2022-05-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539896#comment-17539896
 ] 

Hudson commented on TIKA-3770:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #209 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/209/])
TIKA-3770: update lombok and jakarta.annotation-api (tilman: 
[https://github.com/apache/tika/commit/4009fc6a518059141ff54d4a005bddab72954938])
* (edit) tika-parsers/pom.xml
* (edit) tika-parent/pom.xml


> General upgrades for 1.28.3
> ---
>
> Key: TIKA-3770
> URL: https://issues.apache.org/jira/browse/TIKA-3770
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2022-05-19 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539809#comment-17539809
 ] 

Dan Coldrick commented on TIKA-1570:


[~tallison] I've tested and it works, I've created a WIP page in confluence on 
how I got it to install as a Windows service.

I needed a break from DWG's so picked this up instead :) Feel free to butcher 
my confluence page:

[https://cwiki.apache.org/confluence/display/TIKA/TikaServer+Windows+Service+-+WIP]

 

> Seeking a stop method for better use with Apache Commons Daemon
> ---
>
> Key: TIKA-1570
> URL: https://issues.apache.org/jira/browse/TIKA-1570
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.7
>Reporter: Jason Borg
>Priority: Minor
> Fix For: 2.4.1
>
>
> I've got tika-server-1.7.jar from http://tika.apache.org/download.html
> I've downloaded v1.0.15 of the Windows binaries for Apache Commons Daemon 
> from http://commons.apache.org/proper/commons-daemon/binaries.html
> I can get Tika started as a service, but I can't determine what to use for a 
> stop method.
> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath 
> "C:\Tika Service\tika-server-1.7.jar" --StartClass 
> "org.apache.tika.server.TikaServerCli" --StopClass 
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main 
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java
> This starts, and works as I'd hope, but when trying to stop the service it 
> doesn't respond. Obviously org.apache.tika.server.TikaServerCli.main(string[] 
> args) isn't a suitable stop method, but I'm lost for alternatives.
> Using Daemon in exe mode works for start, but gives inconsistent results for 
> stop. Adding a stop method to Tika would be ideal.
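
For readers unfamiliar with the request above, one common shape for a stop hook 
is a pair of static methods taking String[] arguments, where the start method 
blocks until the stop method is called. The Java sketch below assumes the service 
would run in procrun's in-process jvm mode rather than the java mode used in the 
command above. The class and method names are hypothetical; this is not how Tika 
implemented the fix, which this thread does not show.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical service wrapper; names are illustrative, not part of Tika's API. */
final class ServiceLifecycleSketch {

    // Released exactly once, when the service manager asks the service to stop.
    private static final CountDownLatch STOP_SIGNAL = new CountDownLatch(1);

    /** Candidate for --StartMethod: blocks until stop() is called. */
    public static void start(String[] args) throws InterruptedException {
        // A real wrapper would launch the server here, for example on a worker thread.
        STOP_SIGNAL.await();
        // A real wrapper would shut the server down here before returning.
    }

    /** Candidate for --StopMethod: releases the thread blocked in start(). */
    public static void stop(String[] args) {
        STOP_SIGNAL.countDown();
    }
}
```

With a wrapper like this, the service definition would name the same class for 
both start and stop, with start and stop as the respective method names, instead 
of pointing both at main.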





[jira] [Commented] (TIKA-3770) General upgrades for 1.28.3

2022-05-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539700#comment-17539700
 ] 

Hudson commented on TIKA-3770:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #208 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/208/])
TIKA-3770: update micrometer (tilman: 
[https://github.com/apache/tika/commit/74b08d1234280da5e2475f1716b7a5437cfd])
* (edit) tika-server/pom.xml
TIKA-3770: update zstd-jni (tilman: 
[https://github.com/apache/tika/commit/7fd12a3f53773f10aa8a8ec7d640e25a87658188])
* (edit) tika-parsers/pom.xml


> General upgrades for 1.28.3
> ---
>
> Key: TIKA-3770
> URL: https://issues.apache.org/jira/browse/TIKA-3770
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>






Re: Automatic updates?

2022-05-19 Thread Tim Allison
It took me some googling... I had forgotten about that step.

Great news.  Thank you!

On Thu, May 19, 2022 at 12:27 PM Tilman Hausherr 
wrote:

> Am 19.05.2022 um 18:16 schrieb Tim Allison:
> > Hmmm... what do you see here: https://gitbox.apache.org/boxer/
>
> Thank you, I hadn't known about that. I have now linked my account. The
> appearance of
>
> https://github.com/apache/tika/pull/561/
>
> has now changed, cool!
>
> Tilman
>
>
>
> >
> > On Thu, May 19, 2022 at 11:58 AM Tilman Hausherr 
> > wrote:
> >
> >> I'm unable to show it now, but I never had a "merge" button. But I
> >> remember a "Only those with write access to this repository can merge
> >> pull requests" text, could it be that I need some additional
> permissions?
> >>
> >> Tilman
> >>
> >> Am 19.05.2022 um 14:00 schrieb Tim Allison:
> >>> I just click the button.  Is your username asfgit?  Kidding!  I have no
> >>> idea why your name isn't showing up.
> >>>
> >>> On Thu, May 19, 2022 at 12:03 AM Tilman Hausherr
> >>> wrote:
> >>>
> >>>> What do you do differently than I do? I noticed that in the recent
> >>>> commit, your name appears in the PR, mine doesn't.
> >>>>
> >>>> https://github.com/apache/tika/pull/562/
> >>>> https://github.com/apache/tika/pull/563/
> >>>> "asfgit"
> >>>>
> >>>> https://github.com/apache/tika/pull/564/
> >>>> "tballison"
> >>>>
> >>>> Could it be because I'm using my real mail address in commits and you're
> >>>> using your apache mail address?
> >>>>
> >>>> Tilman
> >>
>
>


Re: Automatic updates?

2022-05-19 Thread Tilman Hausherr

Am 19.05.2022 um 18:16 schrieb Tim Allison:

Hmmm... what do you see here: https://gitbox.apache.org/boxer/


Thank you, I hadn't known about that. I have now linked my account. The 
appearance of


https://github.com/apache/tika/pull/561/

has now changed, cool!

Tilman





On Thu, May 19, 2022 at 11:58 AM Tilman Hausherr 
wrote:


I'm unable to show it now, but I never had a "merge" button. But I
remember a "Only those with write access to this repository can merge
pull requests" text, could it be that I need some additional permissions?

Tilman

Am 19.05.2022 um 14:00 schrieb Tim Allison:

I just click the button.  Is your username asfgit?  Kidding!  I have no
idea why your name isn't showing up.

On Thu, May 19, 2022 at 12:03 AM Tilman Hausherr
wrote:


What do you do differently than I do? I noticed that in the recent
commit, your name appears in the PR, mine doesn't.

https://github.com/apache/tika/pull/562/
https://github.com/apache/tika/pull/563/
"asfgit"

https://github.com/apache/tika/pull/564/
"tballison"

Could it be because I'm using my real mail address in commits and you're
using your apache mail address?

Tilman






Re: Automatic updates?

2022-05-19 Thread Tim Allison
Hmmm... what do you see here: https://gitbox.apache.org/boxer/

On Thu, May 19, 2022 at 11:58 AM Tilman Hausherr 
wrote:

> I'm unable to show it now, but I never had a "merge" button. But I
> remember a "Only those with write access to this repository can merge
> pull requests" text, could it be that I need some additional permissions?
>
> Tilman
>
> Am 19.05.2022 um 14:00 schrieb Tim Allison:
> > I just click the button.  Is your username asfgit?  Kidding!  I have no
> > idea why your name isn't showing up.
> >
> > On Thu, May 19, 2022 at 12:03 AM Tilman Hausherr
> > wrote:
> >
> >> What do you do differently than I do? I noticed that in the recent
> >> commit, your name appears in the PR, mine doesn't.
> >>
> >> https://github.com/apache/tika/pull/562/
> >> https://github.com/apache/tika/pull/563/
> >> "asfgit"
> >>
> >> https://github.com/apache/tika/pull/564/
> >> "tballison"
> >>
> >> Could it be because I'm using my real mail address in commits and you're
> >> using your apache mail address?
> >>
> >> Tilman
>
>


Re: Automatic updates?

2022-05-19 Thread Tilman Hausherr
I'm unable to show it now, but I never had a "merge" button. But I 
remember a "Only those with write access to this repository can merge 
pull requests" text, could it be that I need some additional permissions?


Tilman

Am 19.05.2022 um 14:00 schrieb Tim Allison:

I just click the button.  Is your username asfgit?  Kidding!  I have no
idea why your name isn't showing up.

On Thu, May 19, 2022 at 12:03 AM Tilman Hausherr
wrote:


What do you do differently than I do? I noticed that in the recent
commit, your name appears in the PR, mine doesn't.

https://github.com/apache/tika/pull/562/
https://github.com/apache/tika/pull/563/
"asfgit"

https://github.com/apache/tika/pull/564/
"tballison"

Could it be because I'm using my real mail address in commits and you're
using your apache mail address?

Tilman




[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539607#comment-17539607
 ] 

Tim Allison commented on TIKA-3710:
---

The current main block is 40, which is intentionally below RFC822.

How's this look:

{noformat}

  
  




  
  
...
{noformat}

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594
 ] 

Nick Burch commented on TIKA-3710:
--

As a "normal" html file wouldn't start with these snippets, and they're already 
at a pretty high priority, I think we should just leave them in the 60 block 
along with the more typical starting tags we have there now.
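
The priority discussion in this thread can be pictured with a small Java sketch: 
when several magic rules match the same bytes, the candidate with the highest 
priority wins. The class below is hypothetical and is not Tika's selection code. 
The priorities 40 and 60 come from the comments in this thread, while the value 
50 for message/rfc822 is an assumption used only for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative only: highest-priority match wins. */
final class MagicPrioritySketch {

    /** A media type whose magic rule matched, together with that rule's priority. */
    static final class Candidate {
        final String type;
        final int priority;

        Candidate(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /** Returns the type of the highest-priority candidate, or empty if nothing matched. */
    static Optional<String> pick(List<Candidate> matched) {
        return matched.stream()
                .max(Comparator.comparingInt((Candidate c) -> c.priority))
                .map(c -> c.type);
    }

    public static void main(String[] args) {
        // HTML matched only in the lower-priority block: the email type wins.
        List<Candidate> lowHtml = Arrays.asList(
                new Candidate("text/html", 40),
                new Candidate("message/rfc822", 50)); // 50 is an assumed value
        System.out.println(pick(lowHtml).orElse("none"));

        // HTML matched in the higher-priority block: HTML wins.
        List<Candidate> highHtml = Arrays.asList(
                new Candidate("text/html", 60),
                new Candidate("message/rfc822", 50));
        System.out.println(pick(highHtml).orElse("none"));
    }
}
```

This is why moving a start-of-file HTML pattern into the higher block changes the 
outcome for the attached sample without touching the email rules themselves.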

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539590#comment-17539590
 ] 

Tim Allison commented on TIKA-3710:
---

Sounds good.  What do you think of breaking those out into a higher priority 
block as above?  Obv, we'll need to run this on a bunch of docs to see if this 
is overall a good change...

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582
 ] 

Nick Burch commented on TIKA-3710:
--

I was thinking we'd do (open)h1(close) or (open)h1(space) to cover both HTML 
cases but reduce the chances of a false positive match (+h2/h3)
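
The suggestion above amounts to requiring a delimiter right after the tag name. 
A minimal Java sketch of that check follows. The class and method names are 
hypothetical, and matching case-insensitively is an assumption that the comment 
does not state.

```java
import java.util.Locale;

/** Illustrative only: accepts "<hN>" or "<hN " for N in 1..3 at the start of the input. */
final class HeadingTagSketch {

    static boolean startsWithHeadingTag(String head) {
        if (head.length() < 4) {
            return false;
        }
        String prefix = head.substring(0, 4).toLowerCase(Locale.ROOT);
        char level = prefix.charAt(2);
        char delimiter = prefix.charAt(3);
        // Requiring '>' or ' ' after the level rejects look-alikes such as "<h1x".
        return prefix.startsWith("<h")
                && level >= '1' && level <= '3'
                && (delimiter == '>' || delimiter == ' ');
    }

    public static void main(String[] args) {
        System.out.println(startsWithHeadingTag("<h1>Forwarded message</h1>"));
        System.out.println(startsWithHeadingTag("<h1x not a heading"));
    }
}
```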

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539580#comment-17539580
 ] 

Tim Allison commented on TIKA-3710:
---

This works on the test file:

{noformat}

  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  




  
  
  
  
  
  
  
  
  
  

{noformat}

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Comment Edited] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539574#comment-17539574
 ] 

Tim Allison edited comment on TIKA-3710 at 5/19/22 2:25 PM:


Sorry, that comment must have referred to the patterns in that block that 
allowed content before the html tags.  The patterns currently require the 
{{Is it valid for a message/rfc822 message to have a bunch of preamble like the 
>HTML tags in my document before the headers? 
My memory is that we've seen some crazy headers before the usual rfc822 
headers.  I do not think we've seen html tags in those.

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539574#comment-17539574
 ] 

Tim Allison commented on TIKA-3710:
---

Sorry, that comment must have referred to the patterns in that block that 
allowed content before the html tags.  The patterns currently require the 
{{Is it valid for a message/rfc822 message to have a bunch of preamble like the 
>HTML tags in my document before the headers? 
My memory is that we've seen some crazy headers before the usual rfc822 
headers.  I do not think we've seen html tags in those.

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[GitHub] [tika] dependabot[bot] commented on pull request #566: Bump solr-solrj from 8.11.1 to 9.0.0

2022-05-19 Thread GitBox


dependabot[bot] commented on PR #566:
URL: https://github.com/apache/tika/pull/566#issuecomment-1131754159

   OK, I won't notify you about version 9.x.x again, unless you re-open this PR 
or update to a 9.x.x release yourself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] tballison commented on pull request #566: Bump solr-solrj from 8.11.1 to 9.0.0

2022-05-19 Thread GitBox


tballison commented on PR #566:
URL: https://github.com/apache/tika/pull/566#issuecomment-1131754057

   @dependabot ignore this major version
   
   Solrj 9 requires Java 11.  We can't upgrade while we're still on Java 8.





[GitHub] [tika] tballison closed pull request #566: Bump solr-solrj from 8.11.1 to 9.0.0

2022-05-19 Thread GitBox


tballison closed pull request #566: Bump solr-solrj from 8.11.1 to 9.0.0
URL: https://github.com/apache/tika/pull/566





[jira] [Commented] (TIKA-3770) General upgrades for 1.28.3

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539498#comment-17539498
 ] 

Tim Allison commented on TIKA-3770:
---

I had to make some subtle changes in how we were calling one of the underlying 
dl4j libraries.  I can look at the commit history in main and cherry-pick that 
into 1.x if anyone wants to update those dependencies.

> General upgrades for 1.28.3
> ---
>
> Key: TIKA-3770
> URL: https://issues.apache.org/jira/browse/TIKA-3770
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>






Re: Automatic updates?

2022-05-19 Thread Tim Allison
I just click the button.  Is your username asfgit?  Kidding!  I have no
idea why your name isn't showing up.

On Thu, May 19, 2022 at 12:03 AM Tilman Hausherr 
wrote:

> What do you do differently than I do? I noticed that in the recent
> commit, your name appears in the PR, mine doesn't.
>
> https://github.com/apache/tika/pull/562/
> https://github.com/apache/tika/pull/563/
> "asfgit"
>
> https://github.com/apache/tika/pull/564/
> "tballison"
>
> Could it be because I'm using my real mail address in commits and you're
> using your apache mail address?
>
> Tilman
>
> On 18.05.2022 at 15:36, Tim Allison wrote:
> > Oh, ok, phew.  Thank you, Tilman.  I remember seeing you merge some of
> > the others, and I agree, I'm not able to see that history now.
> >
> > As long as our AI overlords haven't taken control of our code without
> > some kind of manual review, all good.  Thank you.
> >
> > On Wed, May 18, 2022 at 8:37 AM Tilman Hausherr 
> > wrote:
> >
> >> The previous one (not this one) was me, I merged the branch locally and
> >> then pushed it.
> >>
> >> So there's still a manual step but somehow the history doesn't show
> >> this.
> >>
> >> Tilman
> >>
> >>
> >>
> >> --- Original Message ---
> >> From: Tim Allison
> >> Subject: Automatic updates?
> >> Date: 18 May 2022, 14:25
> >> To: 
> >>
> >>
> >>
> >>
> >> All,
> >> It feels like something changed in the last week with our dependabot
> >> integration. We used to get PRs. Now we're getting PRs that are
> >> automatically merged.
> >> I don't think this is a great idea. What do you think?
> >>
> >> Best,
> >>
> >> Tim
> >>
> >> On Wed, May 18, 2022 at 1:55 AM GitBox  wrote:
> >>
> >>> dependabot[bot] opened a new pull request, #562:
> >>> URL: https://github.com/apache/tika/pull/562
> >>>
> >>> Bumps [zstd-jni](https://github.com/luben/zstd-jni) from 1.5.2-2 to
> >>> 1.5.2-3.
> >>> 
> >>> Commits
> >>> 
> >>> - c983ae3 Adjust signature comments after
> >>>   e5c6a3290b8335db7c70877fda22ca26a96c72e4.
> >>>   https://github.com/luben/zstd-jni/commit/c983ae3e086b63a40e1bb430cb2ebf95ecc52c71
> >>> - 510bbd6 Add methods for streaming (de)compression of direct ByteBuffers.
> >>>   https://github.com/luben/zstd-jni/commit/510bbd6be80592227c6e5cf8cd8d71cb76c0c279
> >>> - 62b9dad Fix lgtm C++.
> >>>   https://github.com/luben/zstd-jni/commit/62b9dad49fc00f253cb35c1942c3ca6af4ee2b47
> >>> - 73ae46e v1.5.2-3
> >>>   https://github.com/luben/zstd-jni/commit/73ae46e1af16619143b7c87e35ad9c05363e2c97
> >>> - e5c6a32 Fix overflows
> >>>   https://github.com/luben/zstd-jni/commit/e5c6a3290b8335db7c70877fda22ca26a96c72e4
> >>> - 54d3d50 Fix some error return codes.
> >>>   https://github.com/luben/zstd-jni/commit/54d3d50c16d96bd8a30e2d4c0a9648001a52d6f9
> >>> - b788a2e Upgrade scala.
> >>>   https://github.com/luben/zstd-jni/commit/b788a2ed7a5e36e5252b1696e6cc8bae48a7afbc
> >>> - 3106093 Add NoFinalizer variants for the direct buffer streams.
> >>>   https://github.com/luben/zstd-jni/commit/31060934c26e080031465702ec369591e12874f8
> >>> - See full diff in compare view:
> >>>   https://github.com/luben/zstd-jni/compare/v1.5.2-2...v1.5.2-3
> >>> 
> >>> 
> >>> 
> >>>
> >>>
> >>> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.github.luben:zstd-jni=maven=1.5.2-2=1.5.2-3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
> >>>
> >>> Dependabot will resolve any conflicts with this PR as long as you don't
> >>> alter it yourself. You can also trigger a rebase manually by commenting
> >>> `@dependabot rebase`.
> >>>
> >>> [//]: # (dependabot-automerge-start)
> >>> [//]: # (dependabot-automerge-end)
> >>>
> >>> ---
> >>>
> >>> 
> >>> Dependabot commands and options
> >>> 
> >>>
> >>> You can trigger Dependabot actions by commenting on this PR:
> >>> - `@dependabot rebase` will rebase this PR
> >>> - `@dependabot recreate` will recreate this PR, overwriting any edits
> >>> that have been made to it
> >>> - `@dependabot merge` will merge this PR after your CI passes on it
> >>> - `@dependabot squash and merge` will squash and merge this PR after
> >>> your CI passes on it
> >>> - `@dependabot cancel merge` will cancel a previously requested merge
> >>> and block automerging
> >>> - `@dependabot reopen` will reopen this PR if it is closed
> >>> - `@dependabot close` will close this PR and stop Dependabot recreating
> >>> it. You can achieve the same result by closing it manually
> >>> - `@dependabot ignore this major version` will close this PR and stop
> >>> Dependabot creating any more for this major version (unless you reopen
> >> the
> 

Final reminder: ApacheCon North America call for presentations closing soon

2022-05-19 Thread Rich Bowen
[Note: You're receiving this because you are subscribed to one or more
Apache Software Foundation project mailing lists.]

This is your final reminder that the Call for Presentations for
ApacheCon North America 2022 will close at 00:01 GMT on Monday, May
23rd, 2022. Please don't wait! Get your talk proposals in now!

Details here: https://apachecon.com/acna2022/cfp.html

--Rich, for the ApacheCon Planners




[jira] [Commented] (TIKA-3770) General upgrades for 1.28.3

2022-05-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539362#comment-17539362
 ] 

Hudson commented on TIKA-3770:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #207 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/207/])
TIKA-3770: update uimaj-core (tilman: 
[https://github.com/apache/tika/commit/4c4b92811c71788e9a275f7765fcca074b3c11ec])
* (edit) tika-parsers/pom.xml


> General upgrades for 1.28.3
> ---
>
> Key: TIKA-3770
> URL: https://issues.apache.org/jira/browse/TIKA-3770
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>






[GitHub] [tika] dependabot[bot] opened a new pull request, #566: Bump solr-solrj from 8.11.1 to 9.0.0

2022-05-19 Thread GitBox


dependabot[bot] opened a new pull request, #566:
URL: https://github.com/apache/tika/pull/566

   Bumps solr-solrj from 8.11.1 to 9.0.0.
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.solr:solr-solrj=maven=8.11.1=9.0.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   





[GitHub] [tika] asfgit merged pull request #565: Bump aws.version from 1.12.222 to 1.12.223

2022-05-19 Thread GitBox


asfgit merged PR #565:
URL: https://github.com/apache/tika/pull/565

