[GitHub] [tika] longphan98 opened a new pull request, #552: TIKA-1800 -- decode the escape character before parsing it as a new p…

2022-04-25 Thread GitBox


longphan98 opened a new pull request, #552:
URL: https://github.com/apache/tika/pull/552

   …arameter
   
   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-1800) MediaType#parse does not decode escaped special characters

2022-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527397#comment-17527397
 ] 

ASF GitHub Bot commented on TIKA-1800:
--

longphan98 opened a new pull request, #552:
URL: https://github.com/apache/tika/pull/552

   …arameter
   
   
   




> MediaType#parse does not decode escaped special characters
> ----------------------------------------------------------
>
>              Key: TIKA-1800
>              URL: https://issues.apache.org/jira/browse/TIKA-1800
>          Project: Tika
>       Issue Type: Bug
>       Components: core
> Affects Versions: 1.11
>         Reporter: Roberto Benedetti
>         Priority: Major
>          Fix For: 1.17, 2.0.0-BETA, 2.1.0
>
>
> Special characters in parameter value are escaped in canonical string 
> representation but they are not unescaped when the canonical string 
> representation is parsed.
> {code:java}
> MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", 
> "#report@");
> String cType = mType.toString(); // application/xml; x-report="#report\@"
> assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success
> mType = MediaType.parse(cType);
> String report = mType.getParameters().get("x-report"); // #report\@
> assertEquals("#report@", report); // failure
> {code}
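
The round trip described in the issue only succeeds if the parser removes one level of backslash escaping from the parameter value. The following is a minimal sketch of that step, not the code from pull request #552; the class and method names (`ParameterUnescape`, `unescape`) are hypothetical and the value is assumed to have had its surrounding quotes stripped already.

```java
public class ParameterUnescape {

    /**
     * Removes one level of backslash escaping from a parameter value,
     * so that "\@" becomes "@" and "\\" becomes "\".
     * Hypothetical helper for illustration; not part of the Tika API.
     */
    public static String unescape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean escaped = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (!escaped && c == '\\') {
                // Drop the backslash and take the next character literally.
                escaped = true;
                continue;
            }
            out.append(c);
            escaped = false;
        }
        if (escaped) {
            // A lone trailing backslash has nothing to escape; keep it.
            out.append('\\');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The value as it appears inside the quotes of the canonical string.
        System.out.println(unescape("#report\\@")); // prints #report@
    }
}
```

With a step like this applied during parsing, the value returned for `x-report` in the example above would be `#report@` rather than `#report\@`.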



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [tika] longphan98 commented on pull request #552: TIKA-1800 -- decode the escape character before parsing it as a new p…

2022-04-25 Thread GitBox


longphan98 commented on PR #552:
URL: https://github.com/apache/tika/pull/552#issuecomment-1108333512

   And also a test case too :3





[jira] [Commented] (TIKA-1800) MediaType#parse does not decode escaped special characters

2022-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527398#comment-17527398
 ] 

ASF GitHub Bot commented on TIKA-1800:
--

longphan98 commented on PR #552:
URL: https://github.com/apache/tika/pull/552#issuecomment-1108333512

   And also a test case too :3






[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527703#comment-17527703
 ] 

Tim Allison commented on TIKA-3719:
---

Stubbing toe now on this, [~tilman]. Again my apologies.

Separate topic.  I'd like to log a warning that this capability is in BETA and 
the configuration of it might change in future releases.  I want us to have the 
wiggle room to use the native cxf.xml instead of our hand-coded configuration 
going forward if that turns out to be a possibility.  The more we can offload 
to cxf, the better.

> Tika Server Ability to Run HTTPs
> --------------------------------
>
>              Key: TIKA-3719
>              URL: https://issues.apache.org/jira/browse/TIKA-3719
>          Project: Tika
>       Issue Type: Wish
>       Components: tika-server
> Affects Versions: 2.3.0
>         Reporter: Dan Coldrick
>         Assignee: Tim Allison
>         Priority: Minor
>          Fix For: 2.4.0
>
>      Attachments: image-2022-04-21-18-52-50-706.png, localhost.jks
>
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks





[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527706#comment-17527706
 ] 

Dan Coldrick commented on TIKA-3725:


[~tallison]  I see you've got some responses from the CXF guys :) Great news

> Add Authorization to Tika Server (Suggest Basic to start off with)
> ------------------------------------------------------------------
>
>              Key: TIKA-3725
>              URL: https://issues.apache.org/jira/browse/TIKA-3725
>          Project: Tika
>       Issue Type: New Feature
>       Components: tika-server
> Affects Versions: 2.3.0
>         Reporter: Dan Coldrick
>         Priority: Minor
>
> It would be good to get some authentication/authorization added to Tika 
> Server, to add another layer of security around the Tika Server REST 
> service.
> This could become a rabbit hole given the number of options available for 
> authentication/authorization (OAuth, OpenID, etc.), so I suggest adding 
> Basic auth as a starter. 
> As for how to store users/passwords, I suggest looking at how other Apache 
> products do the same.





[jira] [Comment Edited] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527706#comment-17527706
 ] 

Dan Coldrick edited comment on TIKA-3725 at 4/25/22 6:34 PM:
-

[~tallison]  I see you've got some responses from the CXF guys :) Great news

Quick question: is that thread only for Apache people, i.e. not open to the public?


was (Author: monkmachine):
[~tallison]  I see you've got some responses from the CXF guys :) Great news



[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527711#comment-17527711
 ] 

Dan Coldrick commented on TIKA-3719:


[~tallison]  Just stick something in confluence, that's where I get all my info 
(as user) from about tika server.



[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527713#comment-17527713
 ] 

Dan Coldrick commented on TIKA-3719:


Would also say if you want help with documenting stuff in confluence I'd be 
happy to help



Re: tika-main windows build fails in TikaResourceFetcherTest

2022-04-25 Thread Tilman Hausherr
.replaceAll() is also used in ExternalParser.java with a filename 
parameter. But no tests fail because of it.


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527719#comment-17527719
 ] 

Hudson commented on TIKA-3719:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #521 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/521/])
TIKA-3719 -- fix tests on Windows (tallison: 
[https://github.com/apache/tika/commit/00c2614b1a1a4b236d3d697b42e82e3dcc1a9fd5])
* (edit) 
tika-server/tika-server-core/src/test/resources/configs/tika-config-server-tls-two-way-template.xml
* (edit) 
tika-server/tika-server-core/src/test/resources/configs/tika-config-server-tls-one-way-template.xml
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java
TIKA-3719 -- fix tests on Windows (tallison: 
[https://github.com/apache/tika/commit/0f7d9263df1aa272ada1a4d150c35892721c2091])
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java




[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527726#comment-17527726
 ] 

Tim Allison commented on TIKA-3719:
---

[~monkmachine], what's your user name on confluence?  We're happy to grant 
write access.

Are you OK with BETA status?  If we find out from the CXF team that users can 
configure TLS and/or auth via the cxf.xml file, then I'd really like to offload 
that and remove the code we just added.



Re: tika-main windows build fails in TikaResourceFetcherTest

2022-04-25 Thread Tim Allison
Thank you for catching this, Tilman.  I do get a test failure on my
windows laptop after I installed exiftool. :(  Will fix.

On Mon, Apr 25, 2022 at 2:45 PM Tilman Hausherr wrote:
>
> .replaceAll() is also used in ExternalParser.java with a filename
> parameter. But no tests fail because of it.


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527747#comment-17527747
 ] 

Hudson commented on TIKA-3719:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #522 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/522/])
TIKA-3719 -- log warning about beta stage of tls configuration (tallison: 
[https://github.com/apache/tika/commit/b8669229f28ffb71977d573e17d0bffc6578a8ef])
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java
* (edit) CHANGES.txt




[jira] [Created] (TIKA-3730) New ExternalParser doesn't work on Windows

2022-04-25 Thread Tim Allison (Jira)
Tim Allison created TIKA-3730:
-

         Summary: New ExternalParser doesn't work on Windows
             Key: TIKA-3730
             URL: https://issues.apache.org/jira/browse/TIKA-3730
         Project: Tika
      Issue Type: Task
        Reporter: Tim Allison


[~tilman] noted that external2.ExternalParser calls "replaceAll" with a regex 
and a file path as the replacement, and that this does not work on Windows: the 
replaceAll strips the file separators.  I admit that I cannot figure out why 
this is happening.  I've tried a couple of combinations of backslashing etc., 
but nothing is working.  I even tried Pattern.quote() and that doesn't work on 
Windows either. 

If we back off to "replace" with a plain string, everything seems to work.
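
A possible explanation, offered as a sketch and not as a diagnosis of the actual ExternalParser code: in `String.replaceAll` the replacement string is not literal. A backslash escapes the following character and `$` introduces a group reference, so the backslashes in a Windows path are consumed. `Pattern.quote()` only quotes the pattern argument; the replacement counterpart is `Matcher.quoteReplacement()`. The placeholder token `${INPUT_FILE}` and the class name below are assumptions made for illustration.

```java
import java.util.regex.Matcher;

public class ReplacementQuoting {

    // Hypothetical placeholder token; the real ExternalParser token may differ.
    static final String TOKEN = "${INPUT_FILE}";
    // The same token written as a regex, with $, { and } escaped.
    static final String TOKEN_REGEX = "\\$\\{INPUT_FILE\\}";

    /** Unquoted replacement: each backslash in the path is treated as an escape and dropped. */
    public static String withReplaceAll(String command, String path) {
        return command.replaceAll(TOKEN_REGEX, path);
    }

    /** Quoted replacement: backslashes and dollar signs in the path are kept literally. */
    public static String withQuotedReplacement(String command, String path) {
        return command.replaceAll(TOKEN_REGEX, Matcher.quoteReplacement(path));
    }

    /** Plain string replacement: no regex processing on either argument. */
    public static String withReplace(String command, String path) {
        return command.replace(TOKEN, path);
    }

    public static void main(String[] args) {
        String command = "exiftool ${INPUT_FILE}";
        String path = "C:\\temp\\in.pdf"; // the literal path C:\temp\in.pdf

        System.out.println(withReplaceAll(command, path));        // exiftool C:tempin.pdf
        System.out.println(withQuotedReplacement(command, path)); // exiftool C:\temp\in.pdf
        System.out.println(withReplace(command, path));           // exiftool C:\temp\in.pdf
    }
}
```

When the token is a fixed string, `replace` is the simpler choice because neither argument is interpreted as a regex, which matches the workaround described above.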





[jira] [Resolved] (TIKA-3730) New ExternalParser doesn't work on Windows

2022-04-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3730.
---
Fix Version/s: 2.4.0
   Resolution: Fixed



[jira] [Updated] (TIKA-3730) New ExternalParser doesn't work on Windows

2022-04-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3730:
--
Priority: Trivial  (was: Major)



[GitHub] [tika] Gagravarr commented on pull request #552: TIKA-1800 -- decode the escape character before parsing it as a new p…

2022-04-25 Thread GitBox


Gagravarr commented on PR #552:
URL: https://github.com/apache/tika/pull/552#issuecomment-1109073795

   Your commit seems to remove a test zip file, which seems to be by accident, 
any chance you could re-do it without the accidental deletion?
   
   Would you also be able to add a comment explaining what your new for loop is 
doing, so anyone looking at that code later can quickly figure out what it's 
doing?





[jira] [Commented] (TIKA-1800) MediaType#parse does not decode escaped special characters

2022-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527775#comment-17527775
 ] 

ASF GitHub Bot commented on TIKA-1800:
--

Gagravarr commented on PR #552:
URL: https://github.com/apache/tika/pull/552#issuecomment-1109073795

   Your commit seems to remove a test zip file, which seems to be by accident, 
any chance you could re-do it without the accidental deletion?
   
   Would you also be able to add a comment explaining what your new for loop is 
doing, so anyone looking at that code later can quickly figure out what it's 
doing?






[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527783#comment-17527783
 ] 

Dan Coldrick commented on TIKA-3719:


Hi [~tallison] 

Yes happy with beta, be really good if the CXF guys can have a review (which 
looks like they are going to) and extend to take cxf.xml files with all that 
entails. Honestly can't thank you enough for the help you've provided. :)

My Confluence name is Dan Coldrick



[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527784#comment-17527784
 ] 

Dan Coldrick commented on TIKA-3719:


Also is it possible to link to confluence from the main tika page and make it 
stand out more? Confluence has a lot more detail than the main tika page which 
I've always found to be more useful (might also help I'm a massive fan of 
confluence)



[jira] [Created] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-25 Thread Dan Coldrick (Jira)
Dan Coldrick created TIKA-3731:
--

         Summary: Tika CAD DWG reader not pulling meta data from new cad files
             Key: TIKA-3731
             URL: https://issues.apache.org/jira/browse/TIKA-3731
         Project: Tika
      Issue Type: Bug
      Components: metadata
Affects Versions: 2.3.0
        Reporter: Dan Coldrick


 

The Tika DWG reader only pulls metadata from files up to drawing format AC1024 
(see the code snippet). It looks like AC1027 and AC1032 can also be read by 
the same get2007and2010Props metadata extractor.
{code:java}
switch (version) {
    case "AC1015":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipTo2000PropertyInfoSection(stream, header)) {
            get2000Props(stream, metadata, xhtml);
        }
        break;
    case "AC1018":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipToPropertyInfoSection(stream, header)) {
            get2004Props(stream, metadata, xhtml);
        }
        break;
    case "AC1021":
    case "AC1024":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipToPropertyInfoSection(stream, header)) {
            get2007and2010Props(stream, metadata, xhtml);
        }
        break;
    default:
        throw new TikaException("Unsupported AutoCAD drawing version: " + version);
}
{code}
It looks like the case statement just needs extending, and example files need 
to be created for AC1027/AC1032. 

Current AutoCAD drawing version codes can be found here:

https://knowledge.autodesk.com/support/autocad/learn-explore/caas/sfdcarticles/sfdcarticles/drawing-version-codes-for-autocad.html
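
The proposed extension can be sketched in isolation as follows. This is not the Tika DWG parser: the class `DwgVersionDispatch` and the method `extractorFor` are hypothetical, and the sketch only returns the name of the extractor that each version code would be routed to. Whether AC1027 and AC1032 really share the 2007/2010 property layout is the assumption made in this issue and still needs to be confirmed against sample files.

```java
public class DwgVersionDispatch {

    /**
     * Returns the name of the property extractor that would handle the
     * given DWG version code. Hypothetical helper for illustration only.
     */
    public static String extractorFor(String version) {
        switch (version) {
            case "AC1015":
                return "get2000Props";
            case "AC1018":
                return "get2004Props";
            case "AC1021":
            case "AC1024":
            case "AC1027": // proposed addition, assumed to share the 2007/2010 layout
            case "AC1032": // proposed addition, assumed to share the 2007/2010 layout
                return "get2007and2010Props";
            default:
                throw new IllegalArgumentException(
                        "Unsupported AutoCAD drawing version: " + version);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("AC1027")); // prints get2007and2010Props
    }
}
```

Falling through from the new case labels keeps a single code path for all four versions, so any later fix to the shared extractor applies to the newer formats as well.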

 





[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527786#comment-17527786
 ] 

Dan Coldrick commented on TIKA-3731:


Related to https://issues.apache.org/jira/browse/TIKA-1735, but that issue also 
appeared to include a parser, so I thought it would be good to split the two 
issues and get this bug fixed. 



[jira] [Updated] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-25 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3731:
---
Attachment: testDWG-AC1027.dwg

> Tika CAD DWG reader not pulling meta data from new cad files
> 
>
> Key: TIKA-3731
> URL: https://issues.apache.org/jira/browse/TIKA-3731
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Major
> Attachments: AutoCAD 2018 format (1).dwg, testDWG-AC1027.dwg
>
>
>  
>  





[jira] [Updated] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-25 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3731:
---
Attachment: AutoCAD 2018 format (1).dwg

> Tika CAD DWG reader not pulling meta data from new cad files
> 
>
> Key: TIKA-3731
> URL: https://issues.apache.org/jira/browse/TIKA-3731
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Major
> Attachments: AutoCAD 2018 format (1).dwg, testDWG-AC1027.dwg
>
>
>  
>  





[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-25 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527787#comment-17527787
 ] 

Dan Coldrick commented on TIKA-3731:


I've attached AC1027 and AC1032 DWG files to extend the tests.

> Tika CAD DWG reader not pulling meta data from new cad files
> 
>
> Key: TIKA-3731
> URL: https://issues.apache.org/jira/browse/TIKA-3731
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Major
> Attachments: AutoCAD 2018 format (1).dwg, testDWG-AC1027.dwg
>
>
>  
>  





[jira] [Commented] (TIKA-3730) New ExternalParser doesn't work on Windows

2022-04-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527798#comment-17527798
 ] 

Hudson commented on TIKA-3730:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #524 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/524/])
TIKA-3730 (tallison: 
[https://github.com/apache/tika/commit/4639e8d3712fa015bcecdb1e6b89e8bd9e5e67fa])
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/external2/ExternalParser.java
* (edit) 
tika-core/src/test/java/org/apache/tika/parser/external2/ExternalParserTest.java
TIKA-3730 -- fix checkstyle; hang head in shame. (tallison: 
[https://github.com/apache/tika/commit/90c7e4c2d0f1ae1b5a8e559b2955820a5d743046])
* (edit) 
tika-core/src/test/java/org/apache/tika/parser/external2/ExternalParserTest.java


> New ExternalParser doesn't work on Windows
> --
>
> Key: TIKA-3730
> URL: https://issues.apache.org/jira/browse/TIKA-3730
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.4.0
>
>
> [~tilman] noted that external2.ExternalParser calls "replaceAll" with a 
> regex and a file path as the replacement, and that this does not work on 
> Windows.  The replaceAll strips the file separators.  I admit that I cannot 
> figure out why this is happening.  I've tried a couple of combinations of 
> backslashing, etc., but nothing is working.  I even tried Pattern.quote(), 
> and that doesn't work on Windows. 
> If we back off to use "replace" with a string, everything seems to work.
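
One likely explanation, not confirmed in the thread: String.replaceAll treats backslashes and dollar signs in the replacement string as escape and group-reference characters, so the separators in a Windows path are consumed. Pattern.quote() escapes only the pattern side and leaves the replacement untouched. The sketch below compares three approaches; the ${INPUT_FILE} token, the command string, and the class name ReplacementEscaping are assumptions made for illustration, not taken from ExternalParser.

```java
import java.util.regex.Matcher;

// Hypothetical sketch of replacing a placeholder token with a Windows path.
final class ReplacementEscaping {

    // Regex that matches the literal text ${INPUT_FILE} (illustrative token).
    static final String TOKEN_REGEX = "\\$\\{INPUT_FILE\\}";

    // The replacement is parsed for '\' and '$', so each backslash in the
    // path escapes the character after it and is dropped from the result.
    static String naive(String command, String path) {
        return command.replaceAll(TOKEN_REGEX, path);
    }

    // Matcher.quoteReplacement escapes '\' and '$' so the path is inserted
    // literally.
    static String quoted(String command, String path) {
        return command.replaceAll(TOKEN_REGEX, Matcher.quoteReplacement(path));
    }

    // String.replace does no regex or replacement parsing at all.
    static String literal(String command, String path) {
        return command.replace("${INPUT_FILE}", path);
    }

    public static void main(String[] args) {
        String command = "file -b ${INPUT_FILE}";
        String path = "C:\\tmp\\in.txt";
        System.out.println(naive(command, path));   // separators lost
        System.out.println(quoted(command, path));  // separators kept
        System.out.println(literal(command, path)); // separators kept
    }
}
```

This matches the observation that backing off to "replace" works. Where a regex is genuinely needed, Matcher.quoteReplacement is the usual way to keep the replacement literal.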





[jira] [Created] (TIKA-3732) Word doc MediaType detected as RTF

2022-04-25 Thread Caleb Postlethwait (Jira)
Caleb Postlethwait created TIKA-3732:


 Summary: Word doc MediaType detected as RTF
 Key: TIKA-3732
 URL: https://issues.apache.org/jira/browse/TIKA-3732
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 2.2.1
Reporter: Caleb Postlethwait
 Attachments: example.DOC

When executing Detector.detect(InputStream input, Metadata metadata) on a 
particular Word document, we're getting back a MediaType of RTF which has some 
downstream effects for us.
Here's the relevant bit of code:

TikaConfig config = TikaConfigFactory.getTikaConfig();
Detector detector = config.getDetector();
Metadata metadata = new Metadata();
stream = TikaInputStream.get(fis = new FileInputStream(paths));
metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, paths);
MediaType mediaType = detector.detect(stream, metadata);

Attaching the file that we came across this issue on.





[jira] [Commented] (TIKA-3732) Word doc MediaType detected as RTF

2022-04-25 Thread Ross Johnson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527822#comment-17527822
 ] 

Ross Johnson commented on TIKA-3732:


I took a quick look at the attached file in a hex editor and can confirm that 
it is indeed an RTF file despite the file extension being .DOC. It appears that 
Tika is detecting the type correctly.
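
A check like the one described can also be done in code by comparing the first bytes of the file against known signatures. The sketch below is a simplified illustration, not Tika's detection logic: the class MagicSniffer and its return labels are hypothetical. It relies on two well-known signatures: RTF files begin with the ASCII text {\rtf, and legacy binary Word files are OLE2 containers that begin with the bytes D0 CF 11 E0 A1 B1 1A E1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that classifies a file by its leading bytes only,
// ignoring the file extension.
final class MagicSniffer {

    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "rtf", "ole2" or "unknown" (labels chosen for this sketch).
    static String sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) {
            return "rtf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

A file named example.DOC whose content starts with {\rtf would be classified as "rtf" here, which is consistent with the detector result reported in the issue.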

> Word doc MediaType detected as RTF
> --
>
> Key: TIKA-3732
> URL: https://issues.apache.org/jira/browse/TIKA-3732
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.2.1
>Reporter: Caleb Postlethwait
>Priority: Major
> Attachments: example.DOC
>
>


