[GitHub] [tika] longphan98 opened a new pull request, #552: TIKA-1800 -- decode the escape character before parsing it as a new p…
longphan98 opened a new pull request, #552: URL: https://github.com/apache/tika/pull/552 …arameter Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-1800) MediaType#parse does not decode escaped special characters
[ https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527397#comment-17527397 ] ASF GitHub Bot commented on TIKA-1800: -- longphan98 opened a new pull request, #552: URL: https://github.com/apache/tika/pull/552 …arameter
> MediaType#parse does not decode escaped special characters > -- > > Key: TIKA-1800 > URL: https://issues.apache.org/jira/browse/TIKA-1800 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.11 >Reporter: Roberto Benedetti >Priority: Major > Fix For: 1.17, 2.0.0-BETA, 2.1.0 > > > Special characters in parameter value are escaped in canonical string > representation but they are not unescaped when the canonical string > representation is parsed. > {code:java} > MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", > "#report@"); > String cType = mType.toString(); // application/xml; x-report="#report\@" > assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success > mType = MediaType.parse(cType); > String report = mType.getParameters().get("x-report"); // #report\@ > assertEquals("#report@", report); // failure > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
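The round-trip failure described in the issue can be illustrated with a small stand-alone sketch. It assumes the canonical form escapes a special character by prefixing it with a backslash, as the `#report\@` example suggests. The class `UnescapeSketch` and its `unescape` method are hypothetical names used only for illustration; this is not Tika's `MediaType` code and not necessarily what PR #552 does.

```java
class UnescapeSketch {

    // Hypothetical helper: drops each escaping backslash and keeps the
    // character that follows it. A lone trailing backslash is dropped.
    static String unescape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        boolean escaped = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (!escaped && c == '\\') {
                // Remember that the next character is escaped and skip the backslash.
                escaped = true;
                continue;
            }
            sb.append(c);
            escaped = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The parameter value as it appears inside the canonical string.
        System.out.println(unescape("#report\\@")); // prints "#report@"
    }
}
```

With a step like this applied to each parsed parameter value, the second assertion in the issue's snippet would be expected to pass, assuming the rest of the parsing is unchanged.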
[GitHub] [tika] longphan98 commented on pull request #552: TIKA-1800 -- decode the escape character before parsing it as a new p…
longphan98 commented on PR #552: URL: https://github.com/apache/tika/pull/552#issuecomment-1108333512 And also a test case too :3
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527703#comment-17527703 ] Tim Allison commented on TIKA-3719: --- Stubbing toe now on this, [~tilman]. Again my apologies. Separate topic. I'd like to log a warning that this capability is in BETA and the configuration of it might change in future releases. I want us to have the wiggle room to use the native cxf.xml instead of our hand-coded configuration going forward if that turns out to be a possibility. The more we can offload to cxf, the better. > Tika Server Ability to Run HTTPs > > > Key: TIKA-3719 > URL: https://issues.apache.org/jira/browse/TIKA-3719 > Project: Tika > Issue Type: Wish > Components: tika-server >Affects Versions: 2.3.0 >Reporter: Dan Coldrick >Assignee: Tim Allison >Priority: Minor > Fix For: 2.4.0 > > Attachments: image-2022-04-21-18-52-50-706.png, localhost.jks > > > We need the ability to run TIKA server as a https end point, I can't see > anything in the config that allows for this. > Looks like I'm not the only one: > [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https] > > If anyone can point to some documentation on how it might be possible it > would be really appreciated. > > Thanks -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527706#comment-17527706 ] Dan Coldrick commented on TIKA-3725: [~tallison] I see you've got some responses from the CXF guys :) Great news > Add Authorization to Tika Server (Suggest Basic to start off with) > -- > > Key: TIKA-3725 > URL: https://issues.apache.org/jira/browse/TIKA-3725 > Project: Tika > Issue Type: New Feature > Components: tika-server >Affects Versions: 2.3.0 >Reporter: Dan Coldrick >Priority: Minor > > It would be good to get some Authentication/Authorization added to TIKA server > to be able to add another layer of security around the Tika Server Rest > service. > This could become a rabbit hole with the number of options available around > Authentication/Authorization (OAuth, OpenID etc.), so I suggest that basic auth is added > as a starter. > For storing user(s)/passwords, I suggest looking at how other Apache products do > the same. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527706#comment-17527706 ] Dan Coldrick edited comment on TIKA-3725 at 4/25/22 6:34 PM: - [~tallison] I see you've got some responses from the CXF guys :) Great news Quick question is that thread only for apache people? i.e. not open to public? was (Author: monkmachine): [~tallison] I see you've got some responses from the CXF guys :) Great news
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527711#comment-17527711 ] Dan Coldrick commented on TIKA-3719: [~tallison] Just stick something in confluence, that's where I get all my info (as user) from about tika server.
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527713#comment-17527713 ] Dan Coldrick commented on TIKA-3719: Would also say if you want help with documenting stuff in confluence I'd be happy to help
Re: tika-main windows build fails in TikaResourceFetcherTest
.replaceAll() is also used in ExternalParser.java with a filename parameter. But no tests fail because of it.
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527719#comment-17527719 ] Hudson commented on TIKA-3719: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #521 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/521/]) TIKA-3719 -- fix tests on Windows (tallison: [https://github.com/apache/tika/commit/00c2614b1a1a4b236d3d697b42e82e3dcc1a9fd5]) * (edit) tika-server/tika-server-core/src/test/resources/configs/tika-config-server-tls-two-way-template.xml * (edit) tika-server/tika-server-core/src/test/resources/configs/tika-config-server-tls-one-way-template.xml * (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java TIKA-3719 -- fix tests on Windows (tallison: [https://github.com/apache/tika/commit/0f7d9263df1aa272ada1a4d150c35892721c2091]) * (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaServerIntegrationTest.java
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527726#comment-17527726 ] Tim Allison commented on TIKA-3719: --- [~monkmachine], what's your user name on confluence? We're happy to grant write access. Are you ok w BETA status? If we find out from cxf team, that users can configure tls and/or auth via the cxf.xml file, then I'd really like to offload that and remove the code we just added.
Re: tika-main windows build fails in TikaResourceFetcherTest
Thank you for catching this, Tilman. I do get a test failure on my windows laptop after I installed exiftool. :( Will fix. On Mon, Apr 25, 2022 at 2:45 PM Tilman Hausherr wrote: > > .replaceAll() is also used in ExternalParser.java with a filename > parameter. But no tests fail because of it.
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527747#comment-17527747 ] Hudson commented on TIKA-3719: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #522 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/522/]) TIKA-3719 -- log warning about beta stage of tls configuration (tallison: [https://github.com/apache/tika/commit/b8669229f28ffb71977d573e17d0bffc6578a8ef]) * (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java * (edit) CHANGES.txt
[jira] [Created] (TIKA-3730) New ExternalParser doesn't work on Windows
Tim Allison created TIKA-3730: - Summary: New ExternalParser doesn't work on Windows Key: TIKA-3730 URL: https://issues.apache.org/jira/browse/TIKA-3730 Project: Tika Issue Type: Task Reporter: Tim Allison [~tilman] noted that the external2.ExternalParser uses "replaceAll" with a regex where the replacement is a file path, and that this does not work on Windows. The replaceAll strips the file separators. I admit that I cannot figure out why this is happening. I've tried a couple of combinations of backslashing etc., but nothing is working. I even tried Pattern.quote() and that doesn't work on Windows. If we back off to use "replace" with a string, everything seems to work. -- This message was sent by Atlassian Jira (v8.20.7#820007)
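One likely explanation for the stripped separators, offered here as a sketch rather than a diagnosis of the actual ExternalParser code: in `String.replaceAll` the replacement string is not literal. A backslash in the replacement escapes the next character and `$` introduces a group reference, so the backslashes in a Windows path are consumed. `Pattern.quote()` only quotes the pattern side, which would be consistent with it not helping. The `${INPUT}` token and the class name `ReplacementSketch` below are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;

class ReplacementSketch {

    // Regex matching the literal, hypothetical token ${INPUT}.
    private static final String TOKEN_REGEX = "\\$\\{INPUT\\}";

    // Replacement is interpreted: backslashes in the path are swallowed.
    static String naive(String template, String path) {
        return template.replaceAll(TOKEN_REGEX, path);
    }

    // Matcher.quoteReplacement makes the replacement literal.
    static String quoted(String template, String path) {
        return template.replaceAll(TOKEN_REGEX, Matcher.quoteReplacement(path));
    }

    // String.replace treats both the target and the replacement literally.
    static String literal(String template, String path) {
        return template.replace("${INPUT}", path);
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\in.txt"; // the Windows path C:\temp\in.txt
        System.out.println(naive("run ${INPUT}", path));   // prints "run C:tempin.txt"
        System.out.println(quoted("run ${INPUT}", path));  // prints "run C:\temp\in.txt"
        System.out.println(literal("run ${INPUT}", path)); // prints "run C:\temp\in.txt"
    }
}
```

Backing off to `replace`, as described in the issue, sidesteps the problem because no regex or replacement syntax is involved at all.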
[jira] [Resolved] (TIKA-3730) New ExternalParser doesn't work on Windows
[ https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3730. --- Fix Version/s: 2.4.0 Resolution: Fixed
[jira] [Updated] (TIKA-3730) New ExternalParser doesn't work on Windows
[ https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3730: -- Priority: Trivial (was: Major)
[GitHub] [tika] Gagravarr commented on pull request #552: TIKA-1800 -- decode the escape character before parsing it as a new p…
Gagravarr commented on PR #552: URL: https://github.com/apache/tika/pull/552#issuecomment-1109073795 Your commit seems to remove a test zip file, which seems to be by accident, any chance you could re-do it without the accidental deletion? Would you also be able to add a comment explaining what your new for loop is doing, so anyone looking at that code later can quickly figure out what it's doing?
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527783#comment-17527783 ] Dan Coldrick commented on TIKA-3719: Hi [~tallison] Yes happy with beta, be really good if the CXF guys can have a review (which looks like they are going to) and extend to take cxf.xml files with all that entails. Honestly can't thank you enough for the help you've provided. :) My Confluence name is Dan Coldrick
[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527784#comment-17527784 ] Dan Coldrick commented on TIKA-3719: Also is it possible to link to confluence from the main tika page and make it stand out more? Confluence has a lot more detail than the main tika page which I've always found to be more useful (might also help I'm a massive fan of confluence)
[jira] [Created] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files
Dan Coldrick created TIKA-3731: -- Summary: Tika CAD DWG reader not pulling meta data from new cad files Key: TIKA-3731 URL: https://issues.apache.org/jira/browse/TIKA-3731 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 2.3.0 Reporter: Dan Coldrick

The Tika DWG reader only pulls metadata from drawing formats up to AC1024 (see code snippet), although it looks like AC1027 and AC1032 can also be read by the same get2007and2010Props metadata extractor.

{code:java}
switch (version) {
    case "AC1015":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipTo2000PropertyInfoSection(stream, header)) {
            get2000Props(stream, metadata, xhtml);
        }
        break;
    case "AC1018":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipToPropertyInfoSection(stream, header)) {
            get2004Props(stream, metadata, xhtml);
        }
        break;
    case "AC1021":
    case "AC1024":
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        if (skipToPropertyInfoSection(stream, header)) {
            get2007and2010Props(stream, metadata, xhtml);
        }
        break;
    default:
        throw new TikaException("Unsupported AutoCAD drawing version: " + version);
}
{code}

Looks like the case statement just needs extending, and example files need to be created for AC1027/AC1032. Current AutoCAD drawing version codes can be found here: https://knowledge.autodesk.com/support/autocad/learn-explore/caas/sfdcarticles/sfdcarticles/drawing-version-codes-for-autocad.html -- This message was sent by Atlassian Jira (v8.20.7#820007)
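The proposed extension of the case statement could look roughly like the stand-alone sketch below. It only models the dispatch: `DwgVersionSketch` and `extractorFor` are hypothetical names, the method returns the name of the extractor from the snippet above instead of calling it, and it throws `IllegalArgumentException` where the real parser throws `TikaException`. Whether AC1027 and AC1032 files really share the 2007/2010 property layout is the reporter's assumption and would need to be confirmed against the attached test files.

```java
class DwgVersionSketch {

    // Hypothetical stand-in for the parser's switch: maps a DWG version
    // code to the name of the property extractor that would handle it.
    static String extractorFor(String version) {
        switch (version) {
            case "AC1015":
                return "get2000Props";
            case "AC1018":
                return "get2004Props";
            case "AC1021":
            case "AC1024":
            case "AC1027": // assumption: same layout as AC1021/AC1024
            case "AC1032": // assumption: same layout as AC1021/AC1024
                return "get2007and2010Props";
            default:
                throw new IllegalArgumentException(
                        "Unsupported AutoCAD drawing version: " + version);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("AC1032")); // prints "get2007and2010Props"
    }
}
```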
[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527786#comment-17527786 ] Dan Coldrick commented on TIKA-3731: related to https://issues.apache.org/jira/browse/TIKA-1735 but that looked to also try to include a parser so thought it would be good to split the two issues and get the bug fixed.
[jira] [Updated] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Coldrick updated TIKA-3731: --- Attachment: testDWG-AC1027.dwg
[jira] [Updated] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Coldrick updated TIKA-3731: --- Attachment: AutoCAD 2018 format (1).dwg
[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527787#comment-17527787 ]

Dan Coldrick commented on TIKA-3731:
------------------------------------

I've attached an AC1027 and an AC1032 DWG file to extend the tests.
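The change proposed in the TIKA-3731 description amounts to adding two case labels to the existing switch. The following is a minimal, self-contained sketch of that dispatch. It assumes, as the reporter suggests but as this thread has not verified, that AC1027 and AC1032 can share the 2007/2010 property reader. The class `DwgVersionDispatch`, the enum `PropsReader` and the method `readerFor` are hypothetical stand-ins for illustration and are not part of Tika.

```java
/** Hypothetical stand-in for the version switch quoted in TIKA-3731; not Tika code. */
class DwgVersionDispatch {

    /** Which property reader a given version string would be routed to. */
    enum PropsReader { PROPS_2000, PROPS_2004, PROPS_2007_AND_2010 }

    static PropsReader readerFor(String version) {
        switch (version) {
            case "AC1015":
                return PropsReader.PROPS_2000;
            case "AC1018":
                return PropsReader.PROPS_2004;
            case "AC1021":
            case "AC1024":
            // Assumption taken from the report: the newer formats reuse the same reader.
            case "AC1027":
            case "AC1032":
                return PropsReader.PROPS_2007_AND_2010;
            default:
                // The quoted code throws TikaException; a JDK exception keeps this sketch standalone.
                throw new IllegalArgumentException(
                        "Unsupported AutoCAD drawing version: " + version);
        }
    }

    public static void main(String[] args) {
        System.out.println(readerFor("AC1032"));
    }
}
```

Whether the shared reader really extracts correct metadata from the newer formats would still need to be confirmed against the attached sample files.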
[jira] [Commented] (TIKA-3730) New ExternalParser doesn't work on Windows
[ https://issues.apache.org/jira/browse/TIKA-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527798#comment-17527798 ]

Hudson commented on TIKA-3730:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #524 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/524/])
TIKA-3730 (tallison: [https://github.com/apache/tika/commit/4639e8d3712fa015bcecdb1e6b89e8bd9e5e67fa])
* (edit) tika-core/src/main/java/org/apache/tika/parser/external2/ExternalParser.java
* (edit) tika-core/src/test/java/org/apache/tika/parser/external2/ExternalParserTest.java
TIKA-3730 -- fix checkstyle; hang head in shame. (tallison: [https://github.com/apache/tika/commit/90c7e4c2d0f1ae1b5a8e559b2955820a5d743046])
* (edit) tika-core/src/test/java/org/apache/tika/parser/external2/ExternalParserTest.java

> New ExternalParser doesn't work on Windows
> ------------------------------------------
>
>                 Key: TIKA-3730
>                 URL: https://issues.apache.org/jira/browse/TIKA-3730
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 2.4.0
>
>
> [~tilman] noted that the external2.ExternalParser calls "replaceAll" with a
> regex and a file path as the replacement, and that this does not work on
> Windows: the replaceAll strips the file separators. I admit that I cannot
> figure out why this is happening. I've tried a couple of combinations of
> backslashing etc., but nothing is working. I even tried Pattern.quote() and
> that doesn't work on Windows.
> If we back off to use "replace" with a string, everything seems to work.
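A likely explanation for the behaviour described in TIKA-3730, offered as a sketch and not as a diagnosis of the actual ExternalParser code: in `String.replaceAll`, the replacement string is not taken literally. A backslash in it escapes the following character and `$` introduces a group reference, so the backslashes of a Windows path are consumed. `Pattern.quote()` only quotes the pattern side, which would explain why it did not help; `Matcher.quoteReplacement()` quotes the replacement side, and `String.replace` treats both arguments literally. The class name, the `${INPUT_FILE}` token and the sample path below are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: how the replacement argument is interpreted. */
class ReplacementDemo {

    // Hypothetical placeholder token in a command template.
    static final String TOKEN = "${INPUT_FILE}";

    /** Quotes the pattern only; backslashes in the replacement are still treated as escapes. */
    static String naive(String template, String path) {
        return template.replaceAll(Pattern.quote(TOKEN), path);
    }

    /** Quotes the replacement as well, so the path is inserted unchanged. */
    static String quoted(String template, String path) {
        return template.replaceAll(Pattern.quote(TOKEN), Matcher.quoteReplacement(path));
    }

    /** String.replace performs a literal substitution; no regex is involved. */
    static String literal(String template, String path) {
        return template.replace(TOKEN, path);
    }

    public static void main(String[] args) {
        String template = "cmd " + TOKEN;
        String path = "C:\\temp\\in.txt"; // the characters C:\temp\in.txt
        System.out.println(naive(template, path));   // separators are lost
        System.out.println(quoted(template, path));  // path preserved
        System.out.println(literal(template, path)); // path preserved
    }
}
```

This is consistent with the observation in the issue that backing off to "replace" with a plain string works.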
[jira] [Created] (TIKA-3732) Word doc MediaType detected as RTF
Caleb Postlethwait created TIKA-3732:
----------------------------------------

             Summary: Word doc MediaType detected as RTF
                 Key: TIKA-3732
                 URL: https://issues.apache.org/jira/browse/TIKA-3732
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 2.2.1
            Reporter: Caleb Postlethwait
         Attachments: example.DOC


When executing Detector.detect(InputStream input, Metadata metadata) on a particular Word document, we're getting back a MediaType of RTF which has some downstream effects for us.

Here's the relevant bit of code:

TikaConfig config = TikaConfigFactory.getTikaConfig();
Detector detector = config.getDetector();
Metadata metadata = new Metadata();
stream = TikaInputStream.get(fis = new FileInputStream(paths));
metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, paths);
*MediaType mediaType = detector.detect(stream, metadata);*

Attaching the file that we came across this issue on.
[jira] [Commented] (TIKA-3732) Word doc MediaType detected as RTF
[ https://issues.apache.org/jira/browse/TIKA-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527822#comment-17527822 ]

Ross Johnson commented on TIKA-3732:
------------------------------------

I took a quick look at the attached file in a hex editor and can confirm that it is indeed an RTF file despite the file extension being .DOC. It appears that Tika is detecting the type correctly.
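The hex-editor check described in the comment on TIKA-3732 can be approximated in a few lines. The sketch below is not Tika's detector; it only compares the leading bytes of a file against two well-known signatures: RTF documents begin with the ASCII characters `{\rtf`, while legacy binary Word documents are OLE2 containers that begin with the bytes D0 CF 11 E0 A1 B1 1A E1. The class `MagicSniffer`, its method names and the returned labels are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: content-based sniffing of a file's leading bytes, ignoring its extension. */
class MagicSniffer {

    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if data begins with every byte of prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns "rtf", "ole2" or "unknown" based on the header bytes alone. */
    static String sniff(byte[] header) {
        if (startsWith(header, RTF_MAGIC)) {
            return "rtf";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] header = "{\\rtf1\\ansi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(header));
    }
}
```

A file saved as RTF but named with a .DOC extension would be reported as "rtf" by a check of this kind, which matches what the comment describes.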