[jira] [Commented] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11
[ https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351272#comment-16351272 ] ASF GitHub Bot commented on TIKA-2565: -- lewismc opened a new pull request #218: TIKA-2565 Upgrade edu.ucar dependencies to 4.6.11 URL: https://github.com/apache/tika/pull/218 This issue addresses https://issues.apache.org/jira/browse/TIKA-2565 and supersedes https://github.com/apache/tika/pull/212 @med-ali-bannour FYI This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade edu.ucar dependencies to 4.6.11 > --- > > Key: TIKA-2565 > URL: https://issues.apache.org/jira/browse/TIKA-2565 > Project: Tika > Issue Type: Wish > Components: parser >Affects Versions: 1.17 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 2.0 > > > An [existing PR|https://github.com/apache/tika/pull/212/files] suggests to > upgrade the netcdf4-java dependency, however it does not address the issue. > This PR will add the correct Maven repository configuration and then make the > upgrade(s). > https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/BuildDependencies.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11
Lewis John McGibbney created TIKA-2565: -- Summary: Upgrade edu.ucar dependencies to 4.6.11 Key: TIKA-2565 URL: https://issues.apache.org/jira/browse/TIKA-2565 Project: Tika Issue Type: Wish Components: parser Affects Versions: 1.17 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 2.0 An [existing PR|https://github.com/apache/tika/pull/212/files] suggests to upgrade the netcdf4-java dependency, however it does not address the issue. This PR will add the correct Maven repository configuration and then make the upgrade(s). https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/BuildDependencies.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351224#comment-16351224 ] NW Brad commented on TIKA-2562: --- I was doing some research on this today, and this may not be a function of Tika. I think it is probably the SAXTransformerFactory (javax.xml.transform) that is making the change. At least, I could not find any code in Tika that did it directly. But anything I ran through the SAXTransformerFactory converted the HTML I provided to use void (empty) elements and self-closing start tags, as shown below: http://www.google.com";> *becomes* http://www.google.com*"/>* and *becomes* . From an XML standpoint the converted syntax is correct, but the anchor tag, while correct in XML, does not appear to work correctly as HTML in the current versions of both Chrome and Firefox. So, converting HTML via Tika in this situation generates bad HTML for the examples I have. I believe the SAXTransformerFactory is also deleting the div that is around the "empty" anchor tag, since a div around nothing may not be considered relevant. At least that is what I speculate... > tika server parse HTML removes DIVs around hyperlink & adds shape > - > > Key: TIKA-2562 > URL: https://issues.apache.org/jira/browse/TIKA-2562 > Project: Tika > Issue Type: Bug > Components: gui, parser, server > Affects Versions: 1.17 > Reporter: NW Brad > Priority: Major > Attachments: tika_adds_shape_to_hyperlink.html > > > Hyperlinks in an HTML document that are parsed via tika server: > curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html > [http://localhost:9998/tika] --header "Accept: text/html" > sent: > > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > received back: > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > Divs are gone and a shape has been added
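The self-closing behaviour speculated about in the comment above can be observed with the JDK's own javax.xml.transform serializer, independent of Tika. The following is a minimal sketch, not Tika's actual code path: the class name `EmptyAnchorSerialization` and the sample markup are made up for illustration, and it only covers the empty-element rewriting, not the disappearing div. It runs an identity transform twice, once with the `xml` output method and once with the `html` output method.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; it is not part of Tika.
public class EmptyAnchorSerialization {

    /** Runs an identity transform and serializes with the given output method ("xml" or "html"). */
    public static String serialize(String markup, String method) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, method);
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(markup)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: an empty anchor wrapped in a div.
        String input = "<div><a href=\"http://www.google.com\"></a></div>";
        System.out.println("xml : " + serialize(input, "xml"));
        System.out.println("html: " + serialize(input, "html"));
    }
}
```

With the serializer bundled in the JDK, the `xml` method should collapse the empty anchor into an empty-element tag, while the `html` method should keep the explicit end tag. If the same holds inside Tika's handler chain, choosing the output method would be one place to look, but that is an assumption rather than something established in this thread.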
[jira] [Created] (TIKA-2564) Tika client cannot extract files from embedded archive formats
Marc Prud'hommeaux created TIKA-2564:

Summary: Tika client cannot extract files from embedded archive formats
Key: TIKA-2564
URL: https://issues.apache.org/jira/browse/TIKA-2564
Project: Tika
Issue Type: Bug
Environment: Mac OS 10.13.3 (17D47)
    17:42 ext$ java -version
    java version "9.0.1"
    Java(TM) SE Runtime Environment (build 9.0.1+11)
    Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)
    17:42 ext$ uname -a
    Darwin bix.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64
Reporter: Marc Prud'hommeaux

This may be related to TIKA-2395. When trying to extract the files from tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz

    % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

I see the exception:

    Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at coursier.cli.qR.a(Unknown Source)
        at coursier.cli.qQ.j(Unknown Source)
        at coursier.cli.qW.a(Unknown Source)
        at d.h.a.c(Unknown Source)
        at b.b.c_(Unknown Source)
        at d.b.d.E.g(Unknown Source)
        at d.b.e.aW.g(Unknown Source)
        at d.b.f.b.aa.a(Unknown Source)
        at coursier.cli.qQ.b(Unknown Source)
        at coursier.cli.Q.b(Unknown Source)
        at b.J.c_(Unknown Source)
        at d.F.h(Unknown Source)
        at b.F.a(Unknown Source)
        at coursier.cli.Coursier.main(Unknown Source)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at coursier.Bootstrap.main(Bootstrap.java:428)
    Caused by: java.io.IOException: mark/reset not supported
        at java.base/java.io.InputStream.reset(InputStream.java:474)
        at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
        at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
        at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 28 more

However, I can browse the document fine using:

    % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz

This issue affects: test-documents.rar, test-documents.tar.Z, test-documents.tbz2, and test-documents.tgz

But it does not affect test-documents.7z, test-documents.cab, test-documents.ddf, test-documents.dmg, test-documents.tar, or test-documents.zip

This makes me suspect that it has something to do with extracting files from packages that are embedded in other archive parsers.
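The root cause in the trace above, `java.io.IOException: mark/reset not supported`, is what the base `java.io.InputStream.reset()` throws when a stream does not override mark/reset. A minimal sketch of that failure mode using only the JDK follows. The class `MarkResetDemo` and its `NoMarkStream` are hypothetical stand-ins for a decompressing stream, and wrapping in a `BufferedInputStream` is shown as one common remedy, not as the fix the Tika project chose.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; it is not part of Tika.
public class MarkResetDemo {

    /** A stream that inherits InputStream's defaults: markSupported() is false, reset() throws. */
    static final class NoMarkStream extends InputStream {
        private final InputStream in;

        NoMarkStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }
    }

    /** Peeks at up to n bytes the way a magic-byte detector might, then tries to rewind. */
    public static String peekThenReset(InputStream stream, int n) {
        try {
            stream.mark(n);
            for (int i = 0; i < n && stream.read() != -1; i++) {
                // consume the header bytes
            }
            stream.reset();
            return "reset ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    /** Detection directly on the unbuffered stream: reset() fails. */
    public static String withoutBuffering() {
        return peekThenReset(new NoMarkStream(new byte[16]), 8);
    }

    /** Same stream wrapped in a BufferedInputStream, which does support mark/reset. */
    public static String withBuffering() {
        return peekThenReset(new BufferedInputStream(new NoMarkStream(new byte[16])), 8);
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + withoutBuffering());
        System.out.println("buffered  : " + withBuffering());
    }
}
```

A detector that needs to rewind can also check `markSupported()` first and wrap only when necessary.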
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350822#comment-16350822 ] Tim Allison commented on TIKA-2562: --- I'll take a look. This will require some digging. > tika server parse HTML removes DIVs around hyperlink & adds shape > - > > Key: TIKA-2562 > URL: https://issues.apache.org/jira/browse/TIKA-2562 > Project: Tika > Issue Type: Bug > Components: gui, parser, server > Affects Versions: 1.17 > Reporter: NW Brad > Priority: Major > Attachments: tika_adds_shape_to_hyperlink.html > > > Hyperlinks in an HTML document that are parsed via tika server: > curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html > [http://localhost:9998/tika] --header "Accept: text/html" > sent: > > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > received back: > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > Divs are gone and a shape has been added
[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634 ] NW Brad edited comment on TIKA-2562 at 2/2/18 4:51 PM: --- Thanks. I checked it out and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup added the shape. However, it appears that the removal is coming from tika. Tagsoup parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] Tika parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] The div is gone... I also noted another problem with parsing that is coming from Tika and not tagsoup when dealing with hidden anchors/hyperlinks: original: http://www.google.com";> Tagsoup:results http://www.google.com*";>* Tika results: http://www.google.com*"/>* Tika seems to alter anchor by removing the end-tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it is causing a problem with Chrome and Firefox and the anchor style bleeds into content immediately following the anchor. Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs. Thanks. was (Author: nwbrad): Thanks. I checked it out and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup is definitely the shape. However, it appears that the removal is coming from tika. Tagsoup parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] Tika parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] The div is gone... 
I also noted another problem with parsing that is coming from Tika and not tagsoup when dealing with hidden anchors/hyperlinks: original: http://www.google.com";> Tagsoup:results http://www.google.com*";>* Tika results: http://www.google.com*"/>* Tika seems to alter anchor by removing the end-tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it is causing a problem with Chrome and Firefox and the anchor style bleeds into content immediately following the anchor. Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs. Thanks. > tika server parse HTML removes DIVs around hyperlink & adds shape > - > > Key: TIKA-2562 > URL: https://issues.apache.org/jira/browse/TIKA-2562 > Project: Tika > Issue Type: Bug > Components: gui, parser, server >Affects Versions: 1.17 >Reporter: NW Brad >Priority: Major > Attachments: tika_adds_shape_to_hyperlink.html > > > Hyperlinks in a HTML document that are parsed via tika server: > curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html > [http://localhost:9998/tika] --header "Accept: text/html" > sent: > > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > received back: > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > Divs are are gone and a shape has been added > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634 ] NW Brad edited comment on TIKA-2562 at 2/2/18 4:50 PM: --- Thanks. I checked it out and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup is definitely the shape. However, it appears that the removal is coming from tika. Tagsoup parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] Tika parse results: http://www.google.com";>[http://www.google.com|http://www.google.com/] The div is gone... I also noted another problem with parsing that is coming from Tika and not tagsoup when dealing with hidden anchors/hyperlinks: original: http://www.google.com";> Tagsoup:results http://www.google.com*";>* Tika results: http://www.google.com*"/>* Tika seems to alter anchor by removing the end-tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it is causing a problem with Chrome and Firefox and the anchor style bleeds into content immediately following the anchor. Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs. Thanks. was (Author: nwbrad): Thanks. I check it out, it and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup is definitely the shape. However, it appears that the removal is coming from tika. Tagsoup parse results: http://www.google.com";>http://www.google.com Tika parse results: http://www.google.com";>http://www.google.com The div is gone... 
I also noted another problem with parsing that is coming from Tika and not tagsoup when dealing with hidden anchors/hyperlinks: original: http://www.google.com";> Tagsoup:results http://www.google.com*";>* Tika results: http://www.google.com*"/>* Tika seems to alter anchor by removing the end-tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it is causing a problem with Chrome and Firefox and the anchor style bleeds into content immediately following the anchor. Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs. Thanks. > tika server parse HTML removes DIVs around hyperlink & adds shape > - > > Key: TIKA-2562 > URL: https://issues.apache.org/jira/browse/TIKA-2562 > Project: Tika > Issue Type: Bug > Components: gui, parser, server >Affects Versions: 1.17 >Reporter: NW Brad >Priority: Major > Attachments: tika_adds_shape_to_hyperlink.html > > > Hyperlinks in a HTML document that are parsed via tika server: > curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html > [http://localhost:9998/tika] --header "Accept: text/html" > sent: > > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > received back: > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > Divs are are gone and a shape has been added > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634 ] NW Brad commented on TIKA-2562: --- Thanks. I check it out, it and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup is definitely the shape. However, it appears that the removal is coming from tika. Tagsoup parse results: http://www.google.com";>http://www.google.com Tika parse results: http://www.google.com";>http://www.google.com The div is gone... I also noted another problem with parsing that is coming from Tika and not tagsoup when dealing with hidden anchors/hyperlinks: original: http://www.google.com";> Tagsoup:results http://www.google.com*";>* Tika results: http://www.google.com*"/>* Tika seems to alter anchor by removing the end-tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it is causing a problem with Chrome and Firefox and the anchor style bleeds into content immediately following the anchor. Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs. Thanks. 
> tika server parse HTML removes DIVs around hyperlink & adds shape > - > > Key: TIKA-2562 > URL: https://issues.apache.org/jira/browse/TIKA-2562 > Project: Tika > Issue Type: Bug > Components: gui, parser, server > Affects Versions: 1.17 > Reporter: NW Brad > Priority: Major > Attachments: tika_adds_shape_to_hyperlink.html > > > Hyperlinks in an HTML document that are parsed via tika server: > curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html > [http://localhost:9998/tika] --header "Accept: text/html" > sent: > > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > received back: > href="http://www.google.com";>[http://www.google.com|http://www.google.com/] > > Divs are gone and a shape has been added
[jira] [Updated] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Description: Files (esp images) and other objects can be embedded in html/css/javascript with the {{data: uri scheme}}. We should extract those like any other embedded file. (was: Files (esp images) can be base64 encoded in HTML files. We should extract those like any other embedded file.) > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) and other objects can be embedded in html/css/javascript > with the {{data: uri scheme}}. We should extract those like any other > embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Description: Files (esp images) and other objects can be embedded in html/css/javascript with the [data: uri scheme|https://en.wikipedia.org/wiki/Data_URI_scheme]. We should extract those like any other embedded file. (was: Files (esp images) and other objects can be embedded in html/css/javascript with the {{data: uri scheme}}. We should extract those like any other embedded file.) > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) and other objects can be embedded in html/css/javascript > with the [data: uri scheme|https://en.wikipedia.org/wiki/Data_URI_scheme]. > We should extract those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
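To make the data: URI idea concrete, here is a minimal sketch of pulling the media type and bytes out of a base64-encoded data: URI with the JDK's `java.util.Base64`. It is not the extraction code proposed for Tika: the class `DataUriDemo` and its `Embedded` holder are hypothetical names, and percent-encoded (non-base64) payloads are deliberately left out to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; it is not part of Tika.
public class DataUriDemo {

    /** Declared media type plus decoded payload of one data: URI. */
    public static final class Embedded {
        public final String mediaType;
        public final byte[] bytes;

        Embedded(String mediaType, byte[] bytes) {
            this.mediaType = mediaType;
            this.bytes = bytes;
        }
    }

    /** Decodes "data:[<mediatype>];base64,<payload>"; other forms are rejected in this sketch. */
    public static Embedded decode(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' separator");
        }
        String header = uri.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new UnsupportedOperationException("only base64 payloads are handled here");
        }
        String mediaType = header.substring(0, header.length() - ";base64".length());
        if (mediaType.isEmpty()) {
            mediaType = "text/plain;charset=US-ASCII"; // default media type per RFC 2397
        }
        byte[] bytes = Base64.getDecoder().decode(uri.substring(comma + 1));
        return new Embedded(mediaType, bytes);
    }

    public static void main(String[] args) {
        Embedded e = decode("data:text/plain;base64,SGVsbG8=");
        System.out.println(e.mediaType + " -> " + new String(e.bytes, StandardCharsets.UTF_8));
    }
}
```

In an HTML parser, the decoded bytes and media type would then be handed to whatever mechanism already handles embedded documents. How that hand-off should look in Tika is left open here.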
[jira] [Commented] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350614#comment-16350614 ] Markus Jelsma commented on TIKA-2563: - Ah, thanks :) > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350606#comment-16350606 ] Tim Allison commented on TIKA-2563: --- Right. Sorry. I meant the {{testHTML_embedded_img.html}}, NOT the file you shared. Thank you, again! > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350604#comment-16350604 ] Markus Jelsma commented on TIKA-2563: - I am not sure that "ASL 2.0 friendly" would apply. I took it some time ago from a live page of a Dutch non-profit association, for test purposes. > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement > Reporter: Tim Allison > Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file.
[jira] [Commented] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350593#comment-16350593 ] Tim Allison commented on TIKA-2563: --- ASL 2.0-friendly example file, based on the example file kindly supplied by [~markus17]. > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement > Reporter: Tim Allison > Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file.
[jira] [Updated] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Attachment: testHTML_embedded_img.html > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html, testHTML_embedded_img.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350556#comment-16350556 ]

Tim Allison commented on TIKA-1599:
---
> Tim, if attached file is what you are looking for, i've got about 80 specimens that came up when grepping for base64.

W00t! Thank you! That one should do...and, duh, grep for base64. Thank you!

> On topic, our parser on top of Tika relies on a custom ContentHandler implementation. We (my company) would not be too happy if we would have to rewrite the whole thing. Same goes for Apache Nutch.

Oh...that's good to know...so I guess we're back to the option of supporting both Tagsoup and JSoup with users specifying via tika-config.xml which parser to use?

> Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser > Affects Versions: 1.7, 1.8 > Reporter: Ken Krugler > Assignee: Ken Krugler > Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency.
[jira] [Commented] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350555#comment-16350555 ] Tim Allison commented on TIKA-2563: --- Attached example file supplied by [~markus17] on TIKA-1599. Thank you! > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2563) Extract embedded files in HTML
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Attachment: consumentenbond.html > Extract embedded files in HTML > -- > > Key: TIKA-2563 > URL: https://issues.apache.org/jira/browse/TIKA-2563 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: consumentenbond.html > > > Files (esp images) can be base64 encoded in HTML files. We should extract > those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350550#comment-16350550 ] Tim Allison commented on TIKA-2490: --- Y, sorry. We could change this behavior back to ignore missing dependencies...but I think [~pascal.essiembre] and [~mcaruanagalizia] had good arguments for why we should include warnings. > Turn off stderr warnings in Tika-app > > > Key: TIKA-2490 > URL: https://issues.apache.org/jira/browse/TIKA-2490 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 1.16 >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 1.17 > > Attachments: NUTCH-2439-1.17.patch > > > Let's get rid of the stderr messages in tika-app and confirm that users can > turn off warnings via tika-config.xml -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2563) Extract embedded files in HTML
Tim Allison created TIKA-2563: - Summary: Extract embedded files in HTML Key: TIKA-2563 URL: https://issues.apache.org/jira/browse/TIKA-2563 Project: Tika Issue Type: Improvement Reporter: Tim Allison Files (esp images) can be base64 encoded in HTML files. We should extract those like any other embedded file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350545#comment-16350545 ] Markus Jelsma commented on TIKA-1599: - On topic, our parser on top of Tika relies on a custom ContentHandler implementation. We (my company) would not be too happy if we would have to rewrite the whole thing. Same goes for Apache Nutch. > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
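For readers unfamiliar with why a custom ContentHandler ties downstream code to the SAX event stream rather than to a particular HTML library, here is a minimal sketch. It is not Markus's or Nutch's handler: the class `LinkCollectingHandler` is hypothetical, and the JDK's XML parser stands in for whichever HTML parser would produce the events. Any parser that emits the same `startElement` calls could drive it unchanged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler; it is not part of Tika or Nutch.
public class LinkCollectingHandler extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace processing localName is empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> links() {
        return links;
    }

    /** Feeds well-formed XHTML through the JDK SAX parser and returns the collected hrefs. */
    public static List<String> collect(String xhtml) {
        try {
            LinkCollectingHandler handler = new LinkCollectingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.links();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div><a href=\"http://www.google.com\">link</a></div></body></html>";
        System.out.println(collect(page));
    }
}
```

The handler only sees events, so swapping the parser behind it is invisible as long as the replacement still produces SAX events. A switch to a parser with a different native model would need an adapter to keep handlers like this working.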
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350543#comment-16350543 ] Andrei Rebegea commented on TIKA-2490: -- OK, thanks for the answer. So, to the question "Is this still supposed to happen?", the answer is: yes, unless you specify a tika-config.xml with this property set: > Turn off stderr warnings in Tika-app > > > Key: TIKA-2490 > URL: https://issues.apache.org/jira/browse/TIKA-2490 > Project: Tika > Issue Type: Bug > Components: app > Affects Versions: 1.16 > Reporter: Tim Allison > Assignee: Tim Allison > Priority: Trivial > Fix For: 1.17 > > Attachments: NUTCH-2439-1.17.patch > > > Let's get rid of the stderr messages in tika-app and confirm that users can > turn off warnings via tika-config.xml
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350541#comment-16350541 ] Markus Jelsma commented on TIKA-1599: - Tim, if attached file is what you are looking for, i've got about 80 specimens that came up when grepping for base64. > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1599: Attachment: consumentenbond.html > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350538#comment-16350538 ]

Tim Allison commented on TIKA-2490:
-----------------------------------

Whoa, welcome to modernity. :)

>I am assuming that just by importing the libs, and using them(without special configuration), we should not get these warnings.

Unfortunately, no, those warnings are supposed to be evident unless you turn them off...see TIKA-2232.

From what I can tell from your links, you're using the default TikaConfig. If you can specify an actual tika-config.xml file, that should help.

> Turn off stderr warnings in Tika-app
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.16
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.17
> Attachments: NUTCH-2439-1.17.patch
>
> Let's get rid of the stderr messages in tika-app and confirm that users can
> turn off warnings via tika-config.xml

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350529#comment-16350529 ] Tim Allison edited comment on TIKA-1599 at 2/2/18 3:39 PM: --- >DOM could lead to higher memory usage Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so we aren't going to see major problems in that corpus. I added [~markus17] 's attached files to our regression corpus, and I've kicked off a fresh full run of Tika 1.17 against the corpus. I've updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll kick off the jsoup fork against the html files. Unrelated topic: does anyone have a shareable example of an html file with a base64 (or other) embedded file inside of an html file? I don't think we're currently handling these, and it would be nice to do that. was (Author: talli...@mitre.org): >DOM could lead to higher memory usage Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so we aren't going to see major problems in that corpus. I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll kick off the jsoup fork against the html files. Unrelated topic: does anyone have a shareable example of an html file with a base64 (or other) embedded file inside of an html file? I don't think we're currently handling these, and it would be nice to do that. 
> Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
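For anyone unsure what kind of file is being asked for here: an HTML page can carry an embedded file as a base64 `data:` URI, typically in an `img` `src` attribute. The following is a minimal sketch of what such content looks like and how the payload could be pulled out and decoded. It uses only the JDK, is not Tika's or jsoup's handling, and the class name, regex, and sample markup are all made up for illustration. The regex is deliberately simple and would not cover every legal `data:` URI.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataUriExample {
    // Hypothetical, simplified pattern: data:<type>/<subtype>;base64,<payload>
    private static final Pattern DATA_URI =
            Pattern.compile("data:([\\w.+-]+/[\\w.+-]+);base64,([A-Za-z0-9+/=]+)");

    /** Returns the decoded bytes of every base64 data URI found in the markup. */
    public static List<byte[]> extractEmbedded(String html) {
        List<byte[]> out = new ArrayList<>();
        Matcher m = DATA_URI.matcher(html);
        while (m.find()) {
            // group(1) is the declared media type, group(2) the base64 payload
            out.add(Base64.getDecoder().decode(m.group(2)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample: the payload is the text "Hello" encoded as base64
        String html = "<html><body>"
                + "<img src=\"data:text/plain;base64,SGVsbG8=\"/>"
                + "</body></html>";
        for (byte[] bytes : extractEmbedded(html)) {
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
    }
}
```

A real parser would hand the decoded bytes on as an embedded document rather than print them, and would need to cope with malformed payloads, which `Base64.getDecoder()` rejects with an `IllegalArgumentException`.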
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350529#comment-16350529 ] Tim Allison commented on TIKA-1599: --- >DOM could lead to higher memory usage Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so we aren't going to see major problems in that corpus. I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll kick off the jsoup fork against the html files. Unrelated topic: does anyone have a shareable example of an html file with a base64 (or other) embedded file inside of an html file? I don't think we're currently handling these, and it would be nice to do that. > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350528#comment-16350528 ]

Andrei Rebegea commented on TIKA-2490:
--------------------------------------

The short answer is that I don't know the full details, sorry. I have not worked on the implementation part, only on the upgrade part.

We just updated our Tika from version 1.6 (yes, 6 not 16) to version 1.17, so we are using Tika just as we did when it was on version 1.6.
I am assuming that just by importing the libs and using them (without special configuration), we should not get these warnings.

We have a TikaConfig object that seems to be shared around: [here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/resources/alfresco/content-services-context.xml#L180]
Then we use it, for example [here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/resources/alfresco/content-services-context.xml#L292]
and we instantiate the new AutoDetectParser, for example [here|https://github.com/Alfresco/alfresco-repository/blob/master/src/main/java/org/alfresco/repo/content/metadata/TikaAutoMetadataExtracter.java#L78]

> Turn off stderr warnings in Tika-app
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.16
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.17
> Attachments: NUTCH-2439-1.17.patch
>
> Let's get rid of the stderr messages in tika-app and confirm that users can
> turn off warnings via tika-config.xml

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350504#comment-16350504 ] Luis Filipe Nassif commented on TIKA-1599: -- Hi [~talli...@mitre.org], Moving to DOM could lead to higher memory usage and maybe bring memory problems like those we had experienced with the Office DOM parsers. But given all the problems of TagSoup, I think it is worth doing a new evaluation to see if we can get more content and the lost metadata back (from your previous test). > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350420#comment-16350420 ] Tim Allison commented on TIKA-2490: --- No, this isn't supposed to happen if you use the example {{tika-config.xml}} above. How are you calling Tika? > Turn off stderr warnings in Tika-app > > > Key: TIKA-2490 > URL: https://issues.apache.org/jira/browse/TIKA-2490 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 1.16 >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 1.17 > > Attachments: NUTCH-2439-1.17.patch > > > Let's get rid of the stderr messages in tika-app and confirm that users can > turn off warnings via tika-config.xml -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2490: -- Fix Version/s: 1.17 > Turn off stderr warnings in Tika-app > > > Key: TIKA-2490 > URL: https://issues.apache.org/jira/browse/TIKA-2490 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 1.16 >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 1.17 > > Attachments: NUTCH-2439-1.17.patch > > > Let's get rid of the stderr messages in tika-app and confirm that users can > turn off warnings via tika-config.xml -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350418#comment-16350418 ]

Andrei Rebegea commented on TIKA-2490:
--------------------------------------

Hello,
I am using Tika version 1.17 and still getting these warnings at startup.
{code}
Feb 02, 2018 11:09:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
Feb 02, 2018 11:09:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
{code}
I don't see a fix version on this task, so I don't know if it made it into 1.17. My question:
*Is this still supposed to happen?*

> Turn off stderr warnings in Tika-app
>
> Key: TIKA-2490
> URL: https://issues.apache.org/jira/browse/TIKA-2490
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.16
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Attachments: NUTCH-2439-1.17.patch
>
> Let's get rid of the stderr messages in tika-app and confirm that users can
> turn off warnings via tika-config.xml

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
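The timestamp/class/level layout of the warnings in the message above looks like java.util.logging output. Separately from the tika-config.xml route named in the issue description, one possible stopgap is to raise the threshold on the relevant logger. This is a minimal sketch under two assumptions that this thread does not confirm: that the messages reach java.util.logging in your deployment, and that the logger names sit under `org.apache.tika`. The child logger name in `main` is hypothetical and only demonstrates level inheritance.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietTikaWarnings {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost.
    // The logger name "org.apache.tika" is an assumption.
    private static final Logger TIKA_LOGGER = Logger.getLogger("org.apache.tika");

    /** Drops WARNING and lower for loggers below org.apache.tika that set no level of their own. */
    public static void silenceWarnings() {
        TIKA_LOGGER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceWarnings();
        // Hypothetical child logger; it inherits the parent's effective level.
        Logger child = Logger.getLogger("org.apache.tika.parser.Example");
        System.out.println("WARNING loggable: " + child.isLoggable(Level.WARNING));
        System.out.println("SEVERE loggable:  " + child.isLoggable(Level.SEVERE));
    }
}
```

This only hides the messages; the optional dependencies they mention are still missing, so the affected file types are still skipped.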
[jira] [Commented] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350406#comment-16350406 ] Hudson commented on TIKA-2561: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1429 (See [https://builds.apache.org/job/Tika-trunk/1429/]) TIKA-2561 -- update jsoup version in grib parser to avoid xss vuln (tallison: [https://github.com/apache/tika/commit/c80241952fa2f515687c6479768d24d7e907653c]) * (edit) tika-parsers/pom.xml > Tika Parser includes oudated/vulnerable version of JSoup > > > Key: TIKA-2561 > URL: https://issues.apache.org/jira/browse/TIKA-2561 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Asela >Priority: Major > Fix For: 2.0, 1.18 > > > org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2. > > JSoup versions older than 1.8.3 have a vulnerability in parsing. > > https://nvd.nist.gov/vuln/detail/CVE-2015-6748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350335#comment-16350335 ]

Tim Allison edited comment on TIKA-1599 at 2/2/18 1:42 PM:
-----------------------------------------------------------

What say we do a fresh eval on our current corpus and then do a clean cut over to JSoup for Tika 2.0 if the results are promising?

Big question: are we willing to move to DOM for HTML? SAX is not yet available in JSoup (https://github.com/jhy/jsoup/issues/824).

was (Author: talli...@mitre.org):
What say we do a fresh eval on our current corpus and then do a clean cut over to JSoup for Tika 2.0 if the results are promising?

> Switch from TagSoup to JSoup
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7, 1.8
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, tagsoup_vs_jsoup_reports.zip
>
> There are several Tika issues related to how TagSoup cleans up HTML
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be
> under active development.
> On the other hand I know of several projects that are now using
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350335#comment-16350335 ] Tim Allison commented on TIKA-1599: --- What say we do a fresh eval on our current corpus and then do a clean cut over to JSoup for Tika 2.0 if the results are promising? > Switch from TagSoup to JSoup > > > Key: TIKA-1599 > URL: https://issues.apache.org/jira/browse/TIKA-1599 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.7, 1.8 >Reporter: Ken Krugler >Assignee: Ken Krugler >Priority: Minor > Attachments: TIKA-1599-crazy-files.tar.gz, > tagsoup_vs_jsoup_reports.zip > > > There are several Tika issues related to how TagSoup cleans up HTML > ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be > under active development. > On the other hand I know of several projects that are now using > [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only > one main contributor) under the MIT license. > I haven't looked into how hard it would be to switch this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350330#comment-16350330 ]

Tim Allison commented on TIKA-2562:
-----------------------------------

This is a "feature" of TagSoup; see, e.g., [https://groups.google.com/forum/#!topic/tagsoup-friends/EfB6i12xBLw]

I'm hesitant to fix this in Tika because we should probably migrate to jsoup, which is actively supported (TIKA-1599).

> tika server parse HTML removes DIVs around hyperlink & adds shape
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
> Issue Type: Bug
> Components: gui, parser, server
> Affects Versions: 1.17
> Reporter: NW Brad
> Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
> Hyperlinks in an HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
>
> href="http://www.google.com";>[http://www.google.com|http://www.google.com/]
>
> received back:
> href="http://www.google.com";>[http://www.google.com|http://www.google.com/]
>
> Divs are gone and a shape has been added

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2561. --- Resolution: Fixed Fix Version/s: 1.18 2.0 > Tika Parser includes oudated/vulnerable version of JSoup > > > Key: TIKA-2561 > URL: https://issues.apache.org/jira/browse/TIKA-2561 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Asela >Priority: Major > Fix For: 2.0, 1.18 > > > org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2. > > JSoup versions older than 1.8.3 have a vulnerability in parsing. > > https://nvd.nist.gov/vuln/detail/CVE-2015-6748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2561) Tika Parser includes oudated/vulnerable version of JSoup
[ https://issues.apache.org/jira/browse/TIKA-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350302#comment-16350302 ] Tim Allison commented on TIKA-2561: --- This is helpful. It boggles my imagination that this could be a problem for the grib parser in our context, but I've had failures of imagination before, and it is better to include deps that don't have known vulns in case another parser winds up pulling it in or in case my imagination fails :). Upgrade made. Thank you! > Tika Parser includes oudated/vulnerable version of JSoup > > > Key: TIKA-2561 > URL: https://issues.apache.org/jira/browse/TIKA-2561 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 >Reporter: Asela >Priority: Major > > org.apache.tika:tika-parsers:1.17 pulls in dependency JSoup 1.7.2. > > JSoup versions older than 1.8.3 have a vulnerability in parsing. > > https://nvd.nist.gov/vuln/detail/CVE-2015-6748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
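The issue description says jsoup versions older than 1.8.3 are affected. A build or audit script that wants to flag such versions needs a numeric comparison, because a plain string comparison would rank 1.11.x below 1.8.3. Here is a minimal sketch, assuming purely numeric dotted versions (qualifiers such as `-SNAPSHOT` are not handled). The class and method names are made up for illustration and are not part of Tika or jsoup.

```java
public class JsoupVersionCheck {
    /** Compares dotted numeric versions component by component; missing parts count as 0. */
    public static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True if the version is older than 1.8.3, the threshold quoted in TIKA-2561. */
    public static boolean isAffected(String jsoupVersion) {
        return compareVersions(jsoupVersion, "1.8.3") < 0;
    }

    public static void main(String[] args) {
        // 1.7.2 is the version the issue reports for tika-parsers 1.17
        System.out.println("1.7.2 affected: " + isAffected("1.7.2"));
        System.out.println("1.8.3 affected: " + isAffected("1.8.3"));
    }
}
```

Running `mvn dependency:tree` against a project shows which jsoup version is actually resolved, which is the value to feed into a check like this.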