[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922 ]

Tilman Hausherr commented on TIKA-4245:
---------------------------------------

The file claims to be utf-16 but it isn't. If I change it to utf-8 in the editor then I get an NPE in the GUI.

> Tika does not get html content properly
> ---------------------------------------
>
>                 Key: TIKA-4245
>                 URL: https://issues.apache.org/jira/browse/TIKA-4245
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Xiaohong Yang
>            Priority: Major
>         Attachments: Sample html file and tika config xml.zip

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
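A rough way to sanity-check a claim like "the file claims to be utf-16 but it isn't" is to look at the raw bytes: mostly-ASCII markup stored as UTF-16 has a zero byte in roughly every other position, while the same markup stored as UTF-8 has none. The Java sketch below only illustrates that heuristic. The class name `Utf16PlausibilitySketch`, both method names, and the 0.3 threshold are assumptions made for this example; this is not how Tika's encoding detectors work, and the heuristic says little about text that is not mostly ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
class Utf16PlausibilitySketch {

    /** Returns the fraction of bytes in the sample that are 0x00. */
    static double nulRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /**
     * Heuristic: ASCII-heavy text encoded as UTF-16 is about half zero bytes.
     * The 0.3 threshold is an arbitrary assumption chosen for this sketch.
     */
    static boolean plausiblyUtf16(byte[] data) {
        return nulRatio(data) > 0.3;
    }

    public static void main(String[] args) {
        byte[] asUtf8 = "<html>".getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = "<html>".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(plausiblyUtf16(asUtf8));  // false: no zero bytes
        System.out.println(plausiblyUtf16(asUtf16)); // true: every other byte is zero
    }
}
```

Running the same check on the first few kilobytes of the attached sample would be one way to compare its declared charset with its actual bytes.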
[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908 ]

Tilman Hausherr commented on TIKA-4245:
---------------------------------------

Happens also with the tika app GUI.
[jira] [Updated] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4245: -- Description: We use org.apache.tika.parser.AutoDetectParser to get the content of html files. And we found out that it does not get the content fo the sample file properly. Following is the sample code and attached is the tika-config.xml and the sample html file. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2. {code:java} import org.apache.commons.io.FileUtils; import org.apache.tika.config.TikaConfig; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; import org.apache.tika.sax.BodyContentHandler; import java.io.File; import java.io.FileInputStream; import java.io.PrintWriter; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; public class ExtractTxtFromHtml { private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath(); public static void main(String args[]) { extactText(false); extactText(true); } static void extactText(boolean largeFile) { PrintWriter outputFileWriter = null; try { BodyContentHandler handler; Path outputFilePath = null; if (largeFile) { // write tika output to disk outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt"); outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath)); handler = new BodyContentHandler(outputFileWriter); } else { // stream it in memory handler = new BodyContentHandler(-1); } Metadata metadata = new Metadata(); FileInputStream inputData = new FileInputStream(inputFile.toFile()); TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml"); Parser autoDetectParser = new AutoDetectParser(config); ParseContext context = new ParseContext(); context.set(TikaConfig.class, config); autoDetectParser.parse(inputData, handler, metadata, context); String content; if (largeFile) { content = FileUtils.readFileToString(outputFilePath.toFile()); } else { content = handler.toString(); } System.out.println("content = " + content); } catch(Exception ex) { ex.printStackTrace(); } finally { if (outputFileWriter != null) { outputFileWriter.close(); } } } } {code} was: We use org.apache.tika.parser.AutoDetectParser to get the content of html files. And we found out that it does not get the content fo the sample file properly. Following is the sample code and attached is the tika-config.xml and the sample html file. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2. 
import org.apache.commons.io.FileUtils; import org.apache.tika.config.TikaConfig; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; import org.apache.tika.sax.BodyContentHandler; import java.io.File; import java.io.FileInputStream; import java.io.PrintWriter; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; public class ExtractTxtFromHtml { private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath(); public static void main(String args[]) { extactText(false); extactText(true); } static void extactText(boolean largeFile) { PrintWriter outputFileWriter = null; try { BodyContentHandler handler; Path outputFilePath = null; if (largeFile) { // write tika output to disk outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt"); outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath)); handler = new BodyContentHandler(outputFileWriter); } else { // stream it in memory handler = new BodyContentHandler(-1);
[jira] [Created] (TIKA-4245) Tika does not get html content properly
Xiaohong Yang created TIKA-4245:
-----------------------------------

             Summary: Tika does not get html content properly
                 Key: TIKA-4245
                 URL: https://issues.apache.org/jira/browse/TIKA-4245
             Project: Tika
          Issue Type: Bug
            Reporter: Xiaohong Yang
         Attachments: Sample html file and tika config xml.zip


We use org.apache.tika.parser.AutoDetectParser to get the content of html files, and we found that it does not extract the content of the sample file properly.

The sample code follows; the tika-config.xml and the sample html file are attached. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That differs from the content of the native file.

The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2.

import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String[] args) {
        extractText(false);
        extractText(true);
    }

    static void extractText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        // try-with-resources closes the input stream even if parsing fails
        try (FileInputStream inputData = new FileInputStream(inputFile.toFile())) {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);

            String content;
            if (largeFile) {
                // flush buffered output before reading the file back
                outputFileWriter.flush();
                content = FileUtils.readFileToString(outputFilePath.toFile(), StandardCharsets.UTF_8);
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
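The garbled characters quoted in the description appear consistent with ASCII/UTF-8 bytes being decoded as big-endian UTF-16: each pair of bytes becomes one 16-bit code unit, so `<h` (0x3C 0x68) turns into U+3C68. The plain-Java sketch below reproduces the effect. `Utf16Mojibake`, `misdecode` and `recover` are names invented for this illustration, and the input string `"<html "` is an assumption about how the sample file begins, not something read from the attachment.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows how ASCII markup turns into CJK-looking text
// when its bytes are read as UTF-16BE. Not part of Tika.
class Utf16Mojibake {

    /** Encodes the text as UTF-8, then wrongly decodes those bytes as UTF-16BE. */
    static String misdecode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Reverses misdecode for input whose original byte length was even. */
    static String recover(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.UTF_16BE); // no BOM is written
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("<html "); // 6 bytes become 3 UTF-16 code units

        StringBuilder hex = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(Integer.toHexString(c));
        }
        System.out.println(hex);              // prints "3c68 746d 6c20"
        System.out.println(recover(garbled)); // prints "<html "
    }
}
```

Applying `recover` to the garbled output of a real parse is a quick way to confirm whether this kind of misdecoding happened, provided no bytes were lost along the way.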
[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840893#comment-17840893 ]

Hudson commented on TIKA-4244:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1612 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1612/])
TIKA-4244 -- improve ics detection (#1731) (github: [https://github.com/apache/tika/commit/f78dc999be9c0d87a83b54aa6af74fbcf996f22e])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testICalendar_w_prodId.ics
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java

> Tika idenifies MIME type of ics files with html content as text/html
> --------------------------------------------------------------------
>
>                 Key: TIKA-4244
>                 URL: https://issues.apache.org/jira/browse/TIKA-4244
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Kartik Jain
>            Priority: Major
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to
> determine the MIME type of an ics file, it returns the media type `text/html`,
> whereas it should return `text/calendar`.
> For .ics files that contain HTML content (through the additional attribute
> X-ALT-DESC;FMTTYPE=text/html), *tika-core* returns the MIME type text/html.
> Ideally it should return text/calendar, but according to tika-core, text/html
> is not among the base types of text/calendar, so it does not consider the
> text/calendar type. The MIME type of all ics files should be text/calendar.
[jira] [Resolved] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-4244.
-------------------------------
    Fix Version/s: 3.0.0
                   2.9.3
       Resolution: Fixed

Thank you [~boomxlucifer]!
[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840860#comment-17840860 ]

ASF GitHub Bot commented on TIKA-4244:
--------------------------------------

tballison merged PR #1731:
URL: https://github.com/apache/tika/pull/1731
Re: [PR] TIKA-4244 -- improve ics detection [tika]
tballison merged PR #1731:
URL: https://github.com/apache/tika/pull/1731

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840850#comment-17840850 ]

ASF GitHub Bot commented on TIKA-4244:
--------------------------------------

tballison opened a new pull request, #1731:
URL: https://github.com/apache/tika/pull/1731

Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated!

Before opening the pull request, please verify that
* there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
* the issue ID (`TIKA-`)
  - is referenced in the title of the pull request
  - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`)
* commits are squashed into a single one (or a few commits for larger changes)
* Tika is successfully built and unit tests pass by running `mvn clean test`
* there are no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch
* if you add a new module that downstream users will depend upon, add it to the relevant group in `tika-bom/pom.xml`.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840852#comment-17840852 ]

Tim Allison commented on TIKA-4244:
-----------------------------------

Thank you [~boomxlucifer] for finding this and reporting it. The problem is that we were too strict in how close the "VERSION:2.0" had to be to the top of the file. I've fixed that in the above PR.
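The comment above attributes the misdetection to a match that required "VERSION:2.0" too close to the top of the file. The plain-Java sketch below illustrates that idea only. `IcsMagicSketch`, `looksLikeIcs`, `sample`, the offsets 30 and 1024, and the calendar text itself are all hypothetical, chosen for this example; Tika's real rule is defined in tika-mimetypes.xml and may differ in every detail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of an offset-limited magic check; not Tika's code.
class IcsMagicSketch {

    /**
     * Returns true if the data starts with BEGIN:VCALENDAR and the marker
     * VERSION:2.0 begins at or before maxOffset.
     */
    static boolean looksLikeIcs(byte[] data, int maxOffset) {
        String head = new String(data, StandardCharsets.ISO_8859_1);
        if (!head.startsWith("BEGIN:VCALENDAR")) {
            return false;
        }
        int idx = head.indexOf("VERSION:2.0");
        return idx >= 0 && idx <= maxOffset;
    }

    /** Made-up calendar in which a PRODID line precedes VERSION and an event carries HTML. */
    static byte[] sample() {
        String ics = "BEGIN:VCALENDAR\r\n"
                + "PRODID:-//Example Corp//Hypothetical Calendar 1.0//EN\r\n"
                + "VERSION:2.0\r\n"
                + "BEGIN:VEVENT\r\n"
                + "X-ALT-DESC;FMTTYPE=text/html:<html><body>Meeting</body></html>\r\n"
                + "END:VEVENT\r\n"
                + "END:VCALENDAR\r\n";
        return ics.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // VERSION:2.0 starts at offset 72 here, because the PRODID line pushes it down.
        System.out.println(looksLikeIcs(sample(), 30));   // false: window too narrow
        System.out.println(looksLikeIcs(sample(), 1024)); // true: wider window matches
    }
}
```

With the narrow window the calendar check fails, which in a real detector would leave the embedded `<html>` markup free to win; widening the window lets the calendar match succeed.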
[PR] TIKA-4244 -- improve ics detection [tika]
tballison opened a new pull request, #1731:
URL: https://github.com/apache/tika/pull/1731