[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221
 ] 

Tim Allison commented on TIKA-4245:
---

Oops, sorry. I didn't realize you had sent your tika-config.xml. Yes, one option 
is to turn off the HtmlEncodingDetector.

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files, and we found that it does not extract the content of the sample file 
> properly.
> Following is the sample code; attached are the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
>     private static final Path inputFile =
>             new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
>     public static void main(String[] args) {
>         extractText(false);
>         extractText(true);
>     }
>  
>     static void extractText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>  
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath =
>                         Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // keep the output in memory; -1 disables the write limit
>                 handler = new BodyContentHandler(-1);
>             }
>  
>             Metadata metadata = new Metadata();
>             TikaConfig config =
>                     new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             // close the input stream as soon as parsing finishes
>             try (FileInputStream inputData = new FileInputStream(inputFile.toFile())) {
>                 autoDetectParser.parse(inputData, handler, metadata, context);
>             }
>  
>             String content;
>             if (largeFile) {
>                 // flush buffered output before reading the file back
>                 outputFileWriter.flush();
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841220#comment-17841220
 ] 

Tim Allison commented on TIKA-4245:
---

This is an ongoing area for improvement in Tika.

The algorithm is to pick the first non-null charset. The default charset 
detectors, in order, are: html tags, Mozilla's UniversalCharDet, ICU4j (if 
memory serves). So, Tika is configured to trust the charset declared in the 
html if one exists. If you want to turn off this behavior and go with purely 
statistical detection, you can configure UniversalCharDet and then ICU4j.
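
To make the "first non-null charset" rule concrete, here is a minimal, 
self-contained sketch. It does not use Tika's classes: Detector, 
metaTagDetector, statisticalDetector and firstNonNull are hypothetical names, 
and the "statistical" detector is only a crude stand-in (it checks whether the 
bytes are valid UTF-8), not UniversalCharDet or ICU4j.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstNonNullCharset {

    /** Hypothetical detector interface: returns null when it has no opinion. */
    interface Detector {
        Charset detect(byte[] input);
    }

    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)", Pattern.CASE_INSENSITIVE);

    /** Trusts whatever charset the markup declares, without checking the bytes. */
    static Charset metaTagDetector(byte[] input) {
        // ISO-8859-1 maps every byte to a char, so this decode never fails.
        String head = new String(input, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    /** Crude stand-in for a statistical detector: UTF-8 if the bytes decode strictly. */
    static Charset statisticalDetector(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Returns the answer of the first detector that has one, or null if none does. */
    static Charset firstNonNull(List<Detector> detectors, byte[] input) {
        for (Detector d : detectors) {
            Charset cs = d.detect(input);
            if (cs != null) {
                return cs;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // UTF-8 bytes whose meta tag claims utf-16, similar to the reported file.
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-16\"></head><body>hello</body></html>")
                .getBytes(StandardCharsets.UTF_8);

        List<Detector> tagFirst = List.of(
                FirstNonNullCharset::metaTagDetector,
                FirstNonNullCharset::statisticalDetector);
        List<Detector> statisticalOnly = List.of(FirstNonNullCharset::statisticalDetector);

        System.out.println(firstNonNull(tagFirst, html));        // the declared charset wins
        System.out.println(firstNonNull(statisticalOnly, html)); // the bytes decide
    }
}
```

With the tag detector first, the wrong declaration is accepted and the later 
detectors are never consulted; with it removed, the bytes decide.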

The solution that has been in the back of my mind for a long time now is a 
charset detector that runs the three detectors and then extracts text from an 
initial chunk of the document using each candidate charset. It then picks the 
charset with the lowest out-of-vocabulary statistic. This is not yet 
implemented.
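
As a rough illustration of that idea only (nothing like this exists in Tika 
yet), the sketch below decodes an initial chunk with each candidate charset and 
keeps the one whose tokens fall outside a vocabulary least often. Everything in 
it is an assumption made for the example: the class and method names, the toy 
vocabulary, the tokenization on non-letters, and the hard-coded candidate list 
standing in for the output of the three detectors.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LowestOovCharset {

    /** Toy vocabulary; a real one would be far larger and language-aware. */
    private static final Set<String> VOCABULARY = Set.of(
            "html", "head", "body", "meta", "content", "type", "text", "hello", "world");

    /** Fraction of tokens in the decoded text that are not in the vocabulary. */
    static double oovRatio(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        long total = Arrays.stream(tokens).filter(t -> !t.isEmpty()).count();
        if (total == 0) {
            return 1.0; // no tokens at all: treat as fully out of vocabulary
        }
        long unknown = Arrays.stream(tokens)
                .filter(t -> !t.isEmpty() && !VOCABULARY.contains(t))
                .count();
        return (double) unknown / total;
    }

    /** Decodes an initial chunk with each candidate and keeps the lowest OOV ratio. */
    static Charset pick(byte[] input, List<Charset> candidates, int chunkSize) {
        int length = Math.min(input.length, chunkSize);
        Charset best = null;
        double bestRatio = Double.MAX_VALUE;
        for (Charset candidate : candidates) {
            String decoded = new String(input, 0, length, candidate);
            double ratio = oovRatio(decoded);
            if (ratio < bestRatio) { // ties keep the earlier candidate
                bestRatio = ratio;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta content=\"text/html\"></head>"
                + "<body>hello world</body></html>").getBytes(StandardCharsets.UTF_8);
        // Candidates as the individual detectors might propose them.
        List<Charset> candidates = List.of(StandardCharsets.UTF_16, StandardCharsets.UTF_8);
        System.out.println(pick(html, candidates, 8192));
    }
}
```

Decoded as UTF-16, the sample bytes produce no token that the vocabulary knows, 
so the UTF-8 reading wins even though UTF-16 is listed first.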

If you want to turn off the html tag detector, I can send a link.



[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841209#comment-17841209
 ] 

Xiaohong Yang commented on TIKA-4245:
-

[~tilman]  Can you detect the right charset (utf-8) and fix the issue?



[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922
 ] 

Tilman Hausherr commented on TIKA-4245:
---

The file claims to be utf-16 but it isn't. If I change it to utf-8 in the 
editor then I get an NPE in the GUI.
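
A short sketch of why a wrong utf-16 label produces output like that in the 
report: when single-byte markup is decoded as UTF-16, each pair of bytes is 
fused into one 16-bit code unit. The class and method names are made up for 
this example, and big-endian order is assumed because there is no byte order 
mark.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Mojibake {

    /** Decodes bytes as big-endian UTF-16, as a reader trusting a "utf-16" label might. */
    static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The start of an ASCII-compatible html file: six single-byte characters.
        byte[] actualBytes = "<html ".getBytes(StandardCharsets.UTF_8);

        String wrong = decodeAsUtf16(actualBytes);
        String right = new String(actualBytes, StandardCharsets.UTF_8);

        // '<' (0x3C) and 'h' (0x68) become U+3C68, 't' and 'm' become U+746D,
        // 'l' and ' ' become U+6C20: three CJK code points instead of six letters.
        for (int i = 0; i < wrong.length(); i++) {
            System.out.printf("U+%04X ", (int) wrong.charAt(i));
        }
        System.out.println();
        System.out.println(right);
    }
}
```

Those three code points look like the way the garbled content quoted in the 
issue begins, which fits a file that is really single-byte text carrying a 
utf-16 declaration.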



[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908
 ] 

Tilman Hausherr commented on TIKA-4245:
---

This also happens with the Tika app GUI.
