[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922
 ] 

Tilman Hausherr commented on TIKA-4245:
---

The file claims to be utf-16 but it isn't. If I change it to utf-8 in the 
editor then I get an NPE in the GUI.
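
To illustrate the mismatch: a minimal sketch, assuming the file's bytes are really plain ASCII/UTF-8 while the decoder is told they are UTF-16 (big-endian). The class and method names are hypothetical and this is not Tika code. It only shows that reading `<html ` two bytes at a time yields the same leading characters as the output quoted in the report.

```java
import java.nio.charset.StandardCharsets;

class Utf16Mojibake {
    // Decode single-byte ASCII text as if every byte pair were one UTF-16BE code unit.
    static String misdecode(String ascii) {
        byte[] raw = ascii.getBytes(StandardCharsets.US_ASCII);
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    // Reverse the mistake: write the code units back out as bytes and read them as ASCII.
    static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // '<' is 0x3C and 'h' is 0x68, so the first pair becomes U+3C68, and so on.
        String garbled = misdecode("<html ");
        System.out.println(garbled);          // prints 㱨瑭氠
        System.out.println(recover(garbled)); // prints <html  (with the trailing space)
    }
}
```

The input needs an even number of bytes for the pairing to line up; an odd trailing byte would be decoded as a replacement character.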

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to extract the content of HTML 
> files, and we found that it does not extract the content of the sample file 
> properly.
> The sample code follows; the tika-config.xml and the sample HTML file are 
> attached.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class ExtractTxtFromHtml {
>     private static final Path inputFile =
>             new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>
>     public static void main(String[] args) {
>         extactText(false);
>         extactText(true);
>     }
>
>     static void extactText(boolean largeFile) {
>         PrintWriter outputFileWriter = null;
>         try {
>             BodyContentHandler handler;
>             Path outputFilePath = null;
>
>             if (largeFile) {
>                 // write tika output to disk
>                 outputFilePath =
>                         Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
>                 outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
>                 handler = new BodyContentHandler(outputFileWriter);
>             } else {
>                 // stream it in memory
>                 handler = new BodyContentHandler(-1);
>             }
>
>             Metadata metadata = new Metadata();
>             FileInputStream inputData = new FileInputStream(inputFile.toFile());
>             TikaConfig config =
>                     new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>
>             String content;
>             if (largeFile) {
>                 content = FileUtils.readFileToString(outputFilePath.toFile());
>             } else {
>                 content = handler.toString();
>             }
>             System.out.println("content = " + content);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         } finally {
>             if (outputFileWriter != null) {
>                 outputFileWriter.close();
>             }
>         }
>     }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908
 ] 

Tilman Hausherr commented on TIKA-4245:
---

Happens also with the tika app GUI.



[jira] [Updated] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4245:
--
Description: 
We use org.apache.tika.parser.AutoDetectParser to extract the content of HTML 
files, and we found that it does not extract the content of the sample file 
properly.

The sample code follows; the tika-config.xml and the sample HTML file are 
attached.  The content extracted with Tika reads 
"㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
from the native file.

 

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.2.   

 {code:java}
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile =
            new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String[] args) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath =
                        Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            FileInputStream inputData = new FileInputStream(inputFile.toFile());
            TikaConfig config =
                    new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);

            String content;
            if (largeFile) {
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
{code}



[jira] [Created] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4245:
---

 Summary: Tika does not get html content properly 
 Key: TIKA-4245
 URL: https://issues.apache.org/jira/browse/TIKA-4245
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang
 Attachments: Sample html file and tika config xml.zip

We use org.apache.tika.parser.AutoDetectParser to extract the content of HTML 
files, and we found that it does not extract the content of the sample file 
properly.

The sample code follows; the tika-config.xml and the sample HTML file are 
attached.  The content extracted with Tika reads 
"㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
from the native file.

 

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.2.   

 

import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile =
            new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String[] args) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath =
                        Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            FileInputStream inputData = new FileInputStream(inputFile.toFile());
            TikaConfig config =
                    new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);

            String content;
            if (largeFile) {
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}





[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840893#comment-17840893
 ] 

Hudson commented on TIKA-4244:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1612 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1612/])
TIKA-4244 -- improve ics detection (#1731) (github: 
[https://github.com/apache/tika/commit/f78dc999be9c0d87a83b54aa6af74fbcf996f22e])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testICalendar_w_prodId.ics
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java


> Tika idenifies MIME type of ics files with html content as text/html
> 
>
> Key: TIKA-4244
> URL: https://issues.apache.org/jira/browse/TIKA-4244
> Project: Tika
>  Issue Type: Bug
>Reporter: Kartik Jain
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to 
> determine the MIME type of an ics file, it returns the media type `text/html` 
> when it should return `text/calendar`.
> This affects .ics files that contain HTML content (via the additional attribute 
> X-ALT-DESC;FMTTYPE=text/html). *tika-core* returns the MIME type of such 
> files as text/html. Ideally it would return text/calendar, but according to 
> tika-core, text/html is not in the base types of text/calendar, so it does 
> not consider the text/calendar type. However, the MIME type of all ics files 
> should be text/calendar.





[jira] [Resolved] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4244.
---
Fix Version/s: 3.0.0
   2.9.3
   Resolution: Fixed

Thank you [~boomxlucifer]!



[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840860#comment-17840860
 ] 

ASF GitHub Bot commented on TIKA-4244:
--

tballison merged PR #1731:
URL: https://github.com/apache/tika/pull/1731






Re: [PR] TIKA-4244 -- improve ics detection [tika]

2024-04-25 Thread via GitHub


tballison merged PR #1731:
URL: https://github.com/apache/tika/pull/1731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840850#comment-17840850
 ] 

ASF GitHub Bot commented on TIKA-4244:
--

tballison opened a new pull request, #1731:
URL: https://github.com/apache/tika/pull/1731

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
   Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the 
     [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which 
     describes the problem or the improvement. We cannot accept pull requests 
     without an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
       brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or a few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
     *recent* `main` branch. If there are conflicts, please try to rebase the 
     pull request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
     the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
   are met. If you have any questions about how to fix your problem or about 
   using Tika in general, please sign up for the 
   [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
   






[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840852#comment-17840852
 ] 

Tim Allison commented on TIKA-4244:
---

Thank you [~boomxlucifer] for finding this and reporting it. The problem is 
that we were too strict in how close the "VERSION:2.0" had to be to the top of 
the file. I've fixed that in the above PR.
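
The effect of such a position limit can be sketched as follows. This is a simplified illustration, not Tika's actual detection logic: the class `IcsSniffer`, the method `looksLikeICalendar`, the window sizes, and the sample calendar are all assumptions made up for this example. It only shows how a marker that sits after a PRODID line can fall outside a narrow search window.

```java
import java.nio.charset.StandardCharsets;

class IcsSniffer {
    // Hypothetical sample: PRODID precedes VERSION, and the event carries an HTML description.
    static final String SAMPLE =
            "BEGIN:VCALENDAR\r\n"
            + "PRODID:-//Example Corp//Example Calendar//EN\r\n"
            + "VERSION:2.0\r\n"
            + "BEGIN:VEVENT\r\n"
            + "X-ALT-DESC;FMTTYPE=text/html:<html><body>Hi</body></html>\r\n"
            + "END:VEVENT\r\n"
            + "END:VCALENDAR\r\n";

    // Report a calendar only if "VERSION:2.0" appears within the first `window` bytes.
    static boolean looksLikeICalendar(byte[] data, int window) {
        int length = Math.min(data.length, window);
        String head = new String(data, 0, length, StandardCharsets.US_ASCII);
        return head.startsWith("BEGIN:VCALENDAR") && head.contains("VERSION:2.0");
    }

    public static void main(String[] args) {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.US_ASCII);
        // A narrow window ends inside the PRODID line, so the marker is missed.
        System.out.println(looksLikeICalendar(bytes, 30));   // prints false
        // A wider window reaches the VERSION line.
        System.out.println(looksLikeICalendar(bytes, 1024)); // prints true
    }
}
```

When a calendar check like this fails, a detector that also knows HTML patterns may settle on text/html because of the embedded markup, which matches the symptom in the report.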



[PR] TIKA-4244 -- improve ics detection [tika]

2024-04-25 Thread via GitHub


tballison opened a new pull request, #1731:
URL: https://github.com/apache/tika/pull/1731

   
   