[jira] [Commented] (TIKA-1770) AutoDetectParser wrongly detects plain text as images/audio

2022-07-08 Thread Emil Zegers (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564468#comment-17564468
 ] 

Emil Zegers commented on TIKA-1770:
---

Text files are still misdetected with Tika 2.4.1. It took me a while to 
understand what was going on, and then I found this bug report. I'm curious 
what would be needed to fix this. I'm a happy Tika user, but I don't know the 
code base yet.

> AutoDetectParser wrongly detects plain text as images/audio
> ---
>
> Key: TIKA-1770
> URL: https://issues.apache.org/jira/browse/TIKA-1770
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.10
> Environment: OS independent (tested on both Windows and Mac OS)
> Reporter: Ziqi
> Priority: Minor
> Attachments: the-acl-rd-tec_chunk_10228.txt, 
> the-acl-rd-tec_chunk_15.txt, the-acl-rd-tec_chunk_9113.txt
>
>
> AutoDetectParser fails to recognize certain plain-text files as plain text.
> Three test files are attached; as you can see, they are all plain text.
> The following code was used for testing:
> 
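> import java.io.BufferedInputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("path").listFiles()) {
>     // -1 disables the handler's write limit
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     // try-with-resources closes the stream after each file
>     try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); // line A
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }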
> 
> For the three test files, line A prints the following (one line per file):
> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
> X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
> As a result, the variable "content" is always empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-1770) AutoDetectParser wrongly detects plain text as images/audio

2020-08-15 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178184#comment-17178184
 ] 

Peter Lee commented on TIKA-1770:
-

I tested the three attached files with tika-1.24.1. Here are Tika's content-type detection results:

 
||File Name||Content Type||
|the-acl-rd-tec_chunk_15.txt|audio/mpeg|
|the-acl-rd-tec_chunk_9113.txt|image/x-portable-bitmap|
|the-acl-rd-tec_chunk_10228.txt|image/x-portable-bitmap|

 

Reason:

The content of `the-acl-rd-tec_chunk_15.txt` starts with the string "ID3", which 
is the magic byte sequence for audio/mpeg.

The content of `the-acl-rd-tec_chunk_9113.txt` starts with the string "P1", which 
is a magic byte sequence for image/x-portable-bitmap.

The content of `the-acl-rd-tec_chunk_10228.txt` starts with the string "P4", which 
is a magic byte sequence for image/x-portable-bitmap.
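
To illustrate why these files are misdetected, here is a minimal sketch of prefix-based magic matching. It is not Tika's actual detector or rule set: the class `MagicPrefixDemo`, the three-entry rule table, and the sample sentences are all made up for illustration (the real attachment contents are not reproduced here). It only shows that any text which happens to begin with "ID3", "P1", or "P4" satisfies such a rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicPrefixDemo {
    // Hypothetical, simplified rule table: leading bytes -> media type.
    // Real detectors use richer rules; this is an assumption for the demo.
    private static final Map<String, String> PREFIX_RULES = new LinkedHashMap<>();
    static {
        PREFIX_RULES.put("ID3", "audio/mpeg");
        PREFIX_RULES.put("P1", "image/x-portable-bitmap");
        PREFIX_RULES.put("P4", "image/x-portable-bitmap");
    }

    /** Returns the type of the first rule whose prefix matches, else text/plain. */
    static String detect(byte[] data) {
        // Only the first few bytes matter for a prefix rule.
        int len = Math.min(data.length, 8);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> rule : PREFIX_RULES.entrySet()) {
            if (head.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        // Invented sample sentences, not the content of the attached files.
        String[] samples = {
            "ID3 is a decision tree learning algorithm.",
            "P1 and P2 denote the two parsers.",
            "This sentence has no magic prefix."
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(detect(bytes) + " <- " + s);
        }
    }
}
```

With these rules, the first two sample sentences are reported as audio/mpeg and image/x-portable-bitmap even though they are ordinary prose; only the third falls through to text/plain.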

 

After searching for information on these two formats, I couldn't find a way to 
make their magic byte matching rules more specific.

Maybe we should set up a rule: some formats must match on both the file 
extension and the magic bytes.
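
A rough sketch of what such a rule could look like, written as plain Java rather than against Tika's API. Everything here is hypothetical: the class `StrictMagicDemo`, the method `acceptMagicMatch`, and the extension lists are assumptions for illustration, not existing Tika code or configuration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StrictMagicDemo {
    // Hypothetical list of "weak magic" types and the extensions that must
    // accompany a magic match. The entries are assumptions for this sketch.
    private static final Map<String, Set<String>> REQUIRED_EXTENSIONS = Map.of(
        "audio/mpeg", Set.of("mp3"),
        "image/x-portable-bitmap", Set.of("pbm"));

    /**
     * Decides whether a magic-byte match should be trusted.
     * Types without an entry in REQUIRED_EXTENSIONS are always accepted;
     * listed types are accepted only if the file name has a listed extension.
     */
    static boolean acceptMagicMatch(String magicType, String fileName) {
        Set<String> required = REQUIRED_EXTENSIONS.get(magicType);
        if (required == null) {
            return true; // no extra requirement for this type
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension, so the stricter rule cannot be met
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return required.contains(ext);
    }

    public static void main(String[] args) {
        // The magic match for the attached text file would be rejected.
        System.out.println(acceptMagicMatch("audio/mpeg", "the-acl-rd-tec_chunk_15.txt"));
        // A file with a matching extension would still be accepted.
        System.out.println(acceptMagicMatch("audio/mpeg", "song.mp3"));
    }
}
```

One trade-off of this design is that input without a file name (for example a bare stream) could never be detected as one of the listed types, so a rule like this would probably need to be limited to formats whose magic bytes are short enough to collide with ordinary text.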



