[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-21 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479981#comment-17479981
 ] 

Nick Burch commented on TIKA-3656:
--

How are you calling Tika? And do you have the office parsers on your classpath 
along with all their dependencies?

> Tika returns wrong content type for docx types.
> ---
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
> Reporter: Ajesh
> Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file, say example.docx
>  # Rename the DOCX file to have a PDF extension, say example.pdf
>  # Use Tika to detect the content type of example.pdf
>  # Tika returns application/zip instead of 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-23 Thread Ajesh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480825#comment-17480825
 ] 

Ajesh commented on TIKA-3656:
-

I am calling it from a Java application, as below.
{code:java}
String contentType = new Tika().detect(new File("myFile.doc"));
{code}
I have only the dependency below in my POM file.



[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480955#comment-17480955
 ] 

Nick Burch commented on TIKA-3656:
--

That POM is your problem: you aren't including any of the container-aware 
dependencies, which come with the parsers.

Try adding a dependency such as tika-parsers-standard or 
tika-parser-microsoft-module.
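
For reference, a Maven fragment along these lines should pull in the parsers. This is only a sketch, not the reporter's actual POM: the artifact ID tika-parsers-standard-package is assumed to be the published name for the tika-parsers-standard module, and the version is assumed to match the 2.2.0 listed under Affects Versions.

```xml
<!-- Assumed coordinates; check them against the Tika release you use. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.2.0</version>
</dependency>
```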



[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Ajesh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481007#comment-17481007
 ] 

Ajesh commented on TIKA-3656:
-

Sure, will do.



[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Ajesh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481533#comment-17481533
 ] 

Ajesh commented on TIKA-3656:
-

[~nick] But it does detect the right content type,
{code:java}
application/vnd.openxmlformats-officedocument.wordprocessingml.document{code}
 for a file whose content type and file extension are both docx.



[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Ajesh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481539#comment-17481539
 ] 

Ajesh commented on TIKA-3656:
-

Let me clear the air by adding a few more logs from the application to give a 
better idea.

*Scenario - 1*

Sample.docx (Content-type - DOCX, extension - DOCX)
{code:java}
Document name original - sample.docx
Content type - 
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Content type detected, ready to convert to pdf - 
application/vnd.openxmlformats-officedocument.wordprocessingml.document {code}
*Scenario - 2*

Sample.pdf (Content-type - DOCX, extension - PDF)
{code:java}
Document name original - sample.pdf
Content type - application/zip
10:04:58.139 [http-nio-8080-exec-6] ERROR 
com.hiringsteps.ats.applicant.facade.impl.ApplicantFacade - Error :
org.apache.xmlbeans.impl.piccolo.io.FileFormatException: Unsupported file type 
- [ application/zip ] {code}
Here we expect the content type to be 
{code:java}
application/vnd.openxmlformats-officedocument.wordprocessingml.document {code}
That is, if someone has wrongly renamed the file extension, we should still be 
able to detect the right type by reading the file content.
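
The expectation above, detecting from content rather than from the file name, can be illustrated with the JDK alone. The sketch below is not Tika's implementation; the class ContainerSniffSketch and its methods are hypothetical names made up for this example. It shows why a check on the leading bytes alone can only say "zip" for a DOCX file, and how looking at the entries inside the container refines that answer. It assumes that a DOCX package can be recognised by the presence of [Content_Types].xml and word/document.xml, which is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not part of Tika.
class ContainerSniffSketch {
    static final String OCTET = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    /** Detects a type from the bytes only; no file name is consulted. */
    static String detect(byte[] data) {
        // Step 1: magic bytes. A zip local file header starts with "PK" 0x03 0x04.
        boolean zipMagic = data.length >= 4
            && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
        if (!zipMagic) {
            return OCTET;
        }
        // Step 2: container-aware check. Without this step the answer stays "zip".
        boolean hasContentTypes = false;
        boolean hasWordDocument = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (entry.getName().equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                }
                if (entry.getName().equals("word/document.xml")) {
                    hasWordDocument = true;
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return (hasContentTypes && hasWordDocument) ? DOCX : ZIP;
    }

    /** Builds an in-memory zip with the given entry names and dummy content. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a DOCX that was renamed to .pdf: the name plays no part here.
        System.out.println(detect(zipOf("[Content_Types].xml", "word/document.xml")));
        System.out.println(detect(zipOf("notes.txt")));
    }
}
```

In this sketch, a result of application/zip for a real DOCX would correspond to step 2 being missing, which matches the advice earlier in the thread that the container-aware pieces have to be on the classpath.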
