[ 
https://issues.apache.org/jira/browse/TIKA-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472053#comment-17472053
 ] 

Tim Allison edited comment on TIKA-3634 at 1/10/22, 2:29 PM:
-------------------------------------------------------------

Thank you for submitting the bug and sharing triggering files.

A couple of items unrelated to the problem:
 * AppleSingleFileParser does not handle iWork files.  That parser is for a 
completely unrelated file format: 
[https://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats]
 * You shouldn't need to add: tika-parser-zip-commons,tika-parser-apple-module. 
 These should be included in tika-parsers-standard-package.  If they're not, 
that's a serious problem.  Please open a different ticket.

I regret I'm still not clear on what we need to fix.

With Tika 1.28, I get {{application/vnd.apple.unknown.13}} for the *.numbers 
file and *.pages file; I get {{application/vnd.apple.keynote.13}} for the .key 
file.  No attachments or text are extracted from any of those.

 

With Tika 2.2.1, I get {{application/vnd.apple.unknown.13}} for all three 
(*.pages, *.key, *.numbers files), but then the PackageParser parses all 
embedded files that Tika supports.

 

What is the desired behavior?

As you've pointed out, we don't have a parser for these formats, and it would 
be non-trivial. :(

 

My guess is that you want the same detection as 1.28, but with the parsing of 
all component files?  We could probably fall back to the file name to 
distinguish between pages and numbers.
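
To make the file-name fallback concrete, here is a rough, standalone sketch of the idea. It is not Tika code: the class name {{IWorkNameFallback}} and the method {{refine}} are hypothetical, and the {{pages.13}} and {{numbers.13}} type strings are assumed by analogy with {{application/vnd.apple.keynote.13}}. It only refines a result that content detection has already reported as {{application/vnd.apple.unknown.13}}.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: refines an "unknown" iWork 13
// detection result by looking at the file extension.
final class IWorkNameFallback {

    static final String UNKNOWN_13 = "application/vnd.apple.unknown.13";

    // Assumed type names, modelled on application/vnd.apple.keynote.13.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pages", "application/vnd.apple.pages.13",
            "numbers", "application/vnd.apple.numbers.13");

    // Returns the detected type unchanged unless it is the generic
    // "unknown.13" type and the file name carries a known extension.
    static String refine(String detected, String fileName) {
        if (!UNKNOWN_13.equals(detected) || fileName == null) {
            return detected;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return detected; // no extension to work with
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, detected);
    }

    public static void main(String[] args) {
        System.out.println(refine(UNKNOWN_13, "brochure.pages"));
        System.out.println(refine(UNKNOWN_13, "mortgagecalculator.numbers"));
    }
}
```

The content-based result always wins when it is specific; the file name is consulted only for the generic type, so a misnamed file cannot override a confident detection.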


> Failed to Parser Apple related files
> ------------------------------------
>
>                 Key: TIKA-3634
>                 URL: https://issues.apache.org/jira/browse/TIKA-3634
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.2.1
>            Reporter: Tika User
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: brochure.pages, keynotecreated.key, 
> mortgagecalculator.numbers
>
>
> Unable to parse '.Number', '.key', '.pages' file using below class in xml 
> file(org.apache.tika.parser.apple.AppleSingleFileParser)
> Getting unknown mimetype : application/vnd.apple.unknown.13
> Using all these modules :
> tika-core,tika-parsers-standard-package,tika-parser-microsoft-module,tika-parser-sqlite3-package,tika-parser-scientific-module,tika-parser-zip-commons,tika-parser-apple-module



