[
https://issues.apache.org/jira/browse/TIKA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719930#comment-17719930
]
Tim Allison commented on TIKA-3971:
-----------------------------------
We hit this issue while running the regression tests in prep for the release of
2.8.0-rc1. These files are, or contain, PostScript-based Illustrator files:
bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-689926-0.ai
commoncrawl3/6L/6LXM2WG4XJVRMCUKYYXKC4AZ3JXG7NDW
commoncrawl3/EU/EUGUZCM4BEBH76GYTB3POOHX6SRGXV2B
commoncrawl3/FF/FF6I5VHENM5PJL6PZ3I2JPDFYKLP3YJJ
commoncrawl3/GW/GWSKJC222GMMABIWB3CDOXZTNBXOTAPM
commoncrawl3/NU/NUYOXS4S73FAJRG7UMQYZBRADY7RN4OV
commoncrawl3/OD/ODOHEVU5EUOUZMIC73Q5PLSTTK65UIV7
commoncrawl3/ON/ONQEOEWJ37EC77EUUWZJZTKZDD7EUULH
commoncrawl3/QV/QVL2U5OLQM4XAMWYEONEUEEGFBGIKJUZ
The issue is that now that we have {{application/illustrator}} as a subtype of
pdf, *.ai is no longer a subtype of postscript. So if the filename ends with
*.ai but the byte detection is postscript, the file type is identified as
postscript instead of illustrator.
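
To make the failure mode concrete, here is a minimal sketch of the kind of rule described above: the filename-based type only wins when it is a specialization of the type found by byte detection. This is a simplified model, not Tika's actual code. The function names (is_specialization_of, resolve), the parent maps, and the assumption that the Illustrator type is {{application/illustrator}} are all illustrative.

```python
# Simplified, hypothetical model of combining byte (magic) detection with a
# filename-glob hint. This is NOT Tika's implementation.

PDF = "application/pdf"
PS = "application/postscript"
AI = "application/illustrator"  # assumed name for the *.ai glob type


def is_specialization_of(child, ancestor, parents):
    """Return True if `ancestor` appears in the parent chain of `child`."""
    current = parents.get(child)
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


def resolve(magic_type, glob_type, parents):
    """Prefer the glob type only when it refines the magic-detected type."""
    if glob_type == magic_type or is_specialization_of(glob_type, magic_type, parents):
        return glob_type
    return magic_type


# Before the change: illustrator is a child of postscript.
old_parents = {AI: PS}
# After the change: illustrator is a child of pdf.
new_parents = {AI: PDF}

# A *.ai file whose bytes look like postscript.
old_result = resolve(PS, AI, old_parents)  # glob refines magic, so illustrator
new_result = resolve(PS, AI, new_parents)  # glob no longer refines magic, so postscript
```

Under this model, a PDF-based *.ai file still resolves to illustrator with the new hierarchy, while the PostScript-based ones listed above fall back to postscript.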
> Distinguish eps-based Adobe Illustrator files from pdf-based Illustrator files
> ------------------------------------------------------------------------------
>
> Key: TIKA-3971
> URL: https://issues.apache.org/jira/browse/TIKA-3971
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> On TIKA-2689, we plan to add detection for Illustrator files that are based
> on/wrapped in PDF files at parse time. Illustrator files used to be eps or
> just ps. We should figure out how we want to distinguish between these two
> or three formats.
> TIKA-2689 has some great resource links to help with this.
> Pronom has a bunch of ids for "Illustrator", summarized:
> http://justsolve.archiveteam.org/wiki/Adobe_Illustrator_Artwork
> One example:
> https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1350
> See also: https://bugs.ghostscript.com/show_bug.cgi?id=689926