[ 
https://issues.apache.org/jira/browse/TIKA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719930#comment-17719930
 ] 

Tim Allison commented on TIKA-3971:
-----------------------------------

We hit this issue while running the regression tests in prep for the release of 
2.8.0-rc1.  These files are, or contain, postscript-based illustrator files:

bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-689926-0.ai
commoncrawl3/6L/6LXM2WG4XJVRMCUKYYXKC4AZ3JXG7NDW
commoncrawl3/EU/EUGUZCM4BEBH76GYTB3POOHX6SRGXV2B
commoncrawl3/FF/FF6I5VHENM5PJL6PZ3I2JPDFYKLP3YJJ
commoncrawl3/GW/GWSKJC222GMMABIWB3CDOXZTNBXOTAPM
commoncrawl3/NU/NUYOXS4S73FAJRG7UMQYZBRADY7RN4OV
commoncrawl3/OD/ODOHEVU5EUOUZMIC73Q5PLSTTK65UIV7
commoncrawl3/ON/ONQEOEWJ37EC77EUUWZJZTKZDD7EUULH
commoncrawl3/QV/QVL2U5OLQM4XAMWYEONEUEEGFBGIKJUZ

 

The issue is that now that we have {{application/postscript}} as a subtype of 
pdf, if the filename ends with *.ai but the byte detection is postscript, *.ai 
is no longer a subtype of postscript, so the filename hint is not applied and 
the file type is identified as postscript instead of illustrator.
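
To make the interaction concrete, here is a minimal sketch of the rule as I 
read it: a filename-based type only overrides the byte-based type when it is a 
specialization of it.  This is a simplified model, not Tika's actual code.  The 
class GlobVsMagicSketch and its methods are hypothetical, and the type name 
application/illustrator and the two hierarchies are assumptions made for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a mime type hierarchy (not Tika's API).
class GlobVsMagicSketch {
    // child type -> parent type
    private final Map<String, String> parents = new HashMap<>();

    public void subclassOf(String child, String parent) {
        parents.put(child, parent);
    }

    // True if 'candidate' has 'ancestor' somewhere above it in the hierarchy.
    public boolean isSpecializationOf(String candidate, String ancestor) {
        String t = parents.get(candidate);
        while (t != null) {
            if (t.equals(ancestor)) {
                return true;
            }
            t = parents.get(t);
        }
        return false;
    }

    // Prefer the filename (glob) type only when it refines the byte (magic) type.
    public String resolve(String magicType, String globType) {
        if (globType != null && isSpecializationOf(globType, magicType)) {
            return globType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        // Assumed old hierarchy: illustrator is a subtype of postscript.
        GlobVsMagicSketch before = new GlobVsMagicSketch();
        before.subclassOf("application/illustrator", "application/postscript");

        // Assumed new hierarchy: illustrator is a subtype of pdf instead.
        GlobVsMagicSketch after = new GlobVsMagicSketch();
        after.subclassOf("application/illustrator", "application/pdf");

        // A postscript-based *.ai file: magic says postscript, glob says illustrator.
        System.out.println(before.resolve("application/postscript", "application/illustrator"));
        System.out.println(after.resolve("application/postscript", "application/illustrator"));
    }
}
```

Under these assumptions the first call keeps the illustrator type and the 
second falls back to postscript, which matches what we saw on the files above.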

> Distinguish eps-based Adobe Illustrator files from pdf-based Illustrator files
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3971
>                 URL: https://issues.apache.org/jira/browse/TIKA-3971
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> On TIKA-2689, we plan to add detection for Illustrator files that are based 
> on/wrapped in PDF files at parse time.  Illustrator files used to be eps or 
> just ps.  We should figure out how we want to distinguish between these two 
> or three formats.
> TIKA-2689 has some great resource links to help with this.
> Pronom has a bunch of ids for "Illustrator", summarized: 
> http://justsolve.archiveteam.org/wiki/Adobe_Illustrator_Artwork
> One example: 
> https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1350
> See also: https://bugs.ghostscript.com/show_bug.cgi?id=689926



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
