[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730315#comment-17730315 ]
Gregory Lepore commented on TIKA-4060:
--------------------------------------

I'm not 100% sure, but I think the offset is incorrect for the ID3 version of the magic. The ID3 header can be anywhere from 0-2045 total bytes after the 494433, so the offset of the FF(F0|F1|F8|F9)(40|41|44|45|48|49|4C|4D|50|51|54|55|58|59|5C|5D|60|61|64|65|68|69|6C|6D|70|71|80|81|84|85|88|89|8C|8D|90|91|94|95|98|99|9C|9D|A0|A1|A4|A5|A8|A9|AC|AD|B0|B1)(00|01|20|40|41|60|80|81|60|A0|C0|C1|E0) values can be anywhere from 3 to 2048 (the three bytes of 494433 plus a gap of 0 to 2045; I think that's right).

I would try sneaking up on it by first matching the offset to the exact values in your test files, and only then worrying about the full range of possible offsets. I often build up my signatures that way.

If the above doesn't work, I can work on figuring out what offset="512:2048" means in a Tika mimetype definition. The PRONOM equivalent is the value in the curly braces; in this case the \{0-2045} means the subsequent values can appear anywhere from 0 to 2045 bytes after the 494433.

Does that make sense?

> Add magic to audio/aac in tika-mimetypes.xml
> --------------------------------------------
>
>                 Key: TIKA-4060
>                 URL: https://issues.apache.org/jira/browse/TIKA-4060
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments:
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef,
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file
> extension. PRONOM recently added support for identifying aac files, but the
> signature is tricky. There are two signatures, shown below in PRONOM format.
> Curly braces mean that the subsequent pattern may start anywhere between the
> two given byte counts after the preceding pattern.
>
> The first pattern is pretty basic; the second pattern is the first pattern
> preceded by an ID3 header of up to 2048 bytes.
>
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
>
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
>
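
For anyone who wants to check test files against the two PRONOM signatures before writing the Tika magic, here is a rough Python sketch. It is not Tika's or DROID's matcher. It only restates the two byte patterns as regular expressions, reading \{0-2045} as "0 to 2045 arbitrary bytes". The names byte_class, adts_offset and looks_like_aac are made up for this illustration.

```python
import re
from typing import Optional

# Allowed values for the third and fourth bytes, copied from the PRONOM
# signature as given (the duplicate 60 in the last group is harmless).
BYTE3 = ("40 41 44 45 48 49 4C 4D 50 51 54 55 58 59 5C 5D 60 61 64 65 68 69 "
         "6C 6D 70 71 80 81 84 85 88 89 8C 8D 90 91 94 95 98 99 9C 9D A0 A1 "
         "A4 A5 A8 A9 AC AD B0 B1")
BYTE4 = "00 01 20 40 41 60 80 81 60 A0 C0 C1 E0"


def byte_class(hex_values: str) -> bytes:
    """Turn 'AA BB' into the regex character class [\\xaa\\xbb]."""
    escapes = b"".join(b"\\x" + h.lower().encode() for h in hex_values.split())
    return b"[" + escapes + b"]"


# FF, then one of F0/F1/F8/F9, then the two variable bytes.
ADTS = rb"\xff[\xf0\xf1\xf8\xf9]" + byte_class(BYTE3) + byte_class(BYTE4)

# sig.1: the ADTS pattern at BOF (anchoring comes from using .match()).
SIG1 = re.compile(ADTS)
# sig.2: "ID3" (494433), a gap of 0 to 2045 arbitrary bytes, then the ADTS pattern.
SIG2 = re.compile(rb"ID3.{0,2045}?" + ADTS, re.DOTALL)


def adts_offset(data: bytes) -> Optional[int]:
    """Return the offset of the 4-byte ADTS pattern, or None if neither signature matches."""
    if SIG1.match(data):
        return 0
    m = SIG2.match(data)
    # The ADTS pattern is the last 4 bytes of the sig.2 match.
    return m.end() - 4 if m else None


def looks_like_aac(data: bytes) -> bool:
    return adts_offset(data) is not None
```

Calling adts_offset on the first few kilobytes of each attached test file should give the exact offset to try first in the Tika match, before widening it to the full range.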