[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731249#comment-17731249 ]

Hudson commented on TIKA-4060:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values (nick: [https://github.com/apache/tika/commit/500900d67ede02e87440caa9f67501d3fe59b770])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac

> Add magic to audio/aac in tika-mimetypes.xml
> --------------------------------------------
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
> Issue Type: Sub-task
> Reporter: Gregory Lepore
> Priority: Minor
> Fix For: 2.8.1
>
> Attachments:
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef,
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file
> extension. PRONOM recently added support for identifying aac files, but the
> signature is tricky. There are two signatures, given below in PRONOM format.
> In that format, curly braces give the range of bytes to look ahead, between
> the two values, for the subsequent pattern.
>
> The first pattern is pretty basic; the second pattern is the first pattern
> preceded by an ID3 header of up to 2048 bytes.
>
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |

--
This message was sent by Atlassian Jira (v8.20.10#820010)
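To make the two PRONOM signatures quoted in the issue concrete, here is a minimal Python sketch that transcribes them into byte-level regular expressions. It is not Tika's or PRONOM's own matching code. The names `byte_class`, `ADTS_HEADER`, `SIG1`, `SIG2` and `looks_like_adts` are made up for illustration, and reading \{0-2045} as "0 to 2045 arbitrary bytes" is an assumption based on the explanation given elsewhere in this thread.

```python
import re

# Alternatives for bytes 2-4, copied from the PRONOM "Value" rows.
BYTE2 = "F0 F1 F8 F9"
BYTE3 = ("40 41 44 45 48 49 4C 4D 50 51 54 55 58 59 5C 5D 60 61 64 65 "
         "68 69 6C 6D 70 71 80 81 84 85 88 89 8C 8D 90 91 94 95 98 99 "
         "9C 9D A0 A1 A4 A5 A8 A9 AC AD B0 B1")
BYTE4 = "00 01 20 40 41 60 80 81 60 A0 C0 C1 E0"


def byte_class(hex_values: str) -> bytes:
    """Build a regex character class matching any one of the given bytes."""
    # Each byte is written as an explicit \xNN escape, never as literal text.
    escaped = b"".join(b"\\x%02x" % b for b in bytes.fromhex(hex_values))
    return b"[" + escaped + b"]"


# The 4-byte pattern shared by both signatures: FF, then one byte per group.
ADTS_HEADER = rb"\xff" + byte_class(BYTE2) + byte_class(BYTE3) + byte_class(BYTE4)

# sig.1: the pattern sits at the very beginning of the file.
SIG1 = re.compile(ADTS_HEADER)

# sig.2 (assumed reading): "ID3" at BOF, a gap of 0-2045 bytes, then the pattern.
SIG2 = re.compile(rb"ID3.{0,2045}?" + ADTS_HEADER, re.DOTALL)


def looks_like_adts(data: bytes) -> bool:
    """Return True if either signature matches at the beginning of data."""
    # re.match anchors at offset 0, mirroring "Absolute from BOF".
    return bool(SIG1.match(data) or SIG2.match(data))
```

With this transcription, a buffer starting with `ff f1 50 80` satisfies sig.1, and the same four bytes placed up to 2045 bytes after a leading `ID3` satisfy sig.2. The short pattern is loose, so false positives on unrelated binary data remain possible.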
[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730728#comment-17730728 ]

Nick Burch commented on TIKA-4060:
----------------------------------

I'm a muppet... had forgotten to escape the hex characters in the regexp when transposing into a Tika mime magic match! Now fixed and applied.

Thanks for helping us find this magic
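The kind of escaping slip mentioned in this comment is easy to reproduce in any regex engine. The Python sketch below is only an analogy: the exact Tika mime magic syntax and the precise mistake on the branch are not shown in this thread. It contrasts a pattern whose hex digits are left as literal text with one where they are escaped as byte values.

```python
import re

# First two bytes of an ADTS frame, like the "ff f1" pair in the thread's hex dump.
frame_start = bytes([0xFF, 0xF1])

# Unescaped: the engine reads "FF" as the two ASCII letters 'F' and 'F'.
unescaped = re.compile(rb"FF(F0|F1|F8|F9)")

# Escaped: \xff, \xf0, ... each denote a single byte value.
escaped = re.compile(rb"\xff(\xf0|\xf1|\xf8|\xf9)")

print(unescaped.match(frame_start))              # None
print(unescaped.match(b"FFF1") is not None)      # True
print(escaped.match(frame_start) is not None)    # True
```

The unescaped pattern only matches a file that literally begins with the text `FFF1`, which is why a real AAC file would go undetected.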
[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730649#comment-17730649 ]

Nick Burch commented on TIKA-4060:
----------------------------------

0x494433 is the string ID3, which I think ought to be at the start. It is in the handful of files I've found. The rest of the magic is pretty vague and a little prone to false positives, so I'm reluctant to match on the string "ID3" anywhere in the first 2kb and then the vague 3 bytes somewhere else further on.

I've tried to make the matches a little "tighter" to hopefully reduce false positives, just seem to have gone too tight - the test file I produced with ID3 tags does have the ID3 at the start. The key sections of the hex dump are:

{{ 49 44 33 03 00 00 00 00 09 6b 54 50 45 31 00 00 |ID3..kTPE1..|}}
{{0010 00 0c 00 00 00 54 65 73 74 20 41 72 74 69 73 74 |.Test Artist|}}
{{...}}
{{0090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||}}
{{*}}
{{04f0 00 00 00 00 00 ff f1 50 80 32 5f fc de 02 00 4c |...P.2_L|}}
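As a cross-check on the hex dump in this comment, the ID3v2 tag header itself says where the audio data should begin. After the 3-byte `ID3` marker come two version bytes, one flags byte and a 4-byte "synchsafe" size (7 significant bits per byte) that excludes the 10-byte header. The sketch below uses a hypothetical helper name, `id3v2_tag_length`, and decodes the first ten bytes of the dump. It ignores the optional ID3v2.4 footer, which is an assumption that should hold for this v2.3 sample.

```python
def id3v2_tag_length(header: bytes) -> int:
    """Return the total ID3v2 tag length (10-byte header + body) in bytes."""
    if len(header) < 10 or header[:3] != b"ID3":
        raise ValueError("not an ID3v2 header")
    size = 0
    for b in header[6:10]:
        # Synchsafe integers keep the top bit of every byte clear.
        if b & 0x80:
            raise ValueError("size bytes must have the top bit clear")
        size = (size << 7) | b
    return 10 + size


# First ten bytes of the dump: "ID3", version 03 00, flags 00, size 00 00 09 6b.
dump_start = bytes.fromhex("49 44 33 03 00 00 00 00 09 6b")
print(hex(id3v2_tag_length(dump_start)))  # 0x4f5
```

0x4f5 is the position of the `ff f1` pair in the `04f0` row of the dump, so in this file the ADTS frame starts right after the tag, 1266 bytes past the `ID3` marker and inside the \{0-2045} window.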
[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730315#comment-17730315 ]

Gregory Lepore commented on TIKA-4060:
--------------------------------------

I'm not 100% sure, but I think the offset is incorrect for the ID3 version of the magic. The ID3 header can be anywhere from 0-2045 total bytes, after the 494433, so the offset of the FF(F0|F1|F8|F9)(40|41|44|45|48|49|4C|4D|50|51|54|55|58|59|5C|5D|60|61|64|65|68|69|6C|6D|70|71|80|81|84|85|88|89|8C|8D|90|91|94|95|98|99|9C|9D|A0|A1|A4|A5|A8|A9|AC|AD|B0|B1)(00|01|20|40|41|60|80|81|60|A0|C0|C1|E0) values can be anywhere from 3 to 2048 (I think that's right).

I would try sneaking up on it by matching the offset to the exact values in your test files and then worrying about the full range of possible offsets. I often build up my signatures that way.

If the above doesn't work, I can work on figuring out the Tika mimetype meaning of offset="512:2048". The PRONOM equivalent is the value in the curly braces; in this case, \{0-2045} means the subsequent values can appear anywhere from 0 to 2045 bytes after the 494433.

Does that make sense?
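The offset arithmetic in this comment can be written out explicitly. This is a sketch only: `tika_offset_range` is a hypothetical helper, and it assumes that a Tika `offset="start:end"` attribute gives the first and last byte positions at which the match may begin, a point the thread itself leaves open.

```python
def tika_offset_range(prefix: bytes, min_gap: int, max_gap: int) -> str:
    """Convert a PRONOM {min-max} gap after a BOF prefix to "start:end"."""
    # The gap is counted from the end of the prefix, so the prefix length
    # is added to both ends of the range.
    start = len(prefix) + min_gap
    end = len(prefix) + max_gap
    return f"{start}:{end}"


# 494433 is the 3-byte "ID3" marker; {0-2045} is the gap from the signature.
print(tika_offset_range(bytes.fromhex("494433"), 0, 2045))  # 3:2048
```

The "sneak up on it" approach suggested in the comment corresponds to calling the helper with `min_gap == max_gap`, which yields a single fixed position for one known test file before widening to the full range.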
[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730304#comment-17730304 ]

Nick Burch commented on TIKA-4060:
----------------------------------

I have created some small test AAC files using ffmpeg, and then had a go at adding the mime magic for the two cases. However, detection of the ID3 header case isn't working. Can anyone spot what I've done wrong?

https://github.com/apache/tika/tree/TIKA-4060