Tika fails to parse some MP3 tags correctly and produces null characters in 
value
---------------------------------------------------------------------------------

                 Key: TIKA-887
                 URL: https://issues.apache.org/jira/browse/TIKA-887
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
            Reporter: Jens Hübel
            Priority: Minor


I have a problem when extracting the comment tag from an MP3 file. The 
extracted value contains an invalid prefix, then a '\0' character, and then the 
real value of the tag. This happens with files downloaded from 
www.jamendo.com, for example this one:
http://storage.newjamendo.com/download/track/450545/mp32/Swansong.mp3

It may be that the tags are not created properly on this site, but at least 
tools like mp3tag display them correctly.

The extracted value looks like this: eng http://www.jamendo.com 
Attribution-Noncommercial-Share Alike 3.0

At position 3 there is a null character. The tag value should start with http...
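
For reference, an ID3v2 COMM frame body consists of a one-byte text encoding, 
a three-letter language code, a short description terminated by a null, and 
then the comment text. That layout would explain the value above: "eng" is the 
language code and the null character is the terminator of an empty description. 
The following is only a minimal sketch in Python, not Tika's code; the names 
parse_comm and ENCODINGS are hypothetical, and it assumes the frame body has 
already been separated from its 10-byte frame header.

```python
# Sketch only: split an ID3v2 COMM frame body into its parts.
# Encoding byte -> (Python codec, terminator) as defined by ID3v2.4.
ENCODINGS = {
    0: ("latin-1", b"\x00"),
    1: ("utf-16", b"\x00\x00"),      # UTF-16 with BOM
    2: ("utf-16-be", b"\x00\x00"),   # UTF-16BE without BOM
    3: ("utf-8", b"\x00"),
}


def parse_comm(body: bytes):
    """Return (language, description, text) for a COMM frame body."""
    codec, terminator = ENCODINGS[body[0]]
    language = body[1:4].decode("ascii")
    rest = body[4:]

    # Search for the description terminator, stepping by the terminator
    # width so a UTF-16 terminator is only matched on a character boundary.
    step = len(terminator)
    idx = 0
    while idx + step <= len(rest) and rest[idx:idx + step] != terminator:
        idx += step

    # If no terminator is found, everything ends up in the description
    # and the text is empty; a real parser may want to handle that case.
    description = rest[:idx].decode(codec)
    text = rest[idx + step:].decode(codec).rstrip("\x00")
    return language, description, text


if __name__ == "__main__":
    # Body of the second COMM frame in this report (encoding 3 = UTF-8).
    body = b"\x03eng\x00http://www.jamendo.com Attribution 3.0 "
    print(parse_comm(body))
```

With this split, the comment text starts at "http" and the language code and 
empty description are returned separately instead of being glued to the value.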

Here is the byte sequence at the beginning of this file:
49 44 33 04 00 00 00 01 18 32 54 49 54 32 00 00 
00 09 00 00 03 53 77 61 6E 73 6F 6E 67 54 50 45 
31 00 00 00 0E 00 00 03 4A 6F 73 68 20 57 6F 6F 
64 77 61 72 64 54 41 4C 42 00 00 00 0C 00 00 03 
42 72 65 61 64 63 72 75 6D 62 73 54 44 52 4C 00 
00 00 05 00 00 03 32 30 30 39 43 4F 4D 4D 00 00 
00 22 00 00 03 65 6E 67 49 44 33 20 76 31 20 43 
6F 6D 6D 65 6E 74 00 41 74 74 72 69 62 75 74 69 
6F 6E 20 33 2E 30 54 43 4F 4E 00 00 00 06 00 00 
03 28 32 35 35 29 54 50 55 42 00 00 00 08 00 00 
03 4A 61 6D 65 6E 64 6F 43 4F 4D 4D 00 00 00 2C 
00 00 03 65 6E 67 00 68 74 74 70 3A 2F 2F 77 77 
77 2E 6A 61 6D 65 6E 64 6F 2E 63 6F 6D 20 41 74 
74 72 69 62 75 74 69 6F 6E 20 33 2E 30 20 54 43 
4F 50 00 00 01 1F 00 00 03 32 30 30 39 2D 31 30 
2D 32 31 54 31 31 3A 31 31 3A 32 30 2B 30 31 3A 
30 30 20 4A 6F 73 68 20 57 6F 6F 64 77 61 72 64 
2E 20 4C 69 63 65 6E 73 65 64 20 74 6F 20 74 68

ID3......2TIT2.......SwansongTPE1.......Josh WoodwardTALB.......Breadcrumbs
TDRL.......2009COMM..."...engID3 v1 Comment.Attribution 3.0TCON.......(255)
TPUB.......JamendoCOMM...,...eng.http://www.jamendo.com Attribution 3.0 
TCOP.......2009-10-21T11:11:20+01:00 Josh Woodward. Licensed to th
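
The dump above can be split into frames by hand, which helps to confirm that 
the file itself is well formed. The sketch below, in Python for brevity, walks 
the first few frames of the dump. It is an illustration only, not Tika's 
parser: the names synchsafe and walk_frames are hypothetical, and it assumes an 
ID3v2.4 tag with no extended header and no unsynchronisation (the flags byte in 
the dump is 00). In ID3v2.4 both the tag size and the frame sizes are synchsafe 
integers, with 7 useful bits per byte.

```python
# Sketch only: list the frames at the start of an ID3v2.4 tag.

def synchsafe(raw: bytes) -> int:
    """Decode a synchsafe integer (top bit of every byte is unused)."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value


def walk_frames(data: bytes):
    """Return (major_version, tag_size, [(frame_id, body), ...])."""
    if data[:3] != b"ID3":
        raise ValueError("no ID3v2 header")
    major = data[3]
    tag_size = synchsafe(data[6:10])

    frames = []
    pos = 10                                  # skip the 10-byte tag header
    end = min(len(data), 10 + tag_size)       # the sample may be truncated
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":   # padding reached
            break
        size = synchsafe(data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + size]
        if len(body) < size:                  # frame cut off by truncation
            break
        frames.append((frame_id.decode("ascii"), body))
        pos += 10 + size
    return major, tag_size, frames


# First 90 bytes of the dump in this report (tag header plus four frames).
SAMPLE = bytes.fromhex(
    "49 44 33 04 00 00 00 01 18 32 54 49 54 32 00 00 "
    "00 09 00 00 03 53 77 61 6E 73 6F 6E 67 54 50 45 "
    "31 00 00 00 0E 00 00 03 4A 6F 73 68 20 57 6F 6F "
    "64 77 61 72 64 54 41 4C 42 00 00 00 0C 00 00 03 "
    "42 72 65 61 64 63 72 75 6D 62 73 54 44 52 4C 00 "
    "00 00 05 00 00 03 32 30 30 39"
)

if __name__ == "__main__":
    major, tag_size, frames = walk_frames(SAMPLE)
    print(major, tag_size)
    for frame_id, body in frames:
        print(frame_id, body)
```

Each text frame body starts with the encoding byte 03 (UTF-8), and the sizes 
in the frame headers match the lengths of the bodies, so the frames in the dump 
appear to be consistent.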



