On Tue, 26 Oct 2004, Jim wrote:

> I have a binary file that I have been tasked to discover the format of
> and somehow convert the records to readable text. Is there any way I
> can find out what binary format the file is in, so I can create a
> template for unpack() to convert the binary to text?

The best place to start is with the `file` command, and the magic numbers behind it, which not nearly enough people know about these days.

On Unix systems (or Cygwin on Windows), `file` uses a database of magic numbers -- fingerprints for different file types -- to identify files, regardless of how the file is named (i.e. the file extension doesn't matter here).

For example, consider this output:

    % file ~/Movies/*
    61980main_PIA06410-movie.mov: Apple QuickTime movie file (moov)
    CoLC_fog.mov: Apple QuickTime movie file (mdat)
    Don_Quijote_animation.avi: RIFF (little-endian) data, AVI, 320 x 240, 25.00 fps, video: DivX 5, audio: (mono, 8000 Hz)
    Jon_Stewart_Crossfire.rm: RealMedia file
    Mahnamahna.mpeg: MPEG system stream data
    Movies: symbolic link to `/Volumes/d2/Movies'
    Tenacious D - Tribute.mpeg: MPEG system stream data
    The Incredibles - trailer.mov: Apple QuickTime movie file (moov)
    crossfire-20041015.wmv: Microsoft ASF
    crossfire-20041015001.mp4: Apple QuickTime movie file (ftyp)
    crossfire-20041015001.mp4.html: XML document text
    goingupriver.dmg: Apple Partition data block size: 512, first type: Apple_partition_map, name: Apple, number of blocks: 63, second type: Apple_HFS, name: disk image, number of blocks: 1325920,
    goingupriver.mov: Apple QuickTime movie file (moov)
    %

Note that this isn't looking at file extensions: there are multiple files with the ".mov" extension, but the command is able to figure out that they're actually different formats.

It works via the magic (ahem) of the magic database, which describes the expected markers for many file types. To illustrate, consider the GIF format.

Each GIF image begins with:

  * a signature, the three-character string "GIF"
  * a version string, either "87a" or "89a"
  * image width & height, two bytes each
  * a packed byte of flags describing the color table, one byte
  * a background color index, one byte

Here's what the magic database entry for GIF looks like:

    # GIF
    0       string          GIF8            GIF image data
    >4      string          7a              \b, version 8%s,
    >4      string          9a              \b, version 8%s,
    >6      leshort         >0              %hd x
    >8      leshort         >0              %hd

You can puzzle out for yourself how this notation works (each line is an offset, a data type, a value to test for, and the text to print; a leading ">" marks a test that only runs if the line above it matched), but it should be plain to see that the GIF fingerprint is being represented here.

So, long preamble aside, you want to do this in Perl, right?

It looks like the module you want is File::Type or File::MMagic:

    use File::Type;
    my $ft = File::Type->new();

    # read in data from file to $data, then
    my $type_from_data = $ft->checktype_contents($data);

    # alternatively, check file from disk
    my $type_from_file = $ft->checktype_filename($file);

    # convenient method for checking either a file or data
    my $type_1 = $ft->mime_type($file);
    my $type_2 = $ft->mime_type($data);

-- or --

    use File::MMagic;
    use FileHandle;

    my $mm = File::MMagic->new;                # use internal magic file
    # my $mm = File::MMagic->new('/etc/magic'); # use external magic file
    my $res = $mm->checktype_filename("/somewhere/unknown/file");

    my $fh = FileHandle->new("< /somewhere/unknown/file2");
    $res = $mm->checktype_filehandle($fh);

    my $data;
    $fh->read($data, 0x8564);
    $res = $mm->checktype_contents($data);

See <http://search.cpan.org/~pmison/File-Type/lib/File/Type.pm> or <http://search.cpan.org/~knok/File-MMagic-1.22/MMagic.pm> for details.

The File::Type page includes a brief overview of the different modules available, with critiques of why the author felt that the others didn't quite do the job (which you may or may not agree with, that's okay).

Take a look over these modules, then try writing some code (or cheat and just look it up with the `file` command) and let us know how it goes.

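Since the original question was about building a template for unpack(), here is a rough sketch of what that could look like for the GIF fingerprint described above. This is only an illustration, not code from either module: parse_gif_header is a made-up name, and the template assumes the standard 13-byte start of a GIF (signature, version, little-endian width and height, a flags byte, a background color index, and a trailing pixel aspect ratio byte).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Type or File::MMagic).
# Parses the first 13 bytes of a GIF and returns a hash ref,
# or nothing if the data doesn't look like a GIF.
sub parse_gif_header {
    my ($data) = @_;
    return if length($data) < 13;

    # a3 = 3-byte string, v = 16-bit little-endian unsigned,
    # C  = unsigned byte
    my ($sig, $version, $width, $height, $flags, $bg_index, $aspect)
        = unpack('a3 a3 v v C C C', $data);

    return unless $sig eq 'GIF';
    return unless $version eq '87a' || $version eq '89a';

    return {
        version  => $version,
        width    => $width,
        height   => $height,
        flags    => $flags,
        bg_index => $bg_index,
    };
}

# Build a fake header with pack() so the example runs without a file.
my $sample = pack('a3 a3 v v C C C', 'GIF', '89a', 320, 240, 0xF7, 0, 0);
my $info   = parse_gif_header($sample);
printf "GIF version %s, %d x %d\n", @{$info}{qw(version width height)};
```

With a real file you would open it, call binmode() on the handle, read() the first 13 bytes into a scalar, and pass that in. The same approach should carry over to an unknown record format once `file` or a hex dump has told you where the fields and their byte order are.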
-- Chris Devers