Hachoir is a framework for binary file manipulation: file format
recognition, metadata extraction, searching for files embedded in any
binary stream (forensics), viewing file content in a human-readable
representation, etc. It is composed of several components:

 Programs:
 * hachoir-metadata: fault tolerant metadata extraction;
 * hachoir-subfile: search for subfiles in a disk image or any other
   binary stream;
 * hachoir-urwid, hachoir-wx, hachoir-gtk, hachoir-http: user
   interfaces to view file content (curses, wxPython, pygtk,
   web+ajax);

 Modules:
 * hachoir-core: library to split binary data into a field tree;
 * hachoir-parser: collection of 70 file format parsers;
 * hachoir-regex: regular expression optimization/manipulation and
   pattern matching (used by hachoir-subfile).
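
To give an idea of what searching for subfiles in a binary stream
involves, here is a minimal sketch in plain Python 3. It does not use
hachoir-subfile or hachoir-regex: the MAGIC table, the PATTERN object
and the find_subfiles() function are hypothetical names made up for
this illustration, and only the standard "re" module is assumed.

```python
import re

# Hypothetical signature table (not Hachoir's): format name -> magic bytes.
MAGIC = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
}

# Build one alternation of all escaped signatures, each in a named group,
# so the stream only has to be scanned once.
PATTERN = re.compile(
    b"|".join(
        b"(?P<%s>%s)" % (name.encode("ascii"), re.escape(magic))
        for name, magic in MAGIC.items()
    )
)


def find_subfiles(data):
    """Return a list of (offset, format_name) for each signature in data."""
    return [(match.start(), match.lastgroup) for match in PATTERN.finditer(data)]


if __name__ == "__main__":
    image = b"\x00" * 10 + b"\xff\xd8\xff\xe0" + b"junk" + b"PK\x03\x04" + b"rest"
    for offset, name in find_subfiles(image):
        print("%s signature at offset %d" % (name, offset))
```

A real tool would also have to validate each candidate (a signature
alone produces false positives) and estimate the subfile size; this
sketch only reports where known signatures appear.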


Project website:
  http://hachoir.org/

List of supported file formats:
  http://hachoir.org/wiki/hachoir-parser#Listofparsers
  (jpeg, ttf, exe, rar, ogg, ntfs, ole2, torrent, ...)

Examples of metadata extraction:
  http://hachoir.org/wiki/hachoir-metadata/examples

hachoir-wx screenshots:
  http://hachoir.org/wiki/hachoir-wx#Screenshots


Hachoir works on any operating system and only depends on Python (2.4+).
Packages are available for Debian, Mandriva, Gentoo, Arch and FreeBSD.

The goal of hachoir-core is to ease the writing of binary parsers. It
takes care of endianness, works at bit resolution (for addresses and
sizes), and uses only Unicode for text. It gives the programmer a nice
API (see the parsers' source code): each field is an object. Parsing is
lazy: a field's value, display string, description, etc. are computed
on demand (when the program asks for them). This makes it possible to
parse very complex structures and huge files (60 GB or more is not a
problem).
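
The idea of a lazy field with bit-resolution addresses can be sketched
as follows in plain Python 3. This is not the hachoir-core API: the
LazyUInt class and its attributes are hypothetical, and the sketch
assumes the byte order only applies to byte-aligned fields, while
unaligned fields are read as a big-endian bit stream.

```python
class LazyUInt:
    """Hypothetical unsigned integer field (not a hachoir-core class).

    The address and size are expressed in bits. The value is read from
    the data only on first access, then cached.
    """

    def __init__(self, data, name, address, size, byteorder="big"):
        self.data = data
        self.name = name
        self.address = address      # offset from the start of data, in bits
        self.size = size            # field size, in bits
        self.byteorder = byteorder  # "big" or "little"
        self._value = None          # nothing computed yet

    @property
    def value(self):
        if self._value is None:
            self._value = self._read()
        return self._value

    def _read(self):
        first = self.address // 8
        last = (self.address + self.size + 7) // 8
        chunk = self.data[first:last]
        if self.address % 8 == 0 and self.size % 8 == 0:
            # Byte-aligned field: honour the requested byte order.
            return int.from_bytes(chunk, self.byteorder)
        # Unaligned field: take the bits out of a big-endian bit stream.
        as_int = int.from_bytes(chunk, "big")
        shift = len(chunk) * 8 - (self.address % 8) - self.size
        return (as_int >> shift) & ((1 << self.size) - 1)


if __name__ == "__main__":
    data = bytes([0x12, 0x34, 0xAB])
    width = LazyUInt(data, "width", 0, 16, byteorder="little")
    flags = LazyUInt(data, "flags", 20, 4)
    print(width.name, hex(width.value))
    print(flags.name, hex(flags.value))
```

Because nothing is read until "value" is accessed, building a large
tree of such objects stays cheap; only the fields the program looks at
cost any work.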

hachoir-core and hachoir-metadata are "fault tolerant": on a parser or
extractor error, or a file error (truncated or damaged file), the
program does not stop but continues to the next valid state. This makes
it possible to extract information from badly damaged files.
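
The "continue to the next valid state" behaviour can be illustrated
with a small sketch in plain Python 3. It is not Hachoir code: the
record layout (one sync byte, one length byte, then the payload) and
the parse_records() function are assumptions invented for this
example.

```python
def parse_records(data, sync=b"\xAA"):
    """Parse hypothetical records laid out as: sync byte, length, payload.

    On an error the parser does not stop: it logs the problem and
    resumes at the next sync byte. Returns (records, errors).
    """
    records, errors = [], []
    pos = 0
    while pos < len(data):
        try:
            if data[pos:pos + 1] != sync:
                raise ValueError("bad sync byte at offset %d" % pos)
            if pos + 2 > len(data):
                raise ValueError("truncated header at offset %d" % pos)
            length = data[pos + 1]
            end = pos + 2 + length
            if end > len(data):
                raise ValueError("truncated payload at offset %d" % pos)
            records.append(data[pos + 2:end])
            pos = end
        except ValueError as err:
            errors.append(str(err))
            # Resynchronise: jump to the next candidate record start.
            pos = data.find(sync, pos + 1)
            if pos == -1:
                break
    return records, errors


if __name__ == "__main__":
    damaged = b"\xAA\x02hi" + b"\x00\x00" + b"\xAA\x03abc" + b"\xAA\x09xy"
    records, errors = parse_records(damaged)
    print(records)
    print(errors)
```

In this sketch the two valid records are recovered even though the
stream contains garbage bytes in the middle and a truncated record at
the end.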

hachoir-metadata creates a dictionary with typed values: the track
number is an integer, the creation date is a datetime.datetime object,
etc., and all text is stored as Unicode strings. The API allows easy
reuse of the extracted data.
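
As a rough illustration of what "typed values" means, here is a sketch
in plain Python 3. It does not call hachoir-metadata: the
typed_metadata() function, the key names and the date format are all
hypothetical and only show the kind of result described above.

```python
from datetime import datetime


def typed_metadata(raw):
    """Convert raw tags (hypothetical keys) into typed values.

    Byte strings are decoded to Unicode first, then known keys are
    converted to int or datetime; everything else stays Unicode text.
    """
    converters = {
        "track_number": int,
        "creation_date": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
    }
    result = {}
    for key, value in raw.items():
        if isinstance(value, bytes):
            value = value.decode("utf-8", "replace")
        result[key] = converters.get(key, str)(value)
    return result


if __name__ == "__main__":
    meta = typed_metadata({
        "track_number": b"7",
        "creation_date": "2007-06-01 12:30:00",
        "title": b"Caf\xc3\xa9",
    })
    for key, value in meta.items():
        print(key, type(value).__name__, value)
```

With typed values the caller can sort by track number or compare dates
directly, without parsing strings again.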

The source code has good coverage from automated tests (many test
cases). Fuzzing is sometimes used to find more bugs.

Some experimental programs also exist, such as hachoir-strip, which
removes personal information (author name, timestamp, copyright, etc.)
from a picture, movie, sound, archive, etc. Another example is
swf_extract.py, which extracts pictures and sounds from a SWF (Flash)
document.

Victor Stinner aka haypo

-- 
http://mail.python.org/mailman/listinfo/python-announce-list

        Support the Python Software Foundation:
        http://www.python.org/psf/donations.html
