The following are my thoughts on developing software that processes binary files, and on the need for a language that describes binary formats.


File formats are varied, and each requires different software for reading and manipulation. For some formats, such as SGML descendants and comma- or line-separated text files, there is usually software that can handle a whole family of similar formats. For example, any tag-based reader can be made to handle XML, HTML, MathML and other variants with little modification.
Other formats, usually ones that are binary and compressed, have less of a family resemblance when it comes to their internal structure. For example, two image formats can be very similar in capabilities and hold similar information, yet they cannot share the same reader software. Image manipulation software therefore has to handle all the different formats and their variations in separate bits of code, even though most of them share the same basic structural ingredients (e.g. headers, data buffers, what have you).


Specifications of binary formats are usually given in the form of C structs that denote the different types of data chunks and their order and relation within the file/document. These tell the programmer, via size and offset, where each fixed-size data field is located. This information is used to extract data from a file in a predetermined fashion; that data then determines how to extract the rest of the information.
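As a minimal sketch of this size-and-offset style of reading, consider the Python below. The four-field header layout is hypothetical and invented purely for illustration; it does not describe any real format. It shows fixed-size fields being unpacked first, and one of them (data_len) then being used to locate the variable-length part.

```python
import struct

# Hypothetical header, written the way a C-struct-style spec might give it:
#   uint32 magic;     /* offset 0, 4 bytes */
#   uint16 width;     /* offset 4, 2 bytes */
#   uint16 height;    /* offset 6, 2 bytes */
#   uint32 data_len;  /* offset 8, 4 bytes: length of the payload that follows */
HEADER = struct.Struct("<IHHI")  # little-endian, no padding: 12 bytes in total


def read_record(blob: bytes) -> dict:
    """Extract the fixed-size fields, then use them to locate the payload."""
    magic, width, height, data_len = HEADER.unpack_from(blob, 0)
    payload_start = HEADER.size
    payload = blob[payload_start:payload_start + data_len]
    return {"magic": magic, "width": width, "height": height, "payload": payload}


# Build a small in-memory sample instead of reading a real file.
sample = HEADER.pack(0xCAFEBABE, 640, 480, 3) + b"abc" + b"trailing"
record = read_record(sample)
```

Note that every offset and size is baked into the code; a different format means writing a different reader.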

There is a need for a single *standard* binary format description language. Format descriptions written in this language would instruct software how to read and manipulate files in arbitrary formats, much as XML can be used to describe many of the text-based formats that exist today. In fact, XML seems to me ideal for the task. Given such a language, generic software could be written that can process any type of binary file and present or manipulate it. For example, developers of image manipulation software could be handed the description of a certain image format and, with little additional coding effort, add support for that format to their software. For some types of software such support might even be added on the fly, in a way similar to how codecs work today in audio and video software.

The main advantage of this language would be for the developer community. It would allow a repository of all known, public binary formats to exist, which would make possible the development of generic software that processes binary files. Such generic software could serve as the code base for higher-level, more domain-specific software. This would ease development efforts and allow more interoperability between existing supported formats and new, emerging formats.

An example of how a binary format description language file might look (for an imaginary image format - 'beatmap'):

<format name="beatmap">
   <header mandatory="true">
      <file-size size="8" endian="big"/>
      <color-depth size="4"/>
      <width size="8"/>
      <height size="8"/>
      <compression>
         ...
      </compression>
      <data-chunk-size>1024</data-chunk-size>
   </header>
   <data-chunk optional="true" multiple="true"/>
</format>

A generic binary file editor, given this format specification, knows how to handle 'beatmap' files. It knows that the first 8 bytes denote the size of the file. It knows exactly where and how the rest of the header information can be extracted, and it knows how to handle the rest of the file: how to fetch the data chunks and how to process them. This editor may display to its users a tree representing the structure of a binary file so that they can analyze it.
Developers can use such generic binary format processing code to add support for 'beatmap' files to their own software. Since the data is already parsed by the lower-level code and presented to them as objects in their programming language, all that remains is to actually process the data: decompress, manipulate or display it.
Format conversion software can easily be designed and developed using similar files that describe how two formats differ and what should be done to convert a file in one to the other.
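To make the idea of a generic, description-driven reader more concrete, here is a rough sketch in Python. It is only an illustration under several assumptions of mine: the description is a well-formed XML variant of the 'beatmap' header with quoted attributes, byte order is given by an endian attribute, fields without one default to little-endian, and only the plain sized fields are handled. The function name read_header is likewise made up.

```python
import io
import xml.etree.ElementTree as ET

# Well-formed variant of the 'beatmap' header description (an assumption:
# attribute values are quoted and byte order is an explicit 'endian' attribute).
DESCRIPTION = """
<format name="beatmap">
  <header>
    <file-size size="8" endian="big"/>
    <color-depth size="4"/>
    <width size="8"/>
    <height size="8"/>
  </header>
</format>
"""


def read_header(description: str, stream) -> dict:
    """Read every sized field listed under <header>, in document order."""
    header = ET.fromstring(description).find("header")
    values = {}
    for field in header:
        size = int(field.get("size"))
        raw = stream.read(size)
        if len(raw) != size:
            raise ValueError(f"truncated file while reading {field.tag}")
        # Assumed default: little-endian unless the field says otherwise.
        values[field.tag] = int.from_bytes(raw, field.get("endian", "little"))
    return values


# A 28-byte in-memory 'beatmap' header built by hand for the example.
sample = (
    (28).to_bytes(8, "big")
    + (24).to_bytes(4, "little")
    + (640).to_bytes(8, "little")
    + (480).to_bytes(8, "little")
)
fields = read_header(DESCRIPTION, io.BytesIO(sample))
```

The reader itself contains no knowledge of 'beatmap'; handing it a different description is, in principle, all it takes to read a different header. The returned dictionary stands in for the "objects" that higher-level code would then decompress, manipulate or display.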


With today's proliferation of binary formats used to pass on different types of information in many distinct ways, a common language should exist that is able to describe them all - for the benefit of users and developers of software. This would be a significant step in the effort to standardize digital information.

Assaf Lavie


