For PCAP files, the "magic number" is modeled as data because it has two 
distinct values: one indicates that the file is bigEndian, the other 
littleEndian.
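
To make that concrete outside of DFDL, here is a rough Python sketch (not a schema) in which the magic value is read as ordinary data and then used to pick the byte order for the fields that follow. The function names are made up for this example, and I am assuming the classic pcap magic 0xA1B2C3D4 followed by two 16-bit version fields; other pcap variants are not handled.

```python
import struct

# Classic pcap magic 0xA1B2C3D4 as it appears on disk in each byte order.
PCAP_MAGIC_BIG = bytes.fromhex("a1b2c3d4")
PCAP_MAGIC_LITTLE = bytes.fromhex("d4c3b2a1")


def pcap_byte_order(header: bytes) -> str:
    """Return the struct prefix ('>' or '<') implied by the first 4 bytes."""
    magic = header[:4]
    if magic == PCAP_MAGIC_BIG:
        return ">"
    if magic == PCAP_MAGIC_LITTLE:
        return "<"
    raise ValueError(f"not a recognized pcap magic number: {magic.hex()}")


def pcap_version(header: bytes):
    """Decode the major/minor version fields using the detected byte order."""
    order = pcap_byte_order(header)
    # Two unsigned 16-bit integers follow the 4-byte magic number.
    return struct.unpack(order + "HH", header[4:8])
```

The point is only that the magic bytes carry information (the byte order), so they cannot be treated as a single fixed literal.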

But for files that just start with a couple of constant bytes, like ASCII "MZ", 
it depends on whether you ever want to go on parsing the file even if the MZ is 
incorrect.

The only case I can think of for that is obscure: polyglot files, e.g., a file 
that is at once a PDF, a JPEG, and a jar file. There are files that satisfy 
the requirements of multiple such formats. A file that parses as more than one 
format is of course highly suspect and likely dangerous in some way, but if you 
can't get past the incorrect magic number you can't even try to parse it as 
something else, so you can't find out whether it is a polyglot. Modeling the 
magic number as data lets you decide whether to validate it, and a mismatch 
doesn't prevent the parse from proceeding.

Modeling it as an initiator, by contrast, stops the parse (or causes 
backtracking) if the data doesn't match the expected value.
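
The difference between the two approaches can be sketched in Python as well. This is only an analogy for the behavior, not DFDL: the `Header` class and both function names are hypothetical, and a real DFDL processor would backtrack or report a parse error rather than raise a Python exception.

```python
from dataclasses import dataclass

EXPECTED_MAGIC = b"MZ"


@dataclass
class Header:
    magic: bytes    # the two bytes actually found in the data
    magic_ok: bool  # validation result, recorded but not enforced
    rest: bytes     # everything after the magic number


def parse_magic_as_data(data: bytes) -> Header:
    """Read the 2 bytes as a field; record validity but keep parsing."""
    magic = data[:2]
    return Header(magic=magic, magic_ok=(magic == EXPECTED_MAGIC), rest=data[2:])


def parse_magic_as_initiator(data: bytes) -> bytes:
    """Require the 2 bytes as fixed syntax; fail the parse if they are absent."""
    if not data.startswith(EXPECTED_MAGIC):
        raise ValueError("initiator 'MZ' not found")
    return data[2:]
```

With the first function, a file starting with "%PDF" still yields a result whose `magic_ok` is false, so the caller can go on to try other formats. With the second, the same input fails immediately.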

Maybe there are other cases.


________________________________
From: Costello, Roger L. <[email protected]>
Sent: Thursday, September 19, 2019 6:38 AM
To: [email protected] <[email protected]>
Subject: Best Practice: "magic number" is an initiator or data?

Hello DFDL community,

Magic number: A constant numerical or text value used to identify a file format.

The start of every Windows EXE file is the magic number MZ.

Should it be modeled as an initiator for the entire file or should it be 
modeled as 2 bytes of character data? What are the pros and cons of the two 
modeling approaches? What is best practice?

/Roger
