For PCAP files, the "magic number" is modeled as data because it has two distinct valid values. One indicates that the file is bigEndian, the other that it is littleEndian.
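As a rough illustration of why the PCAP magic number has to be read as data, here is a minimal Python sketch that inspects the first four bytes and derives the byte order used for the rest of the header. It assumes the classic PCAP magic 0xA1B2C3D4 only (the nanosecond-resolution variant is ignored), and the function names pcap_byte_order and pcap_version are made up for this example. It is not how a DFDL schema expresses this.

```python
import struct

# Byte sequences a classic PCAP file can start with (magic 0xA1B2C3D4).
# Assumption: the nanosecond-resolution magic is deliberately left out.
PCAP_MAGIC_BIG_ENDIAN = bytes.fromhex("a1b2c3d4")
PCAP_MAGIC_LITTLE_ENDIAN = bytes.fromhex("d4c3b2a1")


def pcap_byte_order(header: bytes) -> str:
    """Return the struct byte-order prefix implied by the magic number."""
    magic = header[:4]
    if magic == PCAP_MAGIC_BIG_ENDIAN:
        return ">"  # big-endian
    if magic == PCAP_MAGIC_LITTLE_ENDIAN:
        return "<"  # little-endian
    raise ValueError(f"not a classic PCAP magic number: {magic.hex()}")


def pcap_version(header: bytes) -> tuple:
    """Read the major/minor version fields that follow the magic number."""
    order = pcap_byte_order(header)
    # Two unsigned 16-bit fields start at offset 4, right after the magic.
    major, minor = struct.unpack_from(order + "HH", header, 4)
    return major, minor
```

The magic number cannot be a fixed marker here, because its value decides how every later multi-byte field is interpreted.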
But for files that just start with a constant couple of bytes, like ASCII "MZ", it depends on whether you would ever want to go on and parse the file even when the MZ is incorrect. The only case I can think of is obscure: polyglot files, e.g. a file that is a PDF, a JPEG, and a jar file all at once. There are files that satisfy the requirements of multiple such formats. A file that parses as more than one format is of course highly suspect and likely dangerous in some way, but if you can't get past an incorrect magic number you can't even try to parse it as something else, so you can't find out whether it is a polyglot.

So modeling the magic number as data lets you decide whether to validate it, but doesn't prevent the parse from proceeding. Modeling it as an initiator stops the parse (or causes backtracking) if it doesn't match the expected value.

Maybe there are other cases.

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Thursday, September 19, 2019 6:38 AM
To: [email protected] <[email protected]>
Subject: Best Practice: "magic number" is an initiator or data?

Hello DFDL community,

Magic number: A constant numerical or text value used to identify a file format.

The start of every Windows EXE file is the magic number MZ. Should it be modeled as an initiator for the entire file or should it be modeled as 2 bytes of character data? What are the pros and cons of the two modeling approaches? What is best practice?

/Roger
