> Jose E. Marchesi wrote:
>>> Jose E. Marchesi wrote:
>>>     
>>>>> [...]
>>>>>             
>>>>>> I agree that distcheck is good but not a cure all.  Any static
>>>>>> system can be attacked when there is motive, and unit tests are
>>>>>> easily gamed.
>>>>>>                 
>>>>> The issue seems to be releases containing binary data for unit tests,
>>>>> instead of source or scripts to generate that data.  In this case,
>>>>> that binary data was used to smuggle in heavily obfuscated object
>>>>> code.
>>>>>             
>>>> As a side note, GNU poke (https://jemarch.net/poke) is good for
>>>> generating arbitrarily complex binary data from clear textual
>>>> descriptions.
>>>>       
>>> While it is suitable for that use, at last check poke is itself very
>>> complex, complete with its own JIT-capable VM.  This is good for
>>> interactive use, but I get nervous about complexity in testsuites,
>>> where simplicity can greatly aid debugging, and it /might/ be possible
>>> to hide a backdoor similarly in a poke pickle.  (This seems to be a
>>> general problem with powerful interactive editors.)
>>>     
>>
>> Yes, I agree simplicity it is very desirable, in testsuites and actually
>> everywhere else.  I also am not fond of dragging in dependencies.
>>   
>
> Exactly---I am sure that poke is great for interactive use, but a
> self-contained solution is probably better for a testsuite.
>
>> But I suppose we also agree in that it is not possible to assembly
>> non-trivial binary data structures in a simple way, without somehow
>> moving the complexity of the encoding into some sort of generator, which
>> will not be simple.  The GDB testsuite, for example, ships with a DWARF
>> assembler written in around 3000 lines of Tcl.  Sure, it is simpler than
>> poke and doesn't drag in additional dependencies.  But it has to be
>> carefully maintained and kept up to date, and the complexity is there.
>>   
>
> The problem for a compression tool testsuite is that compression
> formats are (I believe) defined as byte-streams or bit-streams.
> Further, the generator(s) must be able to produce /incorrect/ output
> as well, in order to test error handling.
>>> Further, GNU poke defines its own specialized programming language for
>>> manipulating binary data.  Supplying generator programs in C (or C++)
>>> for binary test data in a package that itself uses C (or C++) ensures
>>> that every developer with the skills to improve or debug the package
>>> can also understand the testcase generators.
>>>     
>>
>> Here we will have to disagree.
>>
>> IMO it is precisely the many and tricky details on properly marshaling
>> binary data in general-purpose programming languages that would have
>> greater odds to lead to difficult to understand, difficult to maintain
>> and possibly buggy or malicious encoders.  The domain specific language
>> is here an advantage, not a liability.
>>
>> This you need to do in C to encode and generate test data for a single
>> signed 32-bit NUMBER in an output file in a _more or less_ portable way:
>>
>>   void generate_testdata (off_t offset, int endian, int number)
>>   {
>>     int bin_flag = 0, fd;
>>
>>   #ifdef _WIN32
>>     int bin_flag = O_BINARY;
>>   #endif
>>     fd = open ("testdata.bin", bin_flag, S_IWUSR);
>>     if (fd == -1)
>>       fatal ("error generating data.");
>>         if (endian == BIG)
>>       {
>>         b[0] = (number >> 24) & 0xff;
>>         b[1] = (number >> 16) & 0xff;
>>         b[2] = (number >> 8) & 0xff;
>>         b[3] = number & 0xff;
>>       }
>>     else
>>       {
>>         b[3] = (number >> 24) & 0xff;
>>         b[2] = (number >> 16) & 0xff;
>>         b[1] = (number >> 8) & 0xff;
>>         b[0] = number & 0xff;
>>       }
>>
>>     lseek (fd, offset, SEEK_SET);
>>     for (i = 0; i < 4; ++i)
>>       write (fd, &b[i], 1);
>>     close (fd);
>>   }
>>   
>
> While that is a nice general solution, (aside from neglecting the
> declaration "uint8_t b[4];"; with "int b[4];", the code would only
> work on a little-endian processor; with no declaration, the compiler
> will reject it) a compression format would be expected to define the
> endianess of stored values, so the major branch in that function would
> collapse to just one of its alternatives.  Compression formats are
> generally defined as streams, so a different decomposition of the
> problem would likely make more sense:  (example untested)
>
>    void emit_int32le (FILE * out, int value)
>    {
>      unsigned int R, i;
>
>      for (R = (unsigned int)value, i = 0; i < 4; R = R >> 8, i++)
>        if (fputc(R & 0xff, out) == EOF)
>          fatal("error writing int32le");
>    }
>      Other code handles opening OUT, or OUT is actually stdout and we
>      are writing down a pipe or the shell handled opening the file.
>      (The main function can easily check that stdout is not a terminal
>      and bail out if it is.)  Remember that I am suggesting test
>      generator programs, which do not need to be as general as
>      ordinary code, nor do they need the same level of
>      user-friendliness, since they are expected to be run from scripts
>      that encode the precise knowledge of how to call them.  (That
>      this version is also probably more efficient by avoiding a
>      syscall for every byte written is irrelevant for its intended
>     use.)
>> This is the Poke equivalent:
>>
>>   fun generate_testdata = (offset<uint<64>,B> off, int<32> endian, int<32> 
>> number) void:
>>   {
>>     var fd = open ("testdata.bin");
>>     set_endian (endian);
>>     int<32> @ fd : off = number;
>>     close (fd);
>>   }
>>
>> And thanks to the DSL, this scales nicely to more complex structures,
>> such as an ELF64 relocation instead of a signed 32-bit integer:
>>
>>   fun generate_testdata = (offset<uint<64>,B> off, int<32> endian, int<32> 
>> number) void:
>>   {
>>     type Elf64_RelInfo =
>>       struct Elf64_Xword
>>       {
>>         uint<32> r_sym;
>>         uint<32> r_type;
>>       };
>>         type Elf64_Rela =
>>       struct
>>       {
>>         offset<uint<64>,B> r_offset;
>>         Elf64_RelInfo r_info;
>>         offset<int<64>,B> r_addend;
>>       };
>>         var fd = open ("got32reloc.bin");
>>     set_endian (endian);
>>     Elf64_Rela @ 0#B
>>       = Elf64_Rela { r_info = Elf64_RelInfo { r_sym = 0xff00, r_type = 3 } }
>>     close (fd);
>>   }
>>   
>
> Indeed Poke's language is excellent for concisely describing complex
> structures accurately, closely resembling C's syntax, but if I
> understand correctly, compression tool test data has few complex
> structures and is mostly byte-streams.
> Also, how did poke know to send the structure to "got32reloc.bin"?
> Was it simply the most-recently-opened file (or perhaps the top entry
> on a stack of recently-opened files)?  Were the OFF and NUMBER
> parameters intentionally unused, accidentally unused, or somehow
> implicitly used in the second example?

The map in the second example ougth to be:

  Elf64_Rela @ fd : 0#B
    = Elf64_Rela { r_info = Elf64_RelInfo { r_sym = 0xff00, r_type = 3 } }

>
> -- Jacob

Reply via email to