Re: [Freedos-devel] EDR-DOS news: Single-file load, JWasm port, ident86

E. C. Masloch via Freedos-devel Fri, 09 Aug 2024 08:36:38 -0700

Hello Eric, hello list,

On 2024-07-26 at 13:04 +0200, Eric Auer via Freedos-devel wrote:
>
> Hi! News from BTTR:
>
> https://www.bttr-software.de/forum/board_entry.php?id=20959&page=0&order=time&category=0
>
> while working on a single-file version of the EDR-DOS kernel,
> GPT partition support was discussed and ECM mentioned a new
> interesting tool: IDENT86.

This mail is now a few weeks old but I wanted to correct this. While the thread in the forum was originally about single-file load, the single-file load topic goes back to my first releases of it in late 2023 December and is not directly related to the JWasm port or the identicalising tool ident86. (We have the bad habit of re-using forum threads occasionally.)

The single-file load comes in several variants. One branch is based on my lDOS boot stages. My builds [1] include four different kernel variants based on this branch of development. Any single file of these will act as a complete kernel. The variants are:

* Either using lDOS iniload (*.com named file, can be loaded as many different formats including as DOS application) or lDOS drload (*.sys named file, can only be loaded as an EDR-DOS/FreeDOS kernel file) as an outermost wrapper.

* Either using lDOS inicomp (initial loader compression) stage (edrpack named file), or not (edrdos named file).

* Also, the internal zerocomp compression of the drbio and drdos stages may be enabled or disabled.

In addition, the inicomp stage may utilise a number of different compression formats. The compression occurs at build time using the mak script from the kernwrap repo (based on lDebug's mak.sh script). The depacker is included in the inicomp stage, and is selected at build time to match the format of the compressed payload. The default builds ship files built with several alternative inicomp methods, in the tmp/ subdirectories of the build. These range from the super-fast zerocomp (based on DR-DOS's original kernel compression) to the still fairly fast LZSA2 and down to smaller resulting files using better ratio compression methods such as Exomizer 3.x, APL, or LZMA-lzip.

The edrpack.* files in the bin/ subdirectory are currently selected from the smallest method, which is LZMA-lzip. This can take several minutes to depack on slow machines [2], which is the reason I added a progress indicator that can be switched to one of several types using the patchpro tool [3]. The orders of magnitude can be observed in the results of the INICOMP_SPEED_TEST as well, which I did post in another discussion [4] for several formats as run on our Debian Linux amd64 server running dosemu2 (no KVM). This ranges from 8ms per run (zerocomp) to 488ms per run (LZMA-lzip), a factor of 61.

There is another branch of a single-file load, developed by Bernd (Böckmann) [5]. This one works without the lDOS staged model, resulting in a kernel that is only loadable as an EDR-DOS or FreeDOS kernel file (similar to lDOS's drload), as well as limiting the choice of compression. Only the zerocomp compression adapted from DR-DOS's original kernel compression is currently available for this branch. Lacking the lDOS stages overhead the kernel is smaller than the uncompressed lDOS edrdos.* files, but not as small as the lDOS edrpack.* files with better ratio compression. The kernel filename for this "flavor" (as the repo's action artefacts call them [6]) is kernel.sys, matching the FreeDOS kernel's conventional name.


> This is able to confirm that the
> JWASM port of the kernel is identical to a version made with
> another Assembler down to the single machine code instruction
> level, only leaving encoding differences without influence on
> semantics between the original and the JWASM port.

This is true, but ident86 has learned some additional tricks since.

An overview, as I still haven't written much of any documentation for ident86:

The basic data item is a range of different bytes. Such a range is typically bookended by at least 16 Bytes without a difference, both before and after the range. The 16 Bytes length has been chosen because a valid x86 instruction cannot exceed 16 Bytes in length.
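
As a rough illustration of how such ranges could be collected (a minimal sketch, not ident86's actual code; the function name diff_ranges and the choice to compare only the common length of both inputs are assumptions made for this example):

```python
def diff_ranges(a: bytes, b: bytes, gap: int = 16):
    """Return (start, end) pairs, end exclusive, covering differing bytes.

    Differences separated by fewer than `gap` identical bytes are merged
    into one range, so every range is bookended by at least `gap` matching
    bytes (or by the start or end of the data).
    """
    n = min(len(a), len(b))  # assumption: only the common length is compared
    ranges = []
    start = last = None
    for i in range(n):
        if a[i] == b[i]:
            continue
        if start is None:
            start = i                      # first difference opens a range
        elif i - last - 1 >= gap:
            ranges.append((start, last + 1))  # enough matching bytes: close it
            start = i
        last = i
    if start is not None:
        ranges.append((start, last + 1))
    return ranges
```

With the default gap of 16, two differences that are only one byte apart end up in the same range, while a difference 27 matching bytes later starts a new one.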

When a range is being handled (in the function handlerange [7]) the corresponding data from either file is fed to an lDebug instance running in the background in a VM (qemu or dosemu2). Then this data is disassembled. Instruction boundaries are found by referencing an optional trace listing file, specified as the third file to ident86's command line. The function disassemble [8] does some initial postprocessing to drop irrelevant differences in the disassembly, such as the MODRM keyword that indicates a certain operand encoding order, expanding imms8 signed 8-bit immediates, and changing the "segmented-address hexdump disassembly" format to a "file-seek length disassembly" format.

Then the two disassemblies are compared. This usually proceeds line by line, but sometimes multiple lines from one side may be matched to a single line from the other side. Matching lines are hidden from display. If an entire range is made up of matching lines, that is the processing of the disassemblies ends up having matched all lines from both disassemblies, then the "no difference" line is displayed. If a difference is found, then the remaining disassembly lines (after any possible leading matches) are displayed.

Some magic starts to happen when this occurs and the -s switch (side by side view) is specified. In the side by side view, disassembly lines from file 1 are displayed at the beginning of a line while lines from file 2 are displayed at an offset of at least 40 columns to the right of the beginning of a line. One display line may contain disassembly lines from both files, or from only one of them or the other. ident86 will try to sync up lines so disassemblies with the same starting address are paired.
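
The layout described above can be sketched as follows. This is only an illustration under assumptions: the function name side_by_side, the (address, text) tuples, and the six-digit address format are invented for the example and are not taken from ident86.

```python
def side_by_side(left, right, column=40):
    """Pair two disassemblies by starting address.

    left and right are lists of (address, text), sorted by address.
    File 1 lines start at column 0, file 2 lines at `column` or later.
    """
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        la = left[i][0] if i < len(left) else None
        ra = right[j][0] if j < len(right) else None
        if la is not None and (ra is None or la < ra):
            # Only file 1 has a line starting at this address.
            out.append(f"{la:06X} {left[i][1]}")
            i += 1
        elif ra is not None and (la is None or ra < la):
            # Only file 2 has a line starting at this address.
            out.append(" " * column + f"{ra:06X} {right[j][1]}")
            j += 1
        else:
            # Same starting address: show both on one display line.
            l = f"{la:06X} {left[i][1]}"
            out.append(l.ljust(column - 1) + " " + f"{ra:06X} {right[j][1]}")
            i += 1
            j += 1
    return out
```

Padding to column - 1 and then adding one space keeps the file 2 text at column 40 for short file 1 lines and pushes it further right, still separated by a space, when the file 1 line is longer.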

If the address of a file 2 line matches the address of the paired file 1 line, then the file 2 line address is replaced by the keyword "same". If all of the address and the length and the disassembly match, the file 2 line is replaced entirely by the text "samesame". If the disassemblies compare as being similar in a fuzzy logic comparison, then the file 2 disassembly is marked with a comment reading "; fuzzysame".
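
A minimal sketch of these three markers, assuming each disassembly line is available as an (address, length, text) tuple. The function name label_file2, the output layout, and the fuzzy callback are hypothetical; the default callback is a placeholder that treats identical text as fuzzily similar.

```python
def label_file2(line1, line2, fuzzy=lambda a, b: a == b):
    """Return the display text for a file 2 line paired with a file 1 line.

    Each line is an (address, length, text) tuple. `fuzzy` is a predicate
    deciding whether two disassembly texts are similar enough.
    """
    addr1, len1, text1 = line1
    addr2, len2, text2 = line2
    if (addr1, len1, text1) == (addr2, len2, text2):
        return "samesame"                    # everything matches
    addr = "same" if addr1 == addr2 else f"{addr2:06X}"
    shown = f"{addr} {len2} {text2}"
    if fuzzy(text1, text2):
        shown += " ; fuzzysame"              # similar, but not identical lines
    return shown
```

For example, a pair at the same address whose branch targets differ would show as "same 3 jmp 0204", with "; fuzzysame" appended once a suitable fuzzy predicate is passed in.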

The fuzzy logic is needed because we want to match lines that may have slightly different addresses encoded into them (as immediate operands, branch target operands, or address offsets) but encode the same meaning of an instruction. These lines are usually uninteresting for identicalising work, but may occur en masse if a later difference has a differing length so that subsequent addresses (that may be referenced from before that difference) are all shifted by a small number. The first line that differs such that the line from file 2 is neither "samesame" nor "fuzzysame" is marked as the earliest definitive difference. This comes into play next.
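
One simple way to approximate such a comparison is to mask every number in the operands before comparing. The sketch below does that; it is a stand-in technique and not necessarily what ident86 does. It assumes lDebug-style disassembly text where numbers are bare hexadecimal and the first word is the mnemonic, and it compares texts only (addresses and lengths are ignored). All names are hypothetical.

```python
import re

# Bare hexadecimal tokens in the operand field, for example 0123 or 7F.
_HEX = re.compile(r"\b[0-9a-f]+\b")


def normalise(text):
    """Split off the mnemonic and mask all hex numbers in the operands."""
    mnemonic, _, operands = text.strip().lower().partition(" ")
    # The mnemonic is kept verbatim so that "add" and "adc" (both made up
    # of hex digits only) are never confused with numbers.
    return mnemonic, _HEX.sub("#", operands)


def fuzzy_same(a, b):
    """True if both lines are the same instruction apart from numbers."""
    return normalise(a) == normalise(b)


def earliest_definitive_difference(pairs):
    """Index of the first (text1, text2) pair that is neither identical
    nor fuzzily the same, or None if there is no such pair."""
    for i, (t1, t2) in enumerate(pairs):
        if t1 != t2 and not fuzzy_same(t1, t2):
            return i
    return None
```

Register names such as ax or cl are not affected by the masking because they contain letters outside the hexadecimal digits.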

ident86 ships with some logic to inspect an earliest difference to figure out what change needs to be made to undo the difference. This recognises differing length but (fuzzy) matching disassembly texts where the shorter instruction is not followed by enough NOP instructions to level the length difference, as well as missing or differing segment override prefixes. ident86 will display a hint as to what change is needed.

When the -e switch and either or both of the -S and -E switches are used, and a trace listing file is passed as the third filename, then the address of the first hint displayed is crossreferenced with the trace listing to obtain the source text file that needs to be edited. (The -p switch can be specified once or multiple times to give regular expression patterns that convert the "trace listing source" filenames to source text filenames.) The -S switch will make it so the relevant part of the source is displayed. With the -E switch specified, ident86 will go ahead and edit the source appropriately. (This assumes that the trace listing file and the source text are in NASM format.)

If the -r and -b switches are both specified along with -e and -E, then after an edit is done, ident86 will loop back to its beginning, re-build the binary to be identicalised using the scriptlet specified with the -b switch, then re-start the comparison of both files. This allows it to automatically apply several edits in a row to aid in identicalising the source text.

The -c switch allows specifying a cookie file, which will be used in subsequent runs to skip all bytes before the prior run's earliest definitive difference. These lines must have contained only samesame or fuzzysame lines, so it is assumed that they would still match. It is expected that after an -e -E -b -r -c run, the full resulting binary is checked by another run without -c or -m (minimum offset to examine).
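
The idea behind the cookie file can be sketched like this. The file format used here (a single hexadecimal offset as text) and all function names are purely hypothetical; the actual format of ident86's cookie file is not described in this mail.

```python
from pathlib import Path


def load_min_offset(cookie: Path) -> int:
    """Read the offset stored by a prior run, or 0 if there is none."""
    try:
        return int(cookie.read_text().strip(), 16)
    except (FileNotFoundError, ValueError):
        return 0          # no usable cookie: examine the file from the start


def save_min_offset(cookie: Path, offset: int) -> None:
    """Remember the earliest definitive difference for the next run."""
    cookie.write_text(f"{offset:X}\n")


def ranges_to_examine(ranges, min_offset):
    """Drop (start, end) ranges that end before the remembered offset."""
    return [(s, e) for (s, e) in ranges if e > min_offset]
```

Everything before the stored offset is trusted to still match, which is why a final run without the cookie is needed to check the full binary.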

I'm using ident86, along with the fixmem script and associated NASM macros, to port several programs to NASM. You may want to watch my blog [9] to learn more about this.


> Creating a
> byte for byte identical version (which a binary checksum would
> be able to confirm) would have required manually enforcing the
> choice of encodings, which does not make code nicer, I think.

Yes, this is true. Especially as some assemblers, including my preferred target of NASM (the Netwide Assembler), lack a way to specify which register operand is to be encoded as the ModR/M operand. My debugger does now allow disassembling and assembling with a MODRM keyword [10] to depict or enforce a particular order.

So to encode a non-default order of operands, NASM requires using db (Define Data Bytes) directives to directly emit machine code bytes rather than assembling mnemonic instructions. This is obviously not desirable if an exacting byte-by-byte match is not required, hence ident86. ident86 automates several tasks I used to carry out manually when working to identicalise ports of assembly language programs.
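
To illustrate why two assemblers can disagree here, the sketch below computes both valid encodings of a 16-bit register-to-register MOV (opcodes 89 /r and 8B /r in 16-bit code). The helper names are invented for this example. An assembler emits one of the two forms for the mnemonic; the other form would have to be written out with db, for example db 8Bh, 0C3h.

```python
# 16-bit general purpose register numbers as used in the ModR/M byte.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}


def modrm(reg, rm):
    """Build a ModR/M byte with mod = 11b (register-direct operand)."""
    return 0xC0 | (reg << 3) | rm


def mov_r16_r16(dst, src):
    """Return both encodings of `mov dst, src` for 16-bit registers."""
    d, s = REG16[dst], REG16[src]
    store_form = bytes([0x89, modrm(reg=s, rm=d)])  # 89 /r: MOV r/m16, r16
    load_form = bytes([0x8B, modrm(reg=d, rm=s)])   # 8B /r: MOV r16, r/m16
    return store_form, load_form


if __name__ == "__main__":
    for encoding in mov_r16_r16("ax", "bx"):
        print(encoding.hex(" ").upper())
```

Both byte sequences execute identically, which is the kind of difference ident86 treats as irrelevant while a byte-for-byte checksum comparison would not.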


> I guess there are several technology news bits in this thread
> which can be interesting for us :-)

Regards,
ecm


[1]: https://pushbx.org/ecm/download/old/edrdos/
[2]: https://github.com/SvarDOS/edrdos/issues/69#issuecomment-2244834392
[3]: https://pushbx.org/ecm/download/old/patchini/
[4]: https://github.com/SvarDOS/edrdos/issues/28#issuecomment-2248801739
[5]: https://github.com/boeckmann/
[6]: https://github.com/SvarDOS/edrdos/actions/
[7]: https://hg.pushbx.org/ecm/ident86/file/9acc3fd2c289/ident86.py#l1094
[8]: https://hg.pushbx.org/ecm/ident86/file/9acc3fd2c289/ident86.py#l345
[9]: https://pushbx.org/ecm/dokuwiki/blog/pushbx
[10]: https://pushbx.org/ecm/doc/ldebug.htm#asmref-nasm


_______________________________________________
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel
