Hello Eric, hello list,
On 2024-07-26 at 13:04 +0200, Eric Auer via Freedos-devel wrote:
>
> Hi! News from BTTR:
>
>
> https://www.bttr-software.de/forum/board_entry.php?id=20959&page=0&order=time&category=0
>
> while working on a single-file version of the EDR-DOS kernel,
> GPT partition support was discussed and ECM mentioned a new
> interesting tool: IDENT86.
This mail is now a few weeks old but I wanted to correct this. While the
thread in the forum was originally about single-file load, the
single-file load topic goes back to my first releases of it in late
December 2023 and is not directly related to the JWasm port or the
identicalising tool ident86. (We have the bad habit of re-using forum
threads occasionally.)
The single-file load comes in several variants. One branch is based on
my lDOS boot stages. My builds [1] include four different kernel
variants based on this branch of development. Any single file of these
will act as a complete kernel. The variants are:
* Either using lDOS iniload (*.com named file, can be loaded as many
different formats including as DOS application) or lDOS drload (*.sys
named file, can only be loaded as an EDR-DOS/FreeDOS kernel file) as an
outermost wrapper.
* Either using lDOS inicomp (initial loader compression) stage (edrpack
named file), or not (edrdos named file).
* Also, the internal zerocomp compression of the drbio and drdos stages
may be enabled or disabled.
In addition, the inicomp stage may utilise a number of different
compression formats. The compression occurs at build time using the mak
script from the kernwrap repo (based on lDebug's mak.sh script). The
depacker is included in the inicomp stage, and is selected at build time
to match the format of the compressed payload. The default builds ship
files built with several alternative inicomp methods, in the tmp/
subdirectories of the build. These range from the super-fast zerocomp
(based on DR-DOS's original kernel compression) to the still fairly fast
LZSA2 and down to smaller resulting files using better ratio compression
methods such as Exomizer 3.x, APL, or LZMA-lzip.
The edrpack.* files in the bin/ subdirectory are currently selected from
the smallest method, which is LZMA-lzip. This can take several minutes
to depack on slow machines [2], which is the reason I added a progress
indicator that can be switched to one of several types using the
patchpro tool [3]. The orders of magnitude can be observed in the
results of the INICOMP_SPEED_TEST as well, which I posted in another
discussion [4] for several formats as run on our Debian Linux amd64
server running dosemu2 (no KVM). This ranges from 8ms per run (zerocomp)
to 488ms per run (LZMA-lzip), a factor of 61.
There is another branch of a single-file load, developed by Bernd
(Böckmann) [5]. This one works without the lDOS staged model, resulting
in a kernel that is only loadable as an EDR-DOS or FreeDOS kernel file
(similar to lDOS's drload), as well as limiting the choice of
compression. Only the zerocomp compression adapted from DR-DOS's
original kernel compression is currently available for this branch.
Lacking the lDOS stages overhead the kernel is smaller than the
uncompressed lDOS edrdos.* files, but not as small as the lDOS edrpack.*
files with better ratio compression. The kernel filename for this
"flavor" (as the repo's action artefacts call them [6]) is kernel.sys,
matching the FreeDOS kernel's conventional name.
> This is able to confirm that the
> JWASM port of the kernel is identical to a version made with
> another Assembler down to the single machine code instruction
> level, only leaving encoding differences without influence on
> semantics between the original and the JWASM port.
This is true, but ident86 has learned some additional tricks since.
An overview, as I still haven't written much of any documentation for
ident86:
The basic data item is a range of different bytes. Such a range is
typically bookended by at least 16 Bytes without a difference, both
before and after the range. The 16 Bytes length has been chosen because
a valid x86 instruction is at most 15 Bytes long and so always fits
within 16 Bytes.
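As an illustration only (this is not the code from ident86.py; the function name and the handling of the file ends are assumptions), collecting such ranges from two equally long inputs could be sketched like this:

```python
def find_ranges(a: bytes, b: bytes, gap: int = 16):
    """Return half-open (start, end) ranges of differing bytes.

    Differences separated by fewer than `gap` identical bytes are merged
    into one range, so every range is bookended by at least `gap` matching
    bytes (or by the start or end of the data).
    """
    n = min(len(a), len(b))  # only the common length is compared
    ranges = []
    start = None  # start of the range currently being collected
    last = None   # offset of the most recent differing byte
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i
            elif i - last - 1 >= gap:
                # Enough identical bytes in between: close the prior range.
                ranges.append((start, last + 1))
                start = i
            last = i
    if start is not None:
        ranges.append((start, last + 1))
    return ranges
```

Two differing bytes that are only a few bytes apart end up in the same range, while a difference further along starts a new one.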
When a range is being handled (in the function handlerange [7]) the
corresponding data from either file is fed to an lDebug instance running
in the background in a VM (qemu or dosemu2). Then this data is
disassembled. Instruction boundaries are found by referencing an
optional trace listing file, specified as the third file to ident86's
command line. The function disassemble [8] does some initial
postprocessing to drop irrelevant differences in the disassembly, such
as the MODRM keyword that indicates a certain operand encoding order,
expanding imms8 signed 8-bit immediates, and changing the
"segmented-address hexdump disassembly" format to a "file-seek length
disassembly" format.
Then the two disassemblies are compared. This usually proceeds line by
line, but sometimes multiple lines from one side may be matched to a
single line from the other side. Matching lines are hidden from display.
If an entire range is made up of matching lines, that is, if processing
the disassemblies ends up matching all lines from both of them,
then the "no difference" line is displayed. If a
difference is found, then the remaining disassembly lines (after any
possible leading matches) are displayed.
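A simplified sketch of that comparison follows. It assumes a strict one-to-one pairing of lines and leaves out the matching of several lines to one; the function name is made up for illustration and is not taken from ident86.py.

```python
def remaining_after_matches(lines1, lines2):
    """Drop the leading lines that match on both sides.

    Returns None when every line matched (the "no difference" case),
    otherwise the two lists of remaining lines that would be displayed.
    """
    i = 0
    while i < len(lines1) and i < len(lines2) and lines1[i] == lines2[i]:
        i += 1
    if i == len(lines1) and i == len(lines2):
        return None
    return lines1[i:], lines2[i:]
```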
Some magic starts to happen when this occurs and the -s switch (side by
side view) is specified. In the side by side view, disassembly lines
from file 1 are displayed at the beginning of a line while lines from
file 2 are displayed at an offset of at least 40 columns to the right of
the beginning of a line. One display line may contain disassembly lines
from both files, or from only one of them. ident86 will try
to sync up lines so disassemblies with the same starting address are paired.
If the address of a file 2 line matches the address of the paired file 1
line, then the file 2 line address is replaced by the keyword "same". If
all of the address and the length and the disassembly match, the file 2
line is replaced entirely by the text "samesame". If the disassemblies
compare as being similar in a fuzzy logic comparison, then the file 2
disassembly is marked with a comment reading "; fuzzysame".
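To make the layout more concrete, here is a rough sketch of such a side by side view. The tuple layout, the "address length text" line format, and the function names are assumptions for illustration, and the fuzzy comparison is left out; the actual display of ident86 may differ.

```python
def fmt(line):
    """Format an (address, length, text) tuple as one disassembly line."""
    address, length, text = line
    return "%04X %u %s" % (address, length, text)


def side_by_side(lines1, lines2, column=40):
    """Pair up lines by address; file 2 lines start at `column` or later."""
    out = []
    i = j = 0
    while i < len(lines1) or j < len(lines2):
        l1 = lines1[i] if i < len(lines1) else None
        l2 = lines2[j] if j < len(lines2) else None
        if l1 is not None and l2 is not None and l1[0] == l2[0]:
            left = fmt(l1)
            if l1 == l2:
                right = "samesame"  # address, length and text all match
            else:
                right = "same %u %s" % (l2[1], l2[2])  # only the address matches
            i += 1
            j += 1
        elif l2 is None or (l1 is not None and l1[0] < l2[0]):
            left, right = fmt(l1), ""  # line from file 1 only
            i += 1
        else:
            left, right = "", fmt(l2)  # line from file 2 only
            j += 1
        out.append((left.ljust(column - 1) + " " + right).rstrip())
    return out
```

Calling `side_by_side` with two lists of tuples returns the display lines as strings, one per row.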
The fuzzy logic is needed because we want to match lines that may have
slightly different addresses encoded into them (as immediate operands,
branch target operands, or address offsets) but encode the same meaning
of an instruction. These lines are usually uninteresting for
identicalising work, but may occur en masse if a later difference has a
differing length so that subsequent addresses (that may be referenced
from before that difference) are all shifted by a small number. The
first line that differs such that the line from file 2 is neither
"samesame" nor "fuzzysame" is marked as the earliest definitive
difference. This comes into play next.
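One possible way to express such a fuzzy comparison is sketched below. It assumes that numbers in the disassembly text are hexadecimal and start with a digit (optionally ending in "h"), and it uses a made-up tolerance of 16 for the shift; the real logic in ident86 is more involved and the names here are hypothetical.

```python
import re

# Assumption: numbers are hexadecimal, start with a digit, may end in "h".
_NUMBER = re.compile(r"\b[0-9][0-9A-Fa-f]*[hH]?\b")


def _split(text):
    """Return the text with numbers replaced by N, plus the numbers found."""
    numbers = []

    def replace(match):
        numbers.append(int(match.group(0).rstrip("hH"), 16))
        return "N"

    return _NUMBER.sub(replace, text), numbers


def fuzzy_same(text1, text2, max_shift=16):
    """True if the texts differ at most by slightly shifted numbers."""
    skeleton1, numbers1 = _split(text1)
    skeleton2, numbers2 = _split(text2)
    if skeleton1 != skeleton2:
        return False
    return all(abs(x - y) <= max_shift for x, y in zip(numbers1, numbers2))


def earliest_definitive_difference(pairs):
    """Index of the first (text1, text2) pair that is not even fuzzy same."""
    for index, (text1, text2) in enumerate(pairs):
        if not fuzzy_same(text1, text2):
            return index
    return None
```

Identical texts trivially pass the fuzzy comparison, so they need no separate check in the loop.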
ident86 ships with some logic to inspect an earliest difference to
figure out what change needs to be made to undo the difference. This
recognises differing length but (fuzzy) matching disassembly texts where
the shorter instruction is not followed by enough NOP instructions to
level the length difference, as well as missing or differing segment
override prefixes. ident86 will display a hint as to what change is needed.
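The length check can be pictured roughly as follows. This is a sketch under assumptions: it only considers the single-byte NOP (opcode 90h) as padding, and the function name and parameters are invented for illustration rather than taken from ident86.py.

```python
def needs_length_hint(length1, length2, after_shorter: bytes) -> bool:
    """Decide whether a length difference is left unlevelled.

    `after_shorter` holds the bytes that follow the shorter instruction.
    A hint is needed if the lengths differ and the shorter instruction is
    not followed by enough NOP bytes to make up the difference.
    """
    difference = abs(length1 - length2)
    if difference == 0:
        return False
    padding = after_shorter[:difference]
    levelled = len(padding) == difference and all(b == 0x90 for b in padding)
    return not levelled
```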
When the -e switch and either or both of the -S and -E switches are used, and a
trace listing file is passed as the third filename, then the address of
the first hint displayed is crossreferenced with the trace listing to
obtain the source text file that needs to be edited. (The -p switch can
be specified once or multiple times to give regular expression patterns
that convert the "trace listing source" filenames to source text
filenames.) The -S switch will make it so the relevant part of the
source is displayed. With the -E switch specified, ident86 will go ahead
and edit the source appropriately. (This assumes that the trace listing
file and the source text are in NASM format.)
If the -r and -b switches are both specified along with -e and -E, then after
an edit is done, ident86 will loop back to its beginning, re-build the
binary to be identicalised using the scriptlet specified with the -b
switch, then re-start the comparison of both files. This allows it to
automatically apply several edits in a row to aid in identicalising the
source text.
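The control flow of such a run can be sketched as a simple loop. The three callables and the round limit are placeholders for illustration; they stand in for the comparison, the source edit, and the build scriptlet, and are not the actual structure of ident86.py.

```python
def identicalise(compare, edit, rebuild, max_rounds=100):
    """Repeat compare, edit, rebuild until no more hints turn up.

    compare() returns a hint or None, edit(hint) changes the source,
    rebuild() re-creates the binary. Returns the number of edits applied.
    """
    for applied in range(max_rounds):
        hint = compare()
        if hint is None:
            return applied
        edit(hint)
        rebuild()
    raise RuntimeError("gave up after %u rounds" % max_rounds)
```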
The -c switch allows specifying a cookie file, which will be used in
subsequent runs to skip all bytes before the prior run's earliest
definitive difference. These bytes must have resulted in only samesame or
fuzzysame lines, so it is assumed that they would still match. It is
expected that after an -e -E -b -r -c run, the full resulting binary is
checked by another run without -c or -m (minimum offset to examine).
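The idea behind the cookie can be shown with a tiny sketch. The file format used here (a single hexadecimal offset) and both function names are assumptions for illustration; the cookie file written by ident86 may well look different.

```python
def load_cookie(path):
    """Return the offset saved by a prior run, or 0 if there is none."""
    try:
        with open(path) as f:
            return int(f.read().strip(), 16)
    except (FileNotFoundError, ValueError):
        return 0


def save_cookie(path, offset):
    """Remember the offset of the earliest definitive difference."""
    with open(path, "w") as f:
        f.write("%X\n" % offset)
```

A following run would then start its comparison at the loaded offset instead of at the beginning of the files.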
I'm using ident86, along with the fixmem script and associated NASM
macros, to port several programs to NASM. You may want to watch my blog
[9] to learn more about this.
> Creating a
> byte for byte identical version (which a binary checksum would
> be able to confirm) would have required manually enforcing the
> choice of encodings, which does not make code nicer, I think.
Yes, this is true. Especially as some assemblers, including my preferred
target of NASM (the Netwide Assembler), lack a way to specify which
register operand is to be encoded as the ModR/M operand. My debugger
now allows disassembling and assembling with a MODRM keyword [10] to
depict or enforce a particular order.
So to encode a non-default order of operands, NASM requires using db
(Define Data Bytes) directives to directly emit machine code bytes
rather than assembling mnemonic instructions. This is obviously not
desirable if an exacting byte-by-byte match is not required, hence
ident86. ident86 automates several tasks I used to carry out manually
when working to identicalise ports of assembly language programs.
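To show what such an encoding difference looks like, the following sketch decodes the two possible machine code forms of a 16-bit register-to-register MOV. It is a minimal decoder written for illustration only (the function name is made up, and only opcodes 89h and 8Bh with a register ModR/M operand are handled).

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def decode_mov_reg16(code: bytes) -> str:
    """Decode a two-byte register-to-register 16-bit MOV.

    Opcode 89h is MOV r/m16, r16 and opcode 8Bh is MOV r16, r/m16.
    Only ModR/M bytes with mod = 11b (register operand) are handled.
    """
    opcode, modrm = code
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3 or opcode not in (0x89, 0x8B):
        raise ValueError("unsupported encoding")
    if opcode == 0x89:
        dest, src = REG16[rm], REG16[reg]  # ModR/M operand is the destination
    else:
        dest, src = REG16[reg], REG16[rm]  # ModR/M operand is the source
    return f"mov {dest}, {src}"
```

Both `decode_mov_reg16(bytes([0x89, 0xD8]))` and `decode_mov_reg16(bytes([0x8B, 0xC3]))` give `mov ax, bx`, so the two byte sequences differ in encoding but not in meaning.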
> I guess there are several technology news bits in this thread
> which can be interesting for us :-)
Regards,
ecm
[1]: https://pushbx.org/ecm/download/old/edrdos/
[2]: https://github.com/SvarDOS/edrdos/issues/69#issuecomment-2244834392
[3]: https://pushbx.org/ecm/download/old/patchini/
[4]: https://github.com/SvarDOS/edrdos/issues/28#issuecomment-2248801739
[5]: https://github.com/boeckmann/
[6]: https://github.com/SvarDOS/edrdos/actions/
[7]: https://hg.pushbx.org/ecm/ident86/file/9acc3fd2c289/ident86.py#l1094
[8]: https://hg.pushbx.org/ecm/ident86/file/9acc3fd2c289/ident86.py#l345
[9]: https://pushbx.org/ecm/dokuwiki/blog/pushbx
[10]: https://pushbx.org/ecm/doc/ldebug.htm#asmref-nasm