[plex86] Preprescanning

Willow Schlanger Fri, 25 May 2001 04:32:50 -0700

NOTE: I write very poorly when my sleep schedule is all messed up. It's,
like, 2:40 AM here so my grammer will be a tad queer here. Also this is
a tad off topic; it has to do with a subproject of plex86 (and an
unofficial one, e.g. it will not be integrated into the main plex86
sources until _after_ it's working and _if_ Kevin agrees to it as a
plugin in the main distro).

Wouldn't it be nifty if all Plex86 guests communicated via TCP/IP? Then
one guest could be running on the same PC you're using; other guests
could be running on another PC; still others could be running on a PC
using MultiPlex86, a miniture Linux kernel that uses only legacy
hardware, has no display or keyboard support (in the kernel), but
provides just enough to run Plex86. Thus the plex86 client on that PC
can do hardware I/O that would normally be reserved to the Linux kernel.
That hardware I/O would be noted, but would "fall thru" to the real
hardware unless its a certain device that must be emulated (e.g. we
can't let it know about the actual network card since we're using it;
unless of course it knows about an emulated version of the card).

---
 WARNING THIS IS OFF TOPIC AND RAMBLING BUT TECHNICAL AND PERHAPS YOU'LL
FIND IT INTERSTING

The approach of Plex86 has been to look at code as it executes and to
convert or emulate the things that can't run natively and to run (most
of) the rest natively.

Wouldn't it be nifty if we could provide some kind of "hint mechanism"
that would take advantage of the target architecture.

E.g. take a Windows EXE program. It's fairly predictable: we can scan it
_before_ executing it, considering all routes. Only hardware I/O or
advanced things would be decided at the time what to do.

If we mapped a Windows EXE out like that, then we could find out which
parts of the code refers to other parts of the code (like Sourcer does).

Then we could actually "staticly translate" it so that we'd _see_ what's
going to run at CPL 0 and _translate_ it, in advance, to code that we
can actually run at CPL 0 (_after_ translation).

Thus if a CPL 0 code reloads CR3, it could be replaced with a call to
the WriteCR3 handler. The length of the call instruction could even be
longer than the MOV CR3,EAX or whatever instruction. We'd just adjust
all relative pointers that were detected.

Another idea / application / example.

We could also make it so that calls to the Windows kernel could be
"trapped" similiar to the way Plex86 traps "bad" instructions. Those
calls could, if the MultiPlex86 slave PC were runnig Windows, then cause
the slave PC, which _actually_is_ running Windows, to make a
corresponding call. The Slave PC would be running Windows
_inside_Plex86_.

The changes the slave PC makes to hardware would be noted by the slave
PC but not told to the guest / host PC. Only the _end_effect_ ("output")
of a Windows kernel function would be told to the host PC. That PC would
then simulate those effects.

IMHO, this is a way to clone Windows, module by module, without
violating its license agreement, and without separating its components.

For we could compare the changes made by a Windows function with the
changes Wine would make (Wine would be running in a second guest on the
host PC) and record, to a log file, any discrepencies (excluding things
that have to do with the internal operation of Windows).

To do this we'd map out the windows kernel functions and emulate
executing them, like a web crawler. If there were a

MOV EAX,5

instruction that at that place we'd store the _old_ value of EAX so that
we can walk BACKWORDS, undoing the modification.

Each register's bit would contain a pointer to its source.

All instructions would have 2 inputs: stdin, auxin and one output:
stdout.

For MOV EAX,5, stdin would be 5 and stdout would be EAX. There would be
no auxin.

Instructions can input flags, either CF, the low 8 bits of flags, or
other masks of flags (e.g. see 'POPF').

Now for ADD EAX,5, EAX would be the stdin because it's the _main_ input.
5 is the auxin because its auxillary; it MODIFIES ADD EAX but does not
fundamentally change what's being done. EAX is also the stdout.

Thus we'd make a linked list of these "operator notes" which would have
a pointer to an operator, the flag info, and the 3 pointers for stdin,
stdout, and auxin.

For rotations, auxin would be the rotation count.

Now the "operator nodes" would point to eachother: every input that is
not null, points to an output.

Thus:

a:
DEC EAX
JZ b
mov EAX,5
b:
INC EAX
INC EAX

consider the above code.

We wish to figure out what it does without having a human disassemble
it.

Have the machine do it for you.

First, consider doing this: emulating all possible courses.

Start off with EAX = unknown. Then we hit

a:      ; I will use ::EAX to refer to "what EAX was at a"
DEC EAX   ; EAX = ::EAX - 1
JZ b

We will assume that the JZ is not taken. This is possible because the
range of ::EAX is 0..ffffffff. Any range, plus or minus a number, stays
the unknown if it began unknown. Thus after subtracting one, the range
of ::EAX (the possible values it can have) is the same.

Now we hit the 'JZ b'. What we decide there, is if it is possible given
the current state of things, that EAX is zero. We know (our program
knows), in scanning the code, that ZF reflects the zeroness of EAX. And
it knows that EAX is ::EAX - 1, and that ::EAX has a range of 0..fffffff
(in the program it's stored like this: each _bit_ of eax points to a
place (the output of something) unless it can be anything at all, in
which case it's NULL, and a NULL pointer means "not restricted to a
certain state (0 or 1) at the state being conisdered"); also store a
range for _before_ the mask.

Thus we have done:

DEC EAX
JZ b

And we now put a restriction on what EAX can be. Now we know that EAX is
1..ffffffff.

Next we hit this:

mov EAX,5
INC EAX
INC EAX

And so there is an algorithm describing the operation of the function
(if it does hardware I/O, that I/O is considered simply to be changing
"registers" of that hardware. If that hardware does something in the
background, ficticious registers reflect that fact).

Thus we have a program that walks thru the code,

DEC EAX
JZ b
mov EAX,5
INC EAX
INC EAX

And we have the table entry made:

if ::EAX-1 != 0 then
 EAX = 7

Now we simplefy it doing arithmatic. ::EAX - 1 != 0 can be stated ::EAX
!= 1.

Input EAX       0,2..ffffffff
Output EAX      5

Thus we have reverse engineered a function. Before hand we have mapped
the function to find out where we'll be executing etc. We also know what
outputs of the function are actually used. FOr this example, let's say
that we've determined that the only thing this function can change that
actually gets used is EAX. Thus we don't worry about the flags that are
output.

BUT we're not done! The Input EAX range is 0..ffffffff and we have
covered only 0,2..ffffffff! We need still to do 1.

How do we do that? Well, as we were walking through the code, we were
"taking notes" on the way things were, so we can undo things. Remember
that operator map (linked lists)? Well when we walk up (back-up), we
readjust that, including restrictions.

Look at this:

a:
DEC EAX
JZ b
mov EAX,5
b:
INC EAX
INC EAX
d:

We're now at 'd'. We know that the sflags (except CF which is ::CF) are
the output of the last INC EAX, and we have EAX linked to be 7.

When we back up, past the INC EAX, we "undo" the sflags and decrement
EAX so it's now 6.

Now we're at b.

Go up still, and we're at the MOV EAX,5.

The node corresponding to that (the 'cluster') _stores_ the pointer for
all the bits of EAX that are overwritten (destroyed). Thus we walk back
up past the MOV EAX,5 so that EAX is now ::EAX - 1. And the restrictions
on EAX are: it can't be zero. The sflags (except CF) are now the output
of the DEC EAX.

Now if we encounter "INC EAX / INC EAX" we store in the operator linked
list "+ 1 + 1" but that's only done so that we can back up. EAX's state
is simplefied, so that its value would be, + 2.

Anyways.

Now we're at the 'JZ b' instruction.

That instruction stores the restrictions on EAX (what they were as we
were walking down the former time, when we there). Thus at the JZ b the
restirctions EAX _are_, are, that EAX is nonzero. But when we walk up,
we "undo" the restrictions (first).

Thus the restrictions become what they were, which is, 0..ffffffff.

Now we will continue walking up to the top, and then we're done, but we
are CRAWLING. Thus we first must expire all possibilities.

Thus before walking up to the first 'DEC EAX', we now switch to
walk-down mode! Why? Because the 'JZ b' does contains a pointer to the
next node, and yes, it's been taken, indicating we can walk but, EXCEPT
it also has a "branch" next pointer. It's a second program counter
target. _That_ one exists but hasn't been expired (taken).

So we now take the JZ b.

In taking it, we place these restrictions on EAX: it is zero. And EAX's
linked list, if you traveerse it and print it would, is ::EAX - 1.

If we were to simplefy it (don't simplefy it until you hit the end,
however), then we'd see that ::EAX - 1 = 0 and thus, ::EAX = 0. But we
don't know that yet; we just know that EAX is 0.

And thus we're now at 'b'.

EAX becomes 2.

Now we're at the end point, so we simplefy and add an entry to our table
of the outputs that get used (the flags don't get used, let's say; if
they did, we'd note the output flags too):

Input EAX       0,2..ffffffff   1
Output EAX      5               2

Next we walk up, doing the same thing, until we hit the starting point,
or we encounter an untaken node target (a branch). But the whole thing's
expired, so we'll hit the starting point! When you make it there, you're
done traversing or recrusiving or whatever.

And behold! That little table can be used as documentation to clone the
function.

Why clone a function? Because IMHO it can optimize things a hella lot,
and simultaniously promote interoperability. If the function used
Windows-specific things, we can have a compiler-like program convert the
table to a Linux source file, and there's no trace of where it came
from.

Although you could only do that with the outermost Windows kernel
functions.

Now you'd do the same thing with the inner Windows kernel functions,
too, but you would do that only to make a database of tables that can
then be used to clone the highest level functions.

Now the compiler - like program, which would take a table as input and
output C source code, would make references to functions/macros
_specific_ to the target architecture.

Thus, this would be done for _interoperability_only_.

If you want to clone Windows itself, you'd have to make a _superset_ OS
(like NT, in principle) what would have the ouput source code used only
for interoperability.

The making of the tables could benifiet from Plex86.

MultiPlex86 would be needed only for the lowest level function; namely,
things that interact directly with hardware I/O.

It had not occurred to me to begin making tables at the highest level
and then work down, going down as much as necessary.

SO! Want to know something cool?

The IBM BIOS will have functions that input, say, AH = 1, 2, or 3.

And they'll do this:

IBMFunction:
DEC AH
JZ IBMHandlerForFunction1
DEC AH
JZ IBMHandlerForFunction2
DEC AH
JZ IBMHandlerForFunction3
RET

Know what the Award clone'd funciton will do? Something like this:

AwardFunction:
CMP AH,1
JZ AwardHandlerForFunction1
CMP AH,2
JZ AwardHandlerForFunction2
CMP AH,3
JZ AwardHandlerForFunction3
SUB AH,3
RET

That's not something that they actually do but it is the same IDEA. E.g.
I saw something similiar to that.

Nifty, huh? SO if you input 4, to the IBM function (an invalid value)
you'll get out 1. And 5 will output 2.

Award could have called the function with AH = 1, 2, 3, and they know
what it does, and then said, "what if I call it with 4?"

And they'd get get 2.

Then they'd try all 256 values (actually, just enough to convince them
it's a linear function, y = x - 3) and then they'd have a table, like
so:

InputAH         0       4       5       6
OutputAH        -3      1       2       3

And so on, then they'd make a hypothesis (here, let's say they're
analysing invalid function numbers to see what the output is):

InputAH         N
OutputAH        N - 3

Thus, having simplefied the table, they'd then write their function,
first handling things as expected and then, if it's invalid, subtracting
3.

Then they _own_the_copyright_ for a 100% compatible function.

To be truly 100% compatible it'd have to be 100% identical (what about
if someone reads a byte of the ROM CODE and compares it with the
expected byte?) But that is 101% compatible not 100%, because _that_
would be incompatible with future IBM ROM BIOS versions!

Thus 100% compatbile is a tad loosely defined. It means, 'within
reason'.

To be truly 100% functionality compatible, they'd have to think about
the fact that CF is not the same on output. It falls thru for IBM's but
is the output of the SUB for Awards.

I've actually seen code like the above. It's nifty to compare clones
with the originals to think of a more systematic (e.g. a cheaper way
that doesn't involve so many engineers).

I think Award used a mapping program, though, and maybe it gave them
some hints, but in a clean room you need to be able to prove beyond a
reasonable dobt that you used something comparable to a true clean room.

How can we, in developing Plex86, promote interoperability between
technologies? I think 'Instrumentation' is a good start but I think we
should add some of these ideas.

Note that the program outputs a table for the IBM function and the award
function, automatically! No calling necessary! And we can't use the
target addresses of the jumps (don't tell any "humans". keep it private,
so the _PROGRAM_ knows where to go next, but not the _human_).

But if this is a public function we can use the table. We'd get this:

IBMFunction
::AH    *
AH      ::AH - 3

Along with info on the flags.

Thus we'd get the same table, _automatically_.

Some tables must be kept private, like I said, and used by the program
only. E.g. they can't be told to the cleanroom guys.

But some things, under supervision, could be input into the program that
generates C source code fragments.

It'd be:

        switch(AH) {
        case 1:
                Function2130841();
                break;
        case 2:
                Function2130842();
                break;
        case 3:
                Function2130843();
                break;
        default:
                AH -= 3;
        }

and NOT this:

        if(!--AH)
                Function2130841();
        else
        if(!--AH)
                Function2130842();
        else
        if(!--AH)
                Function2130843();

Some code can be used as-is after being approved. Some code WILL BE
WRITTEN, even without telling the cleanroom guys about it, because, to
get from "A to C" they'll realize they need a "C" and we'll say, "You're
Hot!" (meaning they're on the right track) and they'll develop a "B" on
their own (an internal thing). Maybe the thing being cloned used "B"
between A and C, maybe it used X. Doesn't matter.

The second output is an OPERATIONAL clone. The first, is a functionality
clone. The second is thus a decompiled version and the first, a
table-output-due-to-mapping version.

The first is the kind that can be used as-is (if it is for a public
function). The cleanroom guys will build up an architecture and the
"dirty guys" (who operate the program described above) would follow what
the cleanroom's architecture is and give their hot and cold responses
based on how things are developing.

Multiple cleanrooms could be used. Multiple levels could be used too (an
intermediate cleanroom that might closely resemble the source
architecutre. It'd _work_ and be good for internal use; e.g. study it to
make a plan for the next level).

Thus the above is just some thoughts I've got. It's where I'm headed. I
want to make LGPL tools to promote interoperability between Linux and
other systems. I hope that the tools will be so good that if Windows
isn't cloned, it's not my fault (for not making all the tools that
someone needs).

BTW sometimes I see IBM doing something like an 'AND' where AMI and
Award do a MOV; that has the same effect. Some fragments of code for AMI
and award are identical.

I classify BIOSes thus:
1. IBM BIOS
2. Award, AMI BIOS
3. PhoenixBIOS

Anyone know which of those are merged into one company?

If we make a 100% compatible LGPL BIOS, would anyone use it (e.g. PC
mfr's)?

The PhoenixBIOS is very different from the others.

All clone BIOSes are "piecemealed" together. Rather than having
different source files linked together, it looks like they use DOS COPY
or something, to copy all the modules together, and then fixup things
via jumps.

Thus it looks like they do,

CALL 5000h

to call a function in another source file. Then 5000h will have a JMP
instruction that jumps to wherever that function ends up.

Also Award uses a stupid assembler. it doesn't know that PUSH 0 can be
encoded as a sign extended byte. Weird, huh?

IMHO, I believe someone has developed what I have described above and I
belive Microsoft uses it or something similiar. It fits the facts
perfectly. In some cases MS buys companies. They bought Lattice, and
renamed their compiler Microsoft C; but: they bought a DOS clone; MS
loves COMPAQ who cloned an IBM BIOS... Pauld Allen (co-founder of MS)
developed an emulator for the Amiga or for something.... seem fishy?
Someone knows a LOT about cloning.

Now COMPAQ probably used primative reverse engineering, with two
separate, large, well-paid teams. They probably disassembled things. But
this method could enhance cost effectiveness a lot even for the
primative COMPAQ method.

Because it reduces the amount of machine code that humans need to
disassemble, even if you are to do that.

Windows can be disassembled but it'd be cleaner to NOT disassemble it
and ONLY used the better method I described above.

In addition to promoting interoperability we could also privide for a
new architecture, something that maps out a program, even a Mac or
Windows program, _BEFORE_ you run it, and then converts it to executable
code. It'd even handle self modified code as conditionals.

The tables can be conveted to pseudo-code too, which is probably
preferable.

Later, all.... I'm hoping for feedback on these ideas.. feel free to
pass this on to other newsgroups / people who might be interested...
I've implemented only the first version of what i've described here.

What my program does (that I actually wrote) is map out code based on an
entry point.

It stops when it hits an end of line.

It doesn't back up like I described above because what was described
above would be the _Second_ generation.

Instead it rescans the entire thing from the entrypoint (I have one
cluster for each opcode; beforehand I convert to clusters, considering
each byte a canadite for an entrypoint, noting its lenght, targets,
adjusting its relative targets, tec.)

Anyway when it hits an untaken branch it takes that route.

It assumes RETs are matched with CALLs, always where they aren't you
have to manually help it out.

When its done you have to look at all the end points.

Some of them will be for, for example,

JMP [1234+BX].

Now when its done, each cluster contains a linked list of all
instructions that can transfer to it. Thus blocks of code that have no
entrypoint in them can be grouped together and commented as pseudo code.
You have to manually see where flags get used or overwritten... my
current generation is probably  as clumbsy as compaq's.

Anyway for the JMP [1234+BX] you have to manually look at the
disassembled code, thus:

.
.
.
Line1: CMP BX,7
Line2: JAE somewhere_over_the_rainbow
Line3: ADD BX,BX
Line4: JMP [1234+BX]

Now when disassembling, you have to group things according to where
there are entrypoints. Since we see that, so far, there are no
entrypoints to Line3 or Line4, we know that the restrictions on BX is
that it is equal to twice a number that is between 0 and 6 inclusive.
Thus we have to manually insert code in the source that makes it so that
when we run the program again, and it uses the data from the last run,
when it hits that "endpoint" the special code says, "branch to
[1234+2*0]" unless that's been branched to, then to "branch to
[1234+2*1]", and so on.

Doing this method I finally mapped out a VGA BIOS (not the Elpin one,
don't worry) completely.

But then what? Ta hell if I'm gunna MANUALLY walk thru that shit and
take notes. I'd do it if I were _paid_ to, but I don't need to know
precisely how a VGA BIOS works; I just want to perfect the _TOOLS_ to
figure out how.

Thus the current generation is VERY clumbsy. I wrote that part of it in
a few days, maybe two days.

But there's another part to it. It uses a table of opcodes that gets
expanded from a script file I wrote that defines the x86 non-FP,MMX,etc.
instruction set, not in terms of ins, outs, axioms, operators, but just
in terms of machine code, instruction name, assembly language form.

I can thus make disassembler, and crude assemblers, automatically.

I posted the source for the script part as an attatchment to an earlier
message.

No one seemed to care about it.
[plex86] Preprescanning

Reply via email to