Hi list,
On 2022-07-12 at 16:29 +0000, Bret Johnson wrote:
> For TSRs, there are additional things you can do to reduce memory. You can
> look at the source code for my PRTSCR program (available at
> http://bretjohnson.us) that uses a BUNCH of tricks. For example, it doesn't
> even use the DOS TSR interrupt.
> The way it works is to make a "copy" of itself at the top of conventional
> memory, terminate itself (using a normal DOS terminate process, which
> includes deleting the original PSP), and then continue running from the
> "copy". The "copy" decides where the best place in memory is to load the
> TSR (which can even be in one or more "memory holes" left by some other
> program or in upper memory), allocates appropriate memory block(s), and
> then installs itself in the allocated memory. I learned that technique
> from ECM a long time ago. It's much more complicated than a "normal" TSR
> installation, but is much more efficient in terms of ultimate memory use.
I developed this optimal installation handling first for RxANSI, based
on Henrik Haftmann's ANSI, which I forked starting in 2008 [1]. I
eventually adapted it into the TSR example [2] and then also for some
other TSRs such as lClock [3].
Note that the resident block installed this way doesn't have a PSP at
all, just an MCB with itself as the owner and a DOS v4+ style MCB name
for MEM-type programs.
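To make that layout concrete, here is a small Python sketch that packs such a self-owned MCB. It uses the usual DOS 4+ MCB fields (type letter, owner segment, size in paragraphs, 3 unused bytes, 8-byte name); the segment, size and name values are hypothetical, and this is only an illustration, not code taken from RxANSI or the TSR example.

```python
import struct

def make_mcb(block_segment: int, size_paras: int, name: str, last: bool = False) -> bytes:
    """Build a 16-byte DOS 4+ style MCB for a block that owns itself.

    The MCB lives one paragraph below block_segment. Setting the owner
    field to block_segment (instead of a PSP segment) marks the block
    as belonging to itself.
    """
    kind = b"Z" if last else b"M"                        # 'Z' marks the last MCB in the chain
    raw_name = name.encode("ascii")[:8].ljust(8, b"\0")  # NUL-padded if shorter than 8
    return struct.pack("<cHH3s8s", kind, block_segment, size_paras,
                       b"\0" * 3, raw_name)
```

For example, make_mcb(0x1234, 0x20, "TSR") gives an MCB whose owner word equals the block's own segment, which is what lets MEM show the name even though no PSP exists.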
Other notes on your TSR and application:
Your GRAFTABL uses interrupt 21h service 31h to stay resident. You do
free the environment, but to be on the safe side it is better to also
clear the process's environment field to zero. Example code is in my TSR
example [4]:
xor ax, ax
xchg ax, word [ cs:2Ch ] ; get environment segment, set PSP field to zero
mov es, ax
mov ah, 49h
int 21h ; Free our environment
Further, you calculate the size of the PSP block to keep resident by
using test, add, and shifts:
mov dx, trans ; Allow everything preceding the
test dx, 0x000F ; transient portion of GRAFTABL to
jz .20 ; remain resident. (Rounded up to
add dx, 0x10 ; the next paragraph, because MS-DOS
.20: mov cl, 4 ; wants the size in paragraphs.)
shr dx, cl
mov ax, 0x3100
int 0x21 ; TSR EXIT CODE 0
It is more efficient to change this like so:
mov dx, trans
add dx, 15
mov cl, 4
shr dx, cl
The "add dx, 15" makes the shr round up in its division. (I expect that
the addition won't overflow here.) However, you can easily change it so
that the addition is done by the assembler at build time, using "mov dx,
trans + 15".
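As a quick sanity check, this Python sketch models both sequences on a 16-bit DX (the masking mimics register wrap-around) and compares them. It assumes, as in a .COM-style image, that the trans offset already counts from the start of the PSP segment.

```python
def paras_test_add(size: int) -> int:
    """Original sequence: test dx, 0Fh / jz / add dx, 10h, then shr dx, 4."""
    dx = size & 0xFFFF
    if dx & 0x000F:                 # low nibble set: not paragraph-aligned
        dx = (dx + 0x10) & 0xFFFF   # add dx, 0x10 (16-bit wrap)
    return dx >> 4                  # mov cl, 4 / shr dx, cl

def paras_add15(size: int) -> int:
    """Shorter sequence: add dx, 15 then shr dx, 4."""
    dx = (size + 15) & 0xFFFF       # add dx, 15 (16-bit wrap)
    return dx >> 4

# Both round up to whole paragraphs, e.g. 101h bytes -> 11h paragraphs.
assert all(paras_test_add(n) == paras_add15(n) for n in range(0x10000))
```

Checking every 16-bit input shows the two agree everywhere, so the shorter form changes nothing but size and speed.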
Moreover, while it is not supported in the most obvious way (like "mov
dx, (trans + 15) / 16"), you can teach NASM to do the entire calculation
(shift or division and all) by calculating a scalar length of the
program (as opposed to a relocatable symbol like the one created by your
"trans:" label; NASM only applies division and shift operators to scalar
values). There's an example of this in the DVORAK TSR (under GNU GPL
v2+) by Donald Bindner [5] that goes as follows:
even 16
end_of_resident equ ($-$$ + 0100h)
...
mov dx, end_of_resident/16 ; number of paragraphs to keep
mov ax, 3100h ; terminate resident w/ 0 return code
int 21h
As you can see, NASM allows dividing the scalar value. (Rounding up is
not needed in this particular calculation because the source already
aligns the position of the end_of_resident equate to a 16-byte boundary.)
Here is another example [6], in FDAPM (by Eric Auer), which I extended
with some but not all of my TSR ideas:
mov dx, (eofTSR - start + 256 + 15) >> 4
; +256 for PSP, start is at offset 100h
mov ax,3101h ; go TSR, errorlevel 1
int 21h
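Both build-time formulas end up with the same paragraph count. This Python sketch redoes the arithmetic that NASM performs on the scalar values; resident_len stands for the resident length measured from the .COM entry point at offset 100h, and the helper names are made up for illustration.

```python
PSP_SIZE = 0x100  # .COM images start at offset 100h, right after the PSP

def align_up(value: int, boundary: int) -> int:
    """Round value up to a multiple of boundary (the effect of 16-byte alignment padding)."""
    return (value + boundary - 1) // boundary * boundary

def dvorak_style(resident_len: int) -> int:
    # Align the end to 16 bytes, then end_of_resident = ($-$$ + 0100h)
    end_of_resident = align_up(resident_len, 16) + PSP_SIZE
    return end_of_resident // 16     # exact: already a multiple of 16

def fdapm_style(resident_len: int) -> int:
    # (eofTSR - start + 256 + 15) >> 4
    return (resident_len + PSP_SIZE + 15) >> 4
```

Because 256 is itself a multiple of 16, adding the PSP size before or after rounding does not change the result.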
In my TSRs and other applications I use more sophisticated calculations,
some much more so. These are often based on different sections or
segments of the program. I generally calculate deltas to address
different parts and hardcode resulting values into the program at build
time. However, calculating the number of paragraphs at run time in the
suboptimal way is a very common oversight.
Another problem (which is also little known) is that your use of
interrupt 21h service 31h will retain all of your currently open process
handles, as well as the system-wide System File Table (SFT) entries
associated with these. This is not a problem by default because all your
handles will be DUPlicated from the parent's, for your stdin, stdout,
stderr, stdaux, and stdprn. That means they will share the same SFT
entries as already used by the shell. However, if the user runs your
program with output redirection (either to a file or a character device
such as NUL), as in "graftabl > nul", then you will leak the SFT entry
which was reserved for your process to use.
I assume that the people involved in the design of this DOS service
expected that TSRs would generally want to keep around their PSPs, so
that they could swap processes and then use their own handles as
preserved in their process handle tables. However, in practice most TSRs
never re-use their PSPs after the DOS TSR termination handling is done.
So when the PSP is not re-used, as in your application, you should
explicitly free all handles before terminating.
Here's how I solved it [7] in FDAPM (and with equivalent code in FreeDOS
SHARE):
xor bx, bx ; = 0
mov cx, word [32h] ; get amount of handles
.loop:
mov ah, 3Eh
int 21h ; close it
inc bx ; next handle
loop .loop ; loop for all process handles -->
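To show why the loop matters, here is a toy Python model of SFT reference counting for "graftabl > nul". It is a hypothetical simplification, not DOS's real data layout: the SFT indices are made up, and the shell's redirection is reduced to "open NUL, hand it to the child as stdout, close it afterwards".

```python
UNUSED = 0xFF  # a free slot in a process handle table (JFT)


class ToyDos:
    """Hypothetical toy model of SFT reference counting, not DOS's real layout."""

    def __init__(self):
        self.refs = {}  # SFT entry index -> number of handles referring to it

    def open(self, jft, handle, sft):
        jft[handle] = sft  # a fresh open claims a new SFT entry
        self.refs[sft] = 1

    def dup(self, jft, handle, sft):
        jft[handle] = sft  # duplicated/inherited handles share the entry
        self.refs[sft] += 1

    def close(self, jft, handle):
        """Like int 21h function 3Eh: drop one reference to the SFT entry."""
        sft = jft[handle]
        if sft == UNUSED:
            return False  # real DOS returns an "invalid handle" error here
        jft[handle] = UNUSED
        self.refs[sft] -= 1
        if self.refs[sft] == 0:
            del self.refs[sft]  # no references left: the entry is free again
        return True


def close_all_handles(dos, jft):
    """Mirrors the FDAPM loop: close handles 0 .. count-1, ignoring errors."""
    for bx in range(len(jft)):  # cx = word [32h], the handle count
        dos.close(jft, bx)


def sft_entries_in_use(close_first):
    """Run 'graftabl > nul' in the toy model and report the busy SFT entries."""
    dos = ToyDos()
    con, aux, prn, nul = 0, 1, 2, 3  # hypothetical SFT entry indices
    shell = [UNUSED] * 20
    dos.open(shell, 0, con)
    dos.dup(shell, 1, con)
    dos.dup(shell, 2, con)
    dos.open(shell, 3, aux)
    dos.open(shell, 4, prn)
    dos.open(shell, 5, nul)  # the shell opens NUL for the redirection
    child = [UNUSED] * 20  # 20 is the usual default handle count
    for handle, sft in enumerate((con, nul, con, aux, prn)):
        dos.dup(child, handle, sft)  # inherited at EXEC, stdout -> NUL
    if close_first:
        close_all_handles(dos, child)
    # The child now goes resident (int 21h, ah=31h) and its handle table
    # stays allocated. The shell undoes the redirection by closing NUL.
    dos.close(shell, 5)
    return set(dos.refs)
```

Without the close loop the NUL entry stays referenced by the resident PSP's handle table forever; with it, the entry is released as soon as the shell closes its own reference.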
Further, you can certainly re-use the "zero page" (PSP) space starting
at offset 80h, which holds the command line tail by default but is also
the default DTA for your process. That latter fact is a big hint that
DOS doesn't need this buffer to be preserved. Even more space at the
tail of the PSP can be re-used. DOS enforces a minimum resident size for
the service 31h PSP allocation of 60h bytes [8], and the only known use
of the space after that is for the default unopened FCBs [9]. Even the
additional 16 bytes down to 50h are probably fine to be overwritten.
Some more notes on your TSR:
In your executable entrypoint you have a near jump to skip the buffer
later used for the table data:
entry: jmp trans
db 1021 dup 0x00
It is implied by the size of the buffer that the jump must be near, so
it takes 3 bytes, and then you add another 1021 bytes to get a total of
1024 bytes. (I think the "dup" syntax is only supported by recent
versions of NASM, but that's not important.) However, I'd prefer to use
some calculation to get NASM to reliably fill the buffer, such as:
entry: jmp trans
times 1024 - ($ - entry) db 0
Alternatively, you could use my fill macro [10] like this:
entry: fill 1024, 0, jmp trans
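To see what the "times" line computes, this Python sketch assembles the same bytes by hand: a near jump (E9h with a 16-bit displacement relative to the next instruction) followed by zero fill up to 1024 bytes. The offsets used are hypothetical; with org 100h and trans placed right after the buffer, trans would be at 500h.

```python
import struct

BUFFER_SIZE = 1024

def entry_buffer(entry: int, trans: int) -> bytes:
    """Bytes of `jmp trans` (near) plus `times 1024 - ($ - entry) db 0`."""
    rel16 = (trans - (entry + 3)) & 0xFFFF   # relative to the end of the 3-byte jump
    code = b"\xE9" + struct.pack("<H", rel16)
    fill = BUFFER_SIZE - len(code)           # 1024 - ($ - entry)
    if fill < 0:
        raise ValueError("code does not fit in the buffer")
    return code + bytes(fill)
```

The fill count is derived from the actual code size, so if the jump's encoding ever changed, the buffer would still come out at exactly 1024 bytes instead of silently growing or shrinking.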
(Also, as I think you already suggested in this thread, to optimise the
transient executable size you could put one of the tables into this
buffer, saving 1 KiB at the end of the executable. Only the jump needs
to stay; you could later re-initialise those 3 bytes to hold the correct
start of the table. Or stash some of the messages in there, as long as
they fit in the 1 KiB.)
You do not use an IBM Interrupt Sharing Protocol (IISP) [11] header for
your interrupt hook. Therefore, you could optimise this part a bit, from:
old2F: dd 0xFFFF0000
...
.1: jmp far [cs:old2F]
Into this:
.1: jmp 0:0
old2F equ $ - 4
This is some self-modifying code (SMC) to stash the downlink into an
immediate far jump instruction, instead of using the indirect far jump
to refer to a different memory location in your code segment. The dollar
sign is used in the equate to denote the current assembly position after
the 5-byte instruction; it is offset by minus four so as to address the
far pointer in the instruction's encoding. (As mentioned, you cannot do
this if you use a standard IISP header, because that has a "jmp short
$+18" (EBh 10h) instruction right in front of the downlink field.)
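The encoding can be illustrated with a short Python sketch: a direct far jump is EAh followed by the offset and segment words, so the last four bytes form exactly the same far pointer layout as a "dd" variable. The pointer value used below is hypothetical.

```python
import struct

def far_jmp(offset: int, segment: int) -> bytearray:
    """Encode `jmp segment:offset` as EAh, offset16, segment16 (5 bytes)."""
    return bytearray(b"\xEA" + struct.pack("<HH", offset, segment))

def set_downlink(insn: bytearray, offset: int, segment: int) -> None:
    """Patch the far pointer that `old2F equ $ - 4` addresses: the last
    four bytes of the instruction, offset word first, then segment."""
    insn[-4:] = struct.pack("<HH", offset, segment)
```

Storing the old int 2Fh vector into old2F thus rewrites the jump target in place. (An IISP header, by contrast, starts with EBh 10h, so its downlink cannot double as the operand of a far jump.)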
Next, you're using "or al, al" to check a register for zero. However, it
is more idiomatic [12] to use "test al, al" instead, which right away
hints (to a reader or even to the processor) that no change of the
register occurs.
Additionally, you're using two instructions of the test or compare types
to dispatch down three different paths. (These are handling the function
00h call, or the function 01h call, or anything else.) One comparison
instruction suffices to do that however. Observe:
new2F:
...
cmp al, 1
ja .chain
je .function_01
.function_00:
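The trick is that one "cmp" sets both the carry and zero flags, and the two conditional jumps read different combinations of them. This Python sketch models the flags for an 8-bit AL to show the three-way split.

```python
def dispatch(al: int) -> str:
    """Model of `cmp al, 1` followed by `ja .chain` and `je .function_01`."""
    result = (al - 1) & 0xFF
    cf = al < 1            # borrow: set when unsigned AL < 1
    zf = result == 0       # set when AL == 1
    if not cf and not zf:  # ja: CF=0 and ZF=0, unsigned above
        return "chain"
    if zf:                 # je: ZF=1
        return "function_01"
    return "function_00"   # fall through: only AL == 0 remains
```

Since "ja" does not modify the flags, "je" can test the same comparison result without a second "cmp" or "test".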
Other than your earlier use of "dup" syntax, you are also using "mov
word ds:[bx], entry". The segment override outside the brackets is
another MASMism that NASM has recently learned to support. I prefer the
segment override within the brackets however.
> PRTSCR also includes the ability for the TSR to allocate memory blocks in
> Expanded Memory (EMS) or Extended Memory (EMS, though this happens
> indirectly through the use of DOS Protected Mode Services or DPMS).
I think you meant XMS as the abbreviation of "Extended Memory" here, Bret.
> Using these techniques, you can actually have a complicated TSR that
> requires LOTS of data but only a small part of the data (and code)
> requires the use of conventional (or even upper) memory.
> I'm still experimenting with the EMS & DPMS things so don't think that
> part is necessarily "good to go", but it is something you can experiment
> with if you want.
> I'm also converting the code from A86 to NASM, and the code on the web
> site is in A86 (actually, A386) format so you would need to modify it to
> work with some other assembler.
One of my applications, the lDebug debugger (with a small "L"), will use
XMS for two features: The video screen swapping recently copied from
FreeDOS Debug, and the symbolic debugging support that is not yet
included in the builds of lDebug I prepare on our server.
The symbol tables for the symbolic option may require lots of memory. I
capped this at 256 KiB for 86 Mode memory (i.e., the first 1024 or 1088
KiB, as addressable directly in Real or Virtual 86 Mode), but the XMS
possibility supports symbol tables up to the maximum of 2 MiB (plus
transfer buffer), which maxes out the 16-bit indices used to refer to
the symbol main, hash, and string data. I'm approaching the 64 KiB
segment limit both for my code segment and my entry/data/stack segment,
so it is a very good thing to not cram the symbol tables into that. Even
using additional segments, there is no way to fit 2 MiB into the 86 Mode
memory.
The way I support XMS is by defining a small (260 bytes) buffer in the
86M memory of my data segment, called the access slice. Any access to
the symbol tables (either main, hash, or string entries) goes through
some functions that take an index to read from the symbol tables and
return a far pointer. (The reason for not hardcoding the access slice
address, and using the pointer instead, is to allow addressing 86 Mode
memory symbol tables directly while XMS symbol tables use the access
slice method, without duplicating all the "business logic" for each
method.) The appropriate data is copied from the XMS allocation's space
into the access slice. If the application wants to modify some part of
the symbol tables, it first requests the access slice be filled, then
modifies the data in the access slice, and then calls another function
to copy back the changes from the access slice to the symbol tables.
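The access slice scheme can be sketched in Python as follows. This is a toy model under stated assumptions: a bytearray stands in for the XMS allocation, a 260-byte buffer stands in for the access slice in the data segment, a memoryview plays the role of the returned far pointer, and the class and method names are hypothetical, not lDebug's actual functions.

```python
SLICE_SIZE = 260  # size of the access slice buffer described above


class AccessSlice:
    """Toy model: large symbol store accessed through one small buffer."""

    def __init__(self, store: bytearray):
        self.store = store                  # stands in for the XMS allocation
        self.slice = bytearray(SLICE_SIZE)  # stands in for the 86M buffer
        self.base = None                    # store offset currently in the slice
        self.length = 0

    def access(self, offset: int, length: int) -> memoryview:
        """Copy store bytes into the slice and return a view of them,
        playing the role of the far pointer handed back to callers."""
        if length > SLICE_SIZE or offset + length > len(self.store):
            raise ValueError("request does not fit")
        self.slice[:length] = self.store[offset:offset + length]
        self.base, self.length = offset, length
        return memoryview(self.slice)[:length]

    def commit(self) -> None:
        """Copy the (possibly modified) slice contents back to the store."""
        if self.base is None:
            return
        self.store[self.base:self.base + self.length] = self.slice[:self.length]
```

Callers only ever see the returned view, so the same lookup code could just as well be handed a view directly into tables held in 86 Mode memory, which is the point of not hardcoding the slice address.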
Regards,
ecm
[1]: https://hg.pushbx.org/ecm/rxansi
[2]: https://hg.pushbx.org/ecm/tsr
[3]: https://hg.pushbx.org/ecm/lclock
[4]: https://hg.pushbx.org/ecm/tsr/file/daca203fa216/transien.asm#l962
[5]: https://sand.truman.edu/~dbindner/freeware/
[6]: https://hg.pushbx.org/ecm/fdapm/file/62a7d769a9f6/source/fdapm/fdapm.asm#l154
[7]: https://hg.pushbx.org/ecm/fdapm/file/62a7d769a9f6/source/fdapm/fdapm.asm#l146
[8]: https://retrocomputing.stackexchange.com/questions/20001/how-much-of-the-program-segment-prefix-area-can-be-reused-by-programs-with-impun/20006#20006
[9]: https://fd.lod.bz/rbil/interrup/dos_kernel/2126.html
[10]: https://hg.pushbx.org/ecm/lmacros/file/9fa0e64034cd/lmacros1.mac#l916
[11]: https://fd.lod.bz/rbil/interrup/tsr/2d.html
[12]: https://stackoverflow.com/questions/33721204/test-whether-a-register-is-zero-with-cmp-reg-0-vs-or-reg-reg
_______________________________________________
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel