Hi Brett!
A quick analysis leads me to believe that the culprit is this commit:
revision 1.234
date: 2018-03-24 09:08:19 +0100; author: mlelstv; state: Exp;
lines: +176 -134; commitid: xU4Kh6YFLfDywGvA;
branches: 1.234.2;
Use separate lock to protect internal state and release locks when
calling biodone.
Here the logic for ST_EARLY_WARN got lost. As a result, EOM now always
delivers EIO instead of a write count of 0 when the drive reports EOM and
early warning is enabled.
The early warning logic is described in st.4 as:
EOM HANDLING
Attempts to write past EOM and how EOM is reported are handled slightly
differently based upon whether EARLY WARNING recognition is enabled in
the driver.
If EARLY WARNING recognition is not enabled, then detection of EOM (as
reported in SCSI Sense Data with an EOM indicator) causes the write
operation to be flagged with I/O error (EIO). This has the effect for
the user application of not knowing actually how many bytes were read
(since the return of the read(2) system call is set to −1).
If EARLY WARNING recognition is enabled, then detection of EOM (as
reported in SCSI Sense Data with an EOM indicator) has no immediate
effect except that the driver notes that EOM has been detected. If the
write completing didn't transfer all data that was requested, then the
residual count (counting bytes not written) is returned to the user
application. In any event, the next attempt to write (if that is the next
action the user application takes) is immediately completed with no data
transferred, and a residual returned to the user application indicating
that no data was transferred. This is the traditional UNIX EOF
indication. The state that EOM had been seen is then cleared.
In either mode of operation, the driver does not prohibit the user
application from writing more data, if it chooses to do so. This will
continue up until the physical end of media, which is usually signalled
internally to the driver as a CHECK CONDITION with the Sense Key set to
VOLUME OVERFLOW. When this or any otherwise unhandled error occurs, an
error return of EIO will be transmitted to the user application. This
does indeed mean that if EARLY WARNING is enabled and the device
continues to set EOM indicators prior to hitting physical end of media,
that an indeterminate number of 'short write returns' as described in the
previous paragraph will occur. However, the expected user application
behaviour (in common with other systems) is to close the tape and rewind
and request another tape upon the receipt of the first EOM indicator,
possibly after writing one trailer record.
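To make the two modes easier to compare, here is a minimal model of the
behaviour st.4 documents. It is only a sketch and not code from st.c: the
names tape_model, write_result and model_write are made up for
illustration, and early_warn merely stands in for ST_EARLY_WARN being set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the behaviour documented in st.4; not st.c code. */
struct tape_model {
	int early_warn;		/* stands in for ST_EARLY_WARN being set */
	int eom_pending;	/* EOM was noted on an earlier write */
};

struct write_result {
	int error;		/* 0 on success, EIO on failure */
	size_t resid;		/* bytes not written */
};

/*
 * requested:   bytes the application asked to write
 * transferred: bytes the drive actually wrote
 * drive_eom:   nonzero if the sense data carried an EOM indicator
 */
static struct write_result
model_write(struct tape_model *t, size_t requested, size_t transferred,
    int drive_eom)
{
	struct write_result r = { 0, 0 };

	assert(transferred <= requested);

	if (t->early_warn && t->eom_pending) {
		/* Write following EOM: complete at once, nothing transferred. */
		t->eom_pending = 0;
		r.resid = requested;
		return r;
	}

	r.resid = requested - transferred;
	if (drive_eom) {
		if (t->early_warn)
			t->eom_pending = 1;	/* only note EOM for now */
		else
			r.error = EIO;		/* no early warning: I/O error */
	}
	return r;
}
```

With early_warn set, a write that hits EOM returns without error (possibly
short), and the following write returns with resid equal to requested, which
the application sees as a write count of 0. Without early_warn the same
event yields EIO.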
dump aborts on EIO, but it will switch tapes if it gets a zero write count.
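Seen from the application, the distinction looks roughly like the sketch
below. This is not taken from the dump sources; classify_write and the
tape_action values are hypothetical names, and treating a short write as
"carry on and expect 0 next time" is my reading of the st.4 text above.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical classification of a write(2) result on a tape device. */
enum tape_action {
	TAPE_OK,		/* full record written */
	TAPE_SHORT,		/* partial record; next write should return 0 */
	TAPE_NEXT_VOLUME,	/* zero write count: treat as end of tape */
	TAPE_ABORT		/* write failed, e.g. with EIO */
};

static enum tape_action
classify_write(ssize_t ret, size_t requested)
{
	if (ret < 0)
		return TAPE_ABORT;
	if (ret == 0)
		return TAPE_NEXT_VOLUME;
	if ((size_t)ret < requested)
		return TAPE_SHORT;
	return TAPE_OK;
}
```

A caller would pass the return value of write(2) together with the record
size. With the regression, the driver returns -1/EIO at EOM, so such a
caller ends up in the TAPE_ABORT case instead of TAPE_NEXT_VOLUME.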
Thus the 1.234 commit should be fixed with respect to EOM signalling.
Frank
On 06/09/21 02:47, Brett Lymn wrote:
Folks,
I neither perform tape backups nor update this machine very often, so it
has taken a while for me to spot this.
I back up to tape, which takes a few tapes to complete. In the past this
has worked fine: when one tape is full, dump recognises this and prompts
for a new tape.
I attempted a backup a couple of days ago and now dump says "write
error" and then asks if it should restart the dump. Answering yes
restarts the dump from the beginning; answering no causes dump to exit.
As I said, this machine does not get updated often so I suspect this
problem has been there for a while. The kernel was built with v1.240 of
st.c, and this version causes dump to misbehave. I reverted st.c back to
v1.231 (this was the version of st.c that was used in the kernel that
made the last successful backup). After adding a couple of FALLTHROUGH
comments to get v1.231 to compile, I booted this kernel and found that
dump behaved correctly again.
Given the above, it looks like a change to st.c between v1.231 and v1.240
has broken multi-tape dumps. Fortunately, most of the commits in that
bracket are cosmetic. One that does stand out is v1.238, which
modifies the tape position handling. I will try a kernel that
incorporates v1.237 of st.c and see what happens. Unfortunately,
testing is a very slow process, as it takes about 3 hours to fill a tape,
though I may be able to reduce that by using an LTO-1 tape instead, which
should halve the time taken to fill a tape.