Re: [Bacula-users] Tape full before it should.

Alan Brown Thu, 19 Jan 2017 07:47:52 -0800

On 19/01/17 09:50, Francisco Javier Funes Nieto wrote:

Hi !
One of the most illustrative and useful mail threads for those of us who use tapes/tape drives/autochangers in our backup schemes.


Thanks Francisco

As an additional data point, "sg_logs -a /dev/nst0" (or the matching /dev/sg device) will give you a LOT of current data from the drive.
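
For anyone who wants to capture that output from a script, here is a minimal Python sketch. It only wraps the exact command quoted above; the function names are made up for illustration, and it returns None rather than failing when sg_logs is not installed.

```python
import shutil
import subprocess


def sg_logs_command(device="/dev/nst0"):
    """Build the argument list for the command quoted in the mail."""
    return ["sg_logs", "-a", device]


def dump_drive_logs(device="/dev/nst0"):
    """Return the sg_logs output as text, or None if sg_logs is missing."""
    if shutil.which("sg_logs") is None:
        return None
    result = subprocess.run(
        sg_logs_command(device),
        capture_output=True,
        text=True,
        check=False,  # a failing drive may still produce useful output
    )
    return result.stdout
```

Saving the text after each job gives you something to compare against later, when a tape starts misbehaving.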


Sticky mail.

Thanks to all.

On 19 Jan 2017 at 4:27 a.m., "Gi Dot" <gadi...@gmail.com<mailto:gadi...@gmail.com>> wrote:


    Kern, Alan,

    Thanks for the advice. A bit over the top for me to digest, but
    I'll work on it.

    On Mon, Jan 9, 2017 at 11:04 PM, Alan Brown <a...@mssl.ucl.ac.uk
    <mailto:a...@mssl.ucl.ac.uk>> wrote:

        On 09/01/17 13:45, Kern Sibbald wrote:

        Hello,

        The status Bacula received was -1, which means that the tape
        drive reported a hardware end of tape (i.e. an end of tape
        marker was seen). This can happen for the following reasons:

        1. You reached the hardware end of tape marker at 150GB, but
        the marker was placed in the wrong place on the tape when it
        was manufactured.  I.e. the tape cassette is defective.


        Kern, that's not a good interpretation of the problem.


        LTO tapes don't have a "hardware end of tape marker" as you
        might expect with DAT or other older unidirectional tapes.


        Because of the serpentine layout of the tape, the beginning of
        the tape is also the end of the tape and the servo track
        (factory written and unchangeable) contains "offset distance
        from end of the reel" information.

        Serpentine means:

        1: The tape winds to the end of the reel, heads move slightly
        (onto the next track) and then the tape winds back into the
        cartridge.
        2: The heads move to the next track again.
        3: This process is then repeated until the last track pair is
        completed.
        4: Data is written to the tape in both directional passes.

        When the end of the last track is reached, the tape has been
        wound back into the cartridge.

        What this means is that the longest seek covers approximately
        half of one track length (~900 metres), which is around 35GB,
        even if you're seeking several hundred GB into the tape.
        While the seek command is a linear offset, actual seeking on
        an LTO tape is 2-dimensional - "N track and X offset". The
        tape's internal chip records the 2D location of files and
        data blocks, so that there's never any need to linearly seek
        along all tracks from the start of the tape.
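
The "N track and X offset" idea can be sketched in a few lines of Python. This is a simplified model, not how drive firmware works: the function name, the per-pass capacity and the number of passes are all hypothetical parameters, and real drives also deal with data bands and rewritten blocks.

```python
def serpentine_position(offset_bytes, pass_capacity_bytes, num_passes):
    """Map a linear byte offset to (pass index, position along the tape).

    The position is a fraction of the tape length measured from the
    start of the reel: even passes run outward, odd passes run back.
    All three parameters are illustrative assumptions.
    """
    total = pass_capacity_bytes * num_passes
    if not 0 <= offset_bytes < total:
        raise ValueError("offset is outside the modelled tape")
    pass_index, within = divmod(offset_bytes, pass_capacity_bytes)
    fraction = within / pass_capacity_bytes
    # On a return pass the head travels back toward the start of the reel.
    along = fraction if pass_index % 2 == 0 else 1.0 - fraction
    return pass_index, along
```

Two offsets that are far apart in linear terms can therefore sit at nearly the same physical position along the tape, which is why a long linear seek does not mean a long wind.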

        LTO heads are constructed so that drives do read-after-write
        verification on the fly in both directions. A bacula
        verification pass is normally unnecessary because detected
        errors result in the data being rewritten to the tape immediately.

        If there are errors, the drive will attempt to rewrite the
        data several times(*). If all rewrites fail then it will flag
        an uncorrectable error - "The tape is bad and should be
        discarded"(**). Bacula interprets this as an end-of-tape error.


        (*) This means that a tape with a lot of errors shows 2
        effects:

           1: There's a massive slowdown in reported despooling speed
        for jobs and tape "full" capacity is reduced somewhat from the
        theoretical values (somewhere between 90%-250% of
        _uncompressed_ capacity would be a normal tape)

           2: When reading the tape's RFID chip, it will say that the
        tape is somewhere between "97"-"99"% full, but the total
        amount of data it says has been written since last labelled is
        significantly less than the _uncompressed_ value of the tape.

        (**) The same effect will occur if the heads are dirty or
        damaged - and it DOES happen(***). Once a contaminated tape
        finds its way into a drive and fouls the heads you can pretty
        much guarantee that all subsequent tapes will have reported
        problems, but until the heads are cleaned or repaired you
        won't know if the tapes are wrecked or OK.

        (***) We had a bad batch of HP LTO5s contaminate multiple
        drives before we realised what was happening. We're still
        cleaning up the mess 3 years later.
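
The capacity check described in footnote (*) is easy to automate. The sketch below uses the 1.5TB uncompressed capacity and the 150,990,400,512 bytes reported in the original post further down; the function names and the 0.9 cut-off (the lower end of the 90%-250% range mentioned above) are choices made for illustration only.

```python
NATIVE_CAPACITY_BYTES = 1_500_000_000_000  # 1.5TB uncompressed, per the original post


def fill_ratio(bytes_written, native_capacity=NATIVE_CAPACITY_BYTES):
    """Bytes written before "full", as a fraction of uncompressed capacity."""
    return bytes_written / native_capacity


def looks_short(bytes_written, native_capacity=NATIVE_CAPACITY_BYTES,
                min_ratio=0.9):
    """True if the tape filled up well below what a normal tape holds."""
    return fill_ratio(bytes_written, native_capacity) < min_ratio


# The volume from this thread stopped at roughly 10% of native capacity.
ratio = fill_ratio(150_990_400_512)
```

A ratio this far below 0.9 says that something is wrong, but not whether the tape or the drive is at fault.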


        Drive error codes actually indicate "drive problem", "tape
        problem" or "unable to work out which is the problem", but the
        effect is the same as far as bacula's concerned. There are a
        slew of other error codes.



        LTO tapes wear out rapidly with repeated use. The lifespan of
        an LTO tape is claimed to be "up to" 162 complete writes but in
        reality it's more like 10-20% of this number before
        degradation is significant. We're seeing tapes with 20-30
        write cycles down to 60% of original capacity and thanks to
        rewrites the despool speeds are _very slow_.


        Apart from interrogating the tape drive and tape cartridge
        chip (Kern and I have been discussing how to handle this on
        the fly), despooling speed is a critical indicator of tape
        health. If it suddenly drops off, this is cause for alarm.
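
One way to watch for such a drop is to compare the latest despooling rate with the median of recent jobs. This is a rough sketch: the function name, the MB/s history values and the 50% threshold are assumptions, and the rates themselves would have to be collected from your own job reports.

```python
from statistics import median


def despool_rate_dropped(history_mb_s, latest_mb_s, threshold=0.5):
    """True if the latest rate is below `threshold` times the median
    of earlier rates. With no history there is nothing to compare."""
    if not history_mb_s:
        return False
    return latest_mb_s < threshold * median(history_mb_s)


# Hypothetical rates from earlier jobs on the same drive, in MB/s.
recent = [120.0, 118.0, 125.0]
alarm = despool_rate_dropped(recent, 50.0)
```

Using the median rather than the mean keeps one earlier slow job from hiding a new drop.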


        2. You are using some tape driver (e.g. the ibm tape driver)
        rather than the Linux st tape driver.  The ibm tape driver
        does not work correctly with Bacula.


        Having encountered this problem, the described issue is not
        consistent with the IBM driver error (which comes from "ERROR
        0: Success" messages).

        With the IBM driver, the tape can be labelled and written
        quite happily. Problems occur when attempts are made to seek
        to EOD on a tape with _existing_ data - the error 0 message
        fools bacula into thinking the operation has failed.


        My opinion:

        The error reported and the fact that it took 31 minutes to
        write 150GB before erroring out points to fouled heads.
        Load a cleaning tape(****) and try writing a new tape.
        If that writes ok, then discard the errored tape (and possibly
        the one before that). If not then the drive will need
        return-to-base repairs and the test tape/last tape and one
        before that should be discarded.

        (****) NEVER share a cleaning tape between drives. Yes I know
        that's what libraries do with dedicated cleaning tape slots,
        but it's a really fast way of cross-contaminating hardware.
        Don't do it.
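
The "31 minutes to write 150GB" figure can be turned into an average rate using the byte count from the job log. The helper name is made up. As a point of comparison, LTO-5 drives are commonly specified at about 140 MB/s native, but treat that figure as an assumption rather than something stated in this thread.

```python
def average_rate_mb_s(total_bytes, minutes):
    """Average write rate in MB/s (1 MB = 1,000,000 bytes)."""
    return total_bytes / (minutes * 60) / 1_000_000


# 150,990,400,512 bytes written between 01:00 and 01:31, per the job log.
rate = average_rate_mb_s(150_990_400_512, 31)
```

This works out to a little over 81 MB/s, before allowing for any gain from the hardware compression that tapeinfo shows as enabled.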


        If you don't have an LTO tape cartridge reader (www.mptapes.com
        <http://www.mptapes.com>), then the next best thing is to
        ensure you have the latest version of sg3_utils installed, and
        use sg_read_attr to interrogate the chip.


        You should also install the IBM or HP drive management tools
        (even if this means installing windows) and interrogate drive
        health.


        tapeinfo and loaderinfo utilities are useful but incomplete
        for this kind of diagnosis.


        I've been working through the various sg attribute pages
        trying to see which ones are useful. Drives actually log a
        _large_ amount of data internally about the last few hundred
        tapes used, but unless you ask the right questions you won't
        get any answers out of them (HP and IBM drive tools ask those
        questions, of course - and know how to interpret the answers)

        Best regards,
        Kern

        On 01/09/2017 04:29 AM, Gi Dot wrote:

        Hi all,


        At the data centre we are using IBM LTO tape - 3.0TB
        compressed, 1.5TB uncompressed. Last 2 nights a backup was
        running and it stopped at about 150GB size and bacula marked
        the tape as full.

        Since the total amount of backed up data sometimes could be
        huge, I have purged the volume straight away before the tape
        was inserted. There is a total of 10 jobs, and the first job
        holds the biggest data, somewhere around 500GB to 2TB at a
        time. Backup failed at the first job, at 150GB size.

        | 3,053 | db01Job          | 2017-01-08 01:00:03 | B    | F     |   43,942 |   150,874,925,633 | f

        Excerpt from the logs:
        07-Jan 05:00 phisbackupdns1-dir JobId 3052: shell command: run AfterJob "/usr/lib64/bacula/delete_catalog_backup"
        08-Jan 01:00 phisbackupdns1-dir JobId 3053: Start Backup JobId 3053, Job=phisdb01Job.2017-01-08_01.00.00_52
        08-Jan 01:00 phisbackupdns1-dir JobId 3053: Using Device "Drive0"
        08-Jan 01:00 phisbackupdns1-sd JobId 3053: Volume "A00053L5" previously written, moving to end of data.
        08-Jan 01:01 phisbackupdns1-sd JobId 3053: Warning: For Volume "A00053L5":
        The number of files mismatch! Volume=1955 Catalog=0
        Correcting Catalog
        08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of Volume "A00053L5" at 2106:1 on device "Drive0" (/dev/nst1). Write of 64512 bytes got -1.
        08-Jan 01:31 phisbackupdns1-sd JobId 3053: Re-read of last block succeeded.
        08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of medium on Volume "A00053L5" Bytes=150,990,400,512 Blocks=2,340,501 at 08-Jan-2017 01:31.
        08-Jan 01:31 phisbackupdns1-sd JobId 3053: 3307 Issuing autochanger "unload slot 2, drive 0" command.
        08-Jan 01:33 phisbackupdns1-sd JobId 3053: No slot defined in catalog (slot=0) for Volume "A00032L5" on "Drive0" (/dev/nst1).
        08-Jan 01:33 phisbackupdns1-sd JobId 3053: Cartridge change or "update slots" may be required.
        08-Jan 01:33 phisbackupdns1-sd JobId 3053: Warning: mount.c:217 Open device "Drive0" (/dev/nst1) Volume "A00032L5" failed: ERR=dev.c:513 Unable to open device "Drive0" (/dev/nst1): ERR=No medium found
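
The "End of medium" message in the excerpt above carries the numbers needed to judge how full the tape really was. Here is a small Python sketch that pulls them out; the regular expression is written against the one message shown in this thread, assumes the message sits on a single line, and may need adjusting for other Bacula versions.

```python
import re

END_OF_MEDIUM = re.compile(
    r'End of medium on Volume "(?P<volume>[^"]+)" '
    r'Bytes=(?P<bytes>[\d,]+) Blocks=(?P<blocks>[\d,]+)'
)


def parse_end_of_medium(line):
    """Return volume name, bytes and blocks from the message, or None."""
    match = END_OF_MEDIUM.search(line)
    if match is None:
        return None
    return {
        "volume": match.group("volume"),
        "bytes": int(match.group("bytes").replace(",", "")),
        "blocks": int(match.group("blocks").replace(",", "")),
    }


LINE = ('08-Jan 01:31 phisbackupdns1-sd JobId 3053: End of medium on Volume '
        '"A00053L5" Bytes=150,990,400,512 Blocks=2,340,501 at 08-Jan-2017 01:31.')
summary = parse_end_of_medium(LINE)
```

Dividing bytes by blocks gives exactly 64512, the block size named in the "Write of 64512 bytes got -1" message, so the two log lines agree with each other.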


        Hardware compression is enabled:
        # tapeinfo -f /dev/nst1
        Product Type: Tape Drive
        Vendor ID: 'IBM     '
        Product ID: 'ULT3580-TD5     '
        Revision: 'G360'
        Attached Changer API: No
        SerialNumber: '10WT008032'
        MinBlock: 1
        MaxBlock: 8388608
        SCSI ID: 1
        SCSI LUN: 0
        Ready: yes
        BufferedMode: yes
        Medium Type: 0x58
        Density Code: 0x58
        BlockSize: 0
        DataCompEnabled: yes
        DataCompCapable: yes
        DataDeCompEnabled: yes
        CompType: 0x1
        DeCompType: 0x1
        BOP: yes
        Block Position: 0
        Partition 0 Remaining Kbytes: -1
        Partition 0 Size in Kbytes: -1
        ActivePartition: 0
        EarlyWarningSize: 0
        NumPartitions: 0
        MaxPartitions: 1
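
If you want to record these values alongside each job, the "Key: value" layout of the tapeinfo output above is simple to parse. This is a minimal sketch with a made-up function name; it keeps every value as a string and makes no attempt to interpret fields such as the -1 partition sizes.

```python
def parse_tapeinfo(text):
    """Parse 'Key: value' lines into a dict of strings.

    Surrounding quotes and padding (as in "'IBM     '") are stripped.
    Lines without a colon are ignored.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        info[key.strip()] = value.strip().strip("'").strip()
    return info


SAMPLE = """\
Vendor ID: 'IBM     '
Product ID: 'ULT3580-TD5     '
DataCompEnabled: yes
Partition 0 Remaining Kbytes: -1
"""
info = parse_tapeinfo(SAMPLE)
```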

        Pool configuration for the volume:
        Pool {
           Name = ADHOC
           Label Format = "ADHOC_Vol"
           Pool Type = Backup
           Recycle = yes
           AutoPrune = yes
           Storage = ibmts3310
           Volume Retention = 12h
           Recycle Current Volume = Yes
        }

        Side note: I just realized that I left the "Volume Use Duration =
        10h" directive out of the pool. The reason for the directive is
        that the same tape stays in the drive for 2 nights (Saturday and
        Sunday), since there is no operator around to change a tape. The
        tape is supposed to be recycled on Sunday night.


        I would appreciate it if anyone could enlighten me as to why the
        tape is full way before reaching the capacity it should hold.

        Thanks.

        
------------------------------------------------------------------------------
        Check out the vibrant tech community on one of the world's most
        engaging tech sites, SlashDot.org!http://sdm.link/slashdot

        _______________________________________________
        Bacula-users mailing list
        Bacula-users@lists.sourceforge.net
        <mailto:Bacula-users@lists.sourceforge.net>
        https://lists.sourceforge.net/lists/listinfo/bacula-users
        <https://lists.sourceforge.net/lists/listinfo/bacula-users>


        