Hey, all.  Just to resurrect this old thread, we ran into all of these same
issues, plus more.

############
POSSIBLY FIXED #1, My 8.1.13.100 servers do not seem to leave hung stgrule
target processes around.

Also, it looks like 8.1.13.012 has some additional fixes that help with
this.  The 012 patches are not all included in 8.1.13.100, which looks to
be based off of 8.1.13.010.  (A quick way to confirm which level a server
is running is sketched below the APAR list.)

These are the ones in 8.1.13.012 but not in 8.1.13.100:
* IT40506 - REPLICATION JOB TRIGGERED FROM A REPLICATION STORAGE RULE
LASTING BEYOND DEFINED DURATION OR APPEARING TO HANG
* IT40121 - REPLICATION STGRULE MAY ENCOUNTER A HANG WHEN THE
SDREPLTCRPHASE CHECKTHREAD THREAD EXITS EARLY
* IT39715 - COPY TO TAPE USING A STORAGE RULE FAILS WITH ERROR ANR0102E
* IT40973 - ANR1652E REPLICATION FAILED MESSAGE APPEARS AT THE END OF A
SUCCESSFUL REPLICATION STORAGE RULE
* IT40995 - A STGRULE OF AN ACTIONTYPE=COPY WON'T HONOR THE CANCEL PROCESS
COMMAND
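
Since the fix inventories differ between 012 and 100, it's worth
confirming exactly which level each server is running.  A minimal check
from a shell, assuming a configured dsmadmc client; the admin ID and
password are placeholders, and the match relies on the version line that
dsmadmc prints in its session banner:

  # Pull the server level out of the dsmadmc connection banner,
  # e.g. "Server Version 8, Release 1, Level 13.100"
  # (monitor/secret are placeholder credentials)
  dsmadmc -id=monitor -password=secret "query status" \
    | grep -i "server version"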

############
WORKAROUND #2, The issue where START STGRULE hangs until RESOURCETIMEOUT
and then aborts is not resolved.  Relatedly, any ongoing monitoring that
runs commands like QUERY NODE or QUERY BACKUPSET will also hang while the
STGRULE is hung.  By hung, I mean the rule never shows up in Q PROC; if
you run START STGRULE from DSMADMC, it just sits there, never starting.
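
Until the lock handling improves, one mitigation on the monitoring side
is to wrap those queries in a hard timeout so hung sessions can't pile
up.  A minimal sketch using GNU coreutils timeout; the admin ID and
password are placeholders:

  # Abort the monitoring query if it has not returned within 60 seconds;
  # a hung QUERY NODE here usually means a stgrule start is holding locks
  timeout 60 dsmadmc -id=monitor -password=secret -dataonly=yes \
    "query node" \
    || echo "QUERY NODE timed out - check for a hung START STGRULE"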

This one is related to client sessions holding table locks on the NODES
view.  So far, support says to either ensure there is a time slot with
zero client activity, or to make an admin schedule/script that runs
DISABLE SESS CLIENT, waits long enough for client sessions to drain, runs
START STGRULE, and then runs ENABLE SESS CLIENT.  Once the START STGRULE
returns, the hang risk is no longer really a problem.
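
A minimal sketch of that wrapper as a shell script; the admin ID,
password, and rule name are placeholders, and the 300-second drain window
is an arbitrary choice, not a support-supplied value:

  #!/bin/sh
  # Quiesce client sessions so START STGRULE can acquire its locks
  dsmadmc -id=admin -password=secret "disable sessions client"

  # Give in-flight client sessions time to finish and disconnect
  sleep 300

  # Kick off the replication rule (REPL_RULE is a placeholder name)
  dsmadmc -id=admin -password=secret "start stgrule REPL_RULE"

  # Per the behavior above, once START STGRULE has returned the lock
  # risk is past, so client sessions can be re-enabled
  dsmadmc -id=admin -password=secret "enable sessions client"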

Note that REPLICATE NODE does not have this issue.  It's purely the
stgrule-based replication that does this.  We're still hoping for
adjustments to the lock handling during stgrule start.

############
FIXED #3, A key fix was IT40338, but really, most of 8.1.13.010 /
8.1.13.100 is made up of stgrule fixes.  A couple of issues with
target-side termination hangs are fixed in 8.1.13.012, and those fixes
are not in 8.1.13.100.  The issues were mostly related to busy servers
where Oracle logs are backed up throughout the day for a shorter RPO.
This level also improved the issue where external monitoring sessions
would hang and accumulate in Q SES.  Five weeks in, we have not had
client hangs or slow backups while replication is running.
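
To watch for that accumulation, a count of sessions stuck in the Start
state makes a quick canary.  A sketch, assuming the SELECT interface to
the SESSIONS table and its STATE column behave as documented; the admin
ID and password are placeholders:

  # Count sessions that have not progressed past session start; a count
  # that grows while replication runs matches the failure mode above
  dsmadmc -id=monitor -password=secret -dataonly=yes \
    "select count(*) from sessions where state='Start'"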

############
POSSIBLY FIXED #4, As for the tiering-by-filespace issue, once we
installed 8.1.13.100, our draining pools began moving data again.  I
still have a lot of unmigrated data, but the TIER STGPOOL counts are
incrementing.

############
For context, the issues mostly seem related to our servers with DBs over
3TB.  This environment is around 4PB after dedupe, 15 ingest servers, plus
an old set of replicas, and a new set of replicas (in transition).

Some of our servers are active around the clock, and some are
particularly large.  We ran into expiration and chunk-deletion issues
that left our DBs pretty large and fragmented.  Offline reorg takes too
long.  Even with 70K IOPS of SSD available for the DB, we still struggle
with admin jobs.


With friendly Regards,
Josh-Daniel S. Davis

On 1/21/22 04:06:36 -0800 Michael Prix wrote:


Hello Eric,

customers of mine are also seeing issues 1 and 2 after applying 8.1.13 and
have tickets open. They saw issue 3 with an earlier version, but not
presently with 8.1.13.

As for storage rule tiering, we have an interesting problem open, and after
weeks IBM is neither confirming nor denying that there might be a problem.

We want to tier only specific filespaces of some nodes. This should be
possible by applying a NOTIER rule with some TIER subrules, but there is no
way to define a subrule for a filespace unless it is a filespace containing
a backup of a VM. In the description of the stgrule definition there is
only one sentence pointing to this possibility, and so far there is no
confirmation from IBM that this might be the source of our problem.


--
Michael Prix

On 1/17/22 10:16 AM, Loon, Eric van (ITOP NS) - KLM wrote:

    Hi everybody,

    I recently upgraded my servers to 8.1.13.0 so that I could replace the
(poorly performing) PROTECT STGPOOL and REPLICATE NODE processing with the
new storage rule replication. I found it to be very buggy and ran into
several very weird issues:

    1)     When a replication is canceled on the source server, the inbound
replication process on the target server doesn't end, which prevents
starting a new replication. Every new replication attempt results in the
error: "ANR3875E START STGRULE: A previous replication storage rule is
processing on QVIP6, wait the process to complete". The only way out of
this state is bouncing one of the servers.

    2)     Replication sometimes hangs without doing anything. Canceling the
replication results in the situation described above.

    3)     I have also been called twice with complaints from customers that
their backups were not running. The server showed a huge number of sessions
in the starting state, and the admin console showed very few updates. As
soon as I canceled the running replication, all sessions started to work
again and all client backups continued!

    4)     Storage rule tiering also hangs after running for a while;
canceling the running tiering process does not work either.

    Is anybody else experiencing these issues? I have cases open for issues
1 and 4, but I can't believe I'm the only one with these issues...

    Kind regards,
    Eric van Loon
    Air France/KLM Storage & Backup
