Hey, all. Just to resurrect this old thread, we ran into all of these same issues, plus more.
############ POSSIBLY FIXED #1

My 8.1.13.100 servers do not seem to leave hung stgrule target processes around. It also looks like 8.1.13.012 has some additional fixes that help here. The 012 patches are not 100% included in 8.1.13.100, which looks to be based off of 8.1.13.010. These are the APARs in 8.1.13.012 but not in 8.1.13.100:

* IT40506 - REPLICATION JOB TRIGGERED FROM A REPLICATION STORAGE RULE LASTING BEYOND DEFINED DURATION OR APPEARING TO HANG
* IT40121 - REPLICATION STGRULE MAY ENCOUNTER A HANG WHEN THE SDREPLTCRPHASE CHECKTHREAD THREAD EXITS EARLY
* IT39715 - COPY TO TAPE USING A STORAGE RULE FAILS WITH ERROR ANR0102E
* IT40973 - ANR1652E REPLICATION FAILED MESSAGE APPEARS AT THE END OF A SUCCESSFUL REPLICATION STORAGE RULE
* IT40995 - A STGRULE OF AN ACTIONTYPE=COPY WON'T HONOR THE CANCEL PROCESS COMMAND

############ WORKAROUND #2

The issue where START STGRULE hangs until RESOURCETIMEOUT and aborts is not resolved. Relatedly, ongoing monitoring that runs commands such as QUERY NODE and QUERY BACKUPSET will also hang while the STGRULE is hung. By "hung" I mean the rule never shows up in Q PROC, and if you run it from DSMADMC, it just sits there, never starting. This is related to client sessions holding table locks on the nodes view.

So far, support says to either ensure there is a time slot with zero client activity, or to create an admin schedule/script that runs DISABLE SESS CLIENT, delays long enough for client sessions to drain, runs START STGRULE, and then runs ENABLE SESS. Once START STGRULE returns, the hang risk is no longer a problem. Note that REPLICATE NODE does not have this issue; it is purely the stgrule-based replication that does. We are still hoping for adjustments to lock handling during stgrule start.

############ FIXED #3

A key fix was IT40338, but really, most of 8.1.13.010 / 8.1.13.100 consists of stgrule fixes. A couple of our issues with target-side termination hangs are fixed in 8.1.13.012, and those fixes are not in 8.1.13.100.
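The disable/start/enable workaround that support suggested could be sketched as three administrative schedules. This is only a sketch: the schedule names, the rule name ORA_REPL_RULE, and the start times are hypothetical, and the gaps must be tuned so client sessions have fully drained before the rule starts in your environment.

```
/* Sketch only -- schedule names, rule name ORA_REPL_RULE, and times  */
/* are hypothetical; tune the gaps to your session drain time.        */

/* 21:00 - stop new client sessions */
DEFINE SCHEDULE STGRULE_DISABLE TYPE=ADMINISTRATIVE -
  CMD="DISABLE SESSIONS CLIENT" ACTIVE=YES -
  STARTTIME=21:00 PERIOD=1 PERUNITS=DAYS

/* 21:15 - once sessions have drained, start the replication stgrule */
DEFINE SCHEDULE STGRULE_START TYPE=ADMINISTRATIVE -
  CMD="START STGRULE ORA_REPL_RULE" ACTIVE=YES -
  STARTTIME=21:15 PERIOD=1 PERUNITS=DAYS

/* 21:20 - START STGRULE should have returned by now; allow clients again */
DEFINE SCHEDULE STGRULE_ENABLE TYPE=ADMINISTRATIVE -
  CMD="ENABLE SESSIONS CLIENT" ACTIVE=YES -
  STARTTIME=21:20 PERIOD=1 PERUNITS=DAYS
```

Fixed-time schedules are the blunt version of this; a server script issuing the same commands in sequence would be tighter, but the server script language has no built-in delay, which is why separate schedules are used here.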
All of the issues were mostly related to busy servers where Oracle logs are backed up throughout the day for a shorter RPO. The fixes also improved the issue where external monitoring sessions would hang and accumulate in Q SES. Five weeks in, we have not had client hangs or slow backups while replication runs.

############ POSSIBLY FIXED #4

As for the tiering-by-filespace issue, once we put on 8.1.13.100, our draining pools began moving data again. I still have a lot of unmigrated data, but the TIER STGPOOL counts are incrementing.

############ For context, the issues mostly seem related to our servers with DBs over 3 TB. This environment is around 4 PB after dedupe, with 15 ingest servers, plus an old set of replicas and a new set of replicas (we are in transition). Some of our servers are active all the time, and some are particularly large. We ran into expiration and chunk-deletion issues that left our DBs pretty large and fragmented, and offline reorg takes too long. We have 70k IOPS of SSD available for the DB and still struggle with admin jobs.

With friendly regards,
Josh-Daniel S. Davis

On 1/21/22 04:06:36 -0800 Michael Prix wrote:

Hello Eric,

customers of mine are seeing issues 1 and 2 as well after applying 8.1.13 and have tickets open. They saw issue 3 with some earlier version, but not presently with 8.1.13.

As for storage rule tiering, we have an interesting problem open, and IBM is, after weeks, neither denying nor confirming that there might be a problem. We want to tier only specific filespaces of some nodes. This should be possible by applying a NOTIER rule with some TIER subrules, but there is no way of defining a subrule for a filespace unless it is a filespace containing a backup of a VM. In the description of the stgrule definition, there is only one sentence pointing to this possibility, and so far there is no confirmation from IBM that this might be the source of our problem.
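For reference, the NOTIER-parent-plus-TIER-subrule pattern described above would look roughly like the following. This is a sketch only: the rule, pool, and node names are hypothetical, the parameters shown are our best reading of the DEFINE STGRULE / DEFINE SUBRULE command reference, and the unresolved question in this thread is precisely whether a subrule can scope a non-VM filespace, so verify every parameter against the documentation for your server level.

```
/* Sketch only -- names are hypothetical and parameters should be    */
/* verified against the DEFINE STGRULE / DEFINE SUBRULE reference.   */

/* Parent rule: tier nothing by default */
DEFINE STGRULE TIERMASTER COLDPOOL ACTIONTYPE=NOTIER

/* Subrule: tier only a specific node. Whether a filespace-level     */
/* scope is honored for non-VM filespaces is the open question.      */
DEFINE SUBRULE TIERMASTER TIERNODE1 COLDPOOL ACTIONTYPE=TIER NODENAME=NODE1
```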
--
Michael Prix

On 1/17/22 10:16 AM, Loon, Eric van (ITOP NS) - KLM wrote:

Hi everybody,

I recently upgraded my servers to 8.1.13.0 so that I could replace the (badly performing) PROTECT STGPOOL and REPLICATE NODE with the new storage rule replication. I found it to be very buggy and ran into several very strange issues:

1) When a replication is canceled on the source server, the inbound replication process on the target server does not end, which prevents starting a new replication. Every new replication results in the error: "ANR3875E START STGRULE: A previous replication storage rule is processing on QVIP6, wait the process to complete". The only way out of this state is bouncing one of the servers.

2) Replication sometimes hangs without doing anything. Canceling the replication results in the situation above.

3) I have also been called twice with complaints from customers that their backups were not running. The server showed a huge number of sessions in the starting state, and the admin console showed very few updates. As soon as I canceled the running replication, all sessions started to work again and all client backups continued!

4) Storage rule tiering also hangs after running for a while, and canceling the running tiering process does not work either.

Is anybody else experiencing these issues? I have a case open for issues 1 and 4, but I can't believe I'm the only one seeing this...

Kind regards,
Eric van Loon
Air France/KLM Storage & Backup