Re: [OpenAFS] Re: 'vos dump' destroys volumes?
On Wed, Apr 4, 2012 at 4:28 AM, Matthias Gerstner wrote:
>> what package management? 1.6.1 final is available at openafs.org; an
>> announcement will be sent sometime today.
>
> I'm running Gentoo Linux and thus use Gentoo Portage for package
> management. They're usually rather quick about integrating new
> packages, but right now 1.6.1_pre1 is the most recent.
>
>> Sure, but it won't be in your package system either; if you're going
>> to build something anyway, why not build something current?
>
> It's because it would be rather simple for me to inject the patch into
> the build script for 1.6.1_pre1 (Gentoo Portage builds from sources
> anyway). But building from a completely different version would become
> somewhat more complex, as Portage downloads its own sources and also
> applies its own patchset to them.
>
> The risk of messing something up would thus be greater for me. That's
> my main concern here.

Try openafs-1.6.1.ebuild from /afs/your-file-system.com/user/shadow; no
patches should be needed.

--
Derrick

___ OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
> what package management? 1.6.1 final is available at openafs.org; an
> announcement will be sent sometime today.

I'm running Gentoo Linux and thus use Gentoo Portage for package
management. They're usually rather quick about integrating new
packages, but right now 1.6.1_pre1 is the most recent.

> Sure, but it won't be in your package system either; if you're going
> to build something anyway, why not build something current?

It's because it would be rather simple for me to inject the patch into
the build script for 1.6.1_pre1 (Gentoo Portage builds from sources
anyway). But building from a completely different version would become
somewhat more complex, as Portage downloads its own sources and also
applies its own patchset to them.

The risk of messing something up would thus be greater for me. That's
my main concern here.

Best regards,

Matthias

--
Matthias Gerstner, Dipl.-Wirtsch.-Inf. (FH), Senior Software Engineer
e.solutions GmbH
Am Wolfsmantel 46, 91058 Erlangen, Germany
Registered Office: Pascalstr. 5, 85057 Ingolstadt, Germany
Phone +49-8458-3332-672, mailto:matthias.gerst...@esolutions.de
Fax +49-8458-3332-20672
e.solutions GmbH
Managing Directors Uwe Reder, Dr. Riclef Schmidt-Clausen
Register Court Ingolstadt HRB 5221
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
On Tue, Apr 3, 2012 at 4:47 AM, Matthias Gerstner wrote:
> Greetings,
>
>> 1.6.0 has the same bug, so, not really.
>
> okay, thanks for the hint!
>
>> 1.6.1pre4 would be a much better choice.
>
> Sadly that version isn't available yet in my package management.

what package management? 1.6.1 final is available at openafs.org; an
announcement will be sent sometime today.

> Is there a patch available that fixes the data corruption bug in pre1?

Sure, but it won't be in your package system either; if you're going
to build something anyway, why not build something current?

--
Derrick
[OpenAFS] Re: 'vos dump' destroys volumes?
On Wed, 28 Mar 2012 19:01:26 +0200 Matthias Gerstner wrote:

>> That version is known to have issues with data corruption/loss, which
>> are fixed in pre4. I don't know if that's what you're hitting, though.
>> (You can also run a newer client with older servers just fine.)
>
> So it seems I'm better off falling back to 1.6.0 on the server side
> then.

As Derrick said, "no".

> The tokens expired due to an error I introduced in the backup script.
> My approach is to renew authentication after each dump. Token lifetime
> is eight hours. And I haven't got any volumes that should take longer
> than that for dumping.
>
> What would a tool look like that refreshes tokens during a long dump?
> Something like a background process in the same authentication group?

Don't make your own; run whatever process you're running under k5start
(or krenew or similar), which handles a lot of the details for you. Or
use -localauth.

>> Hmm, did you forget to attach this?
>
> Sorry. I'm adding it now. It's the log for the "sealed data
> inconsistent" error. As you can see in the log the backup continued
> another ten minutes before it finally failed due to "token expired"
> failure. But it surely was related to the expired token, too.

No, the volserver errored out the operation at 03:32:01, and the 'vos'
process should have gotten an error soon after that and died. The
remaining ten minutes of 'trans foo has been idle' just says that the
transaction for the volume still exists, but nothing is using it.

I'm not sure why you're getting that specific error rather than it
telling you the token has expired, but yeah, if it's plausible that the
token did expire, that seems likely to be the reason why it occurred.

--
Andrew Deason
adea...@sinenomine.net
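[Editor's note: a minimal sketch of the k5start approach Andrew
describes. The keytab path, principal handling, server, partition, and
volume names below are examples, not taken from the thread.]

```sh
# k5start obtains a Kerberos ticket from the keytab, runs aklog to get
# an AFS token as well (-t), refreshes both while the child command
# runs, and exits when the command finishes (-U takes the client
# principal from the keytab).
k5start -q -t -U -f /etc/afs-backup.keytab -- \
    vos dump -server fs1 -partition a -clone -id some.volume \
        -file /backup/some.volume.dump -verbose
```

With -localauth instead, 'vos' run as root on a server machine
authenticates using the server's own key, so no user token is involved
and nothing can expire mid-dump.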
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
On Wed, Mar 28, 2012 at 1:01 PM, Matthias Gerstner wrote:
> Hello,
>
>> Please save the log if it happens again. Just a directory object
>> being corrupt will not delete its children unless you pass '-orphans
>> remove' to the salvager. However, the default, '-orphans ignore',
>> will keep orphaned data around but it will be effectively invisible
>> until you salvage with '-orphans attach'.
>
> yes I understand this mechanism regarding orphaned files. I was using
> the default. However, attaching the orphaned files wouldn't have
> helped me much in this particular case. As I mentioned the volume
> contained about 3.5 million files and the directory structure was a
> crucial part of the data. I would never have been able to reconstruct
> the original data.
>
>> That version is known to have issues with data corruption/loss, which
>> are fixed in pre4. I don't know if that's what you're hitting, though.
>> (You can also run a newer client with older servers just fine.)
>
> So it seems I'm better off falling back to 1.6.0 on the server side
> then.

1.6.0 has the same bug, so, not really. 1.6.1pre4 would be a much
better choice.

--
Derrick
[OpenAFS] Re: 'vos dump' destroys volumes?
Hello,

> Please save the log if it happens again. Just a directory object being
> corrupt will not delete its children unless you pass '-orphans remove'
> to the salvager. However, the default, '-orphans ignore', will keep
> orphaned data around but it will be effectively invisible until you
> salvage with '-orphans attach'.

yes I understand this mechanism regarding orphaned files. I was using
the default. However, attaching the orphaned files wouldn't have helped
me much in this particular case. As I mentioned the volume contained
about 3.5 million files and the directory structure was a crucial part
of the data. I would never have been able to reconstruct the original
data.

> That version is known to have issues with data corruption/loss, which
> are fixed in pre4. I don't know if that's what you're hitting, though.
> (You can also run a newer client with older servers just fine.)

So it seems I'm better off falling back to 1.6.0 on the server side
then.

> I assume the volserver is running the same version? As Kim said,
> 'rxdebug 7005 -version'

Yes, it's the same version. I've been running 1.6.1_pre1 for about two
months now. The systems were also rebooted back then, so all versions
are consistent. Problems only arose recently, though.

> If you turn on the volser audit log with
> '-auditlog /usr/afs/logs/VolserLog.audit' or something

I'll consider increasing the log level. For now I'm trying to calm the
systems down a bit. Today I was able to perform a complete backup
without errors. I will continue to monitor it closely.

> So, you just have two completely separate servers, and each one is
> running a fileserver/volserver? Yeah, that shouldn't matter.

Exactly.

> That's "sealed data inconsistent". You can get this if your tokens
> expired sometime during the process (I don't remember / just don't
> know what causes that vs an 'expired' message). Do you have the output
> of 'vos' running with '-verbose' by any chance? How long are the
> backups taking, and are you running this under a tool to refresh
> tokens?

Yes, it probably had to do with expiring authentication. It's just
confusing that after that specific error I was able to dump another
volume; only then did I receive a token expiration error.

The tokens expired due to an error I introduced in the backup script.
My approach is to renew authentication after each dump. Token lifetime
is eight hours, and I haven't got any volumes that should take longer
than that to dump.

What would a tool look like that refreshes tokens during a long dump?
Something like a background process in the same authentication group?

> Hmm, did you forget to attach this?

Sorry. I'm adding it now. It's the log for the "sealed data
inconsistent" error. As you can see in the log the backup continued
another ten minutes before it finally failed due to a "token expired"
failure. But it surely was related to the expired token, too.

Best regards,

Matthias

--
Matthias Gerstner, Dipl.-Wirtsch.-Inf. (FH), Senior Software Engineer
e.solutions GmbH
Am Wolfsmantel 46, 91058 Erlangen, Germany
Registered Office: Pascalstr. 5, 85057 Ingolstadt, Germany
Phone +49-8458-3332-672, mailto:matthias.gerst...@esolutions.de
Fax +49-8458-3332-20672
e.solutions GmbH
Managing Directors Uwe Reder, Dr. Riclef Schmidt-Clausen
Register Court Ingolstadt HRB 5221

[Attachment: VolserLog]

Tue Mar 27 03:07:49 2012 1 Volser: Clone: Cloning volume 536884635 to new volume 536889553
Tue Mar 27 03:13:00 2012 trans 112 on volume 536889553 is older than 300 seconds
Tue Mar 27 03:13:30 2012 trans 112 on volume 536889553 is older than 330 seconds
Tue Mar 27 03:14:00 2012 trans 112 on volume 536889553 is older than 360 seconds
Tue Mar 27 03:14:30 2012 trans 112 on volume 536889553 is older than 390 seconds
Tue Mar 27 03:15:00 2012 trans 112 on volume 536889553 is older than 420 seconds
Tue Mar 27 03:15:30 2012 trans 112 on volume 536889553 is older than 450 seconds
Tue Mar 27 03:16:01 2012 trans 112 on volume 536889553 is older than 480 seconds
Tue Mar 27 03:16:31 2012 trans 112 on volume 536889553 is older than 510 seconds
Tue Mar 27 03:17:01 2012 trans 112 on volume 536889553 is older than 540 seconds
Tue Mar 27 03:17:31 2012 trans 112 on volume 536889553 is older than 570 seconds
Tue Mar 27 03:18:01 2012 trans 112 on volume 536889553 is older than 600 seconds
Tue Mar 27 03:18:31 2012 trans 112 on volume 536889553 is older than 630 seconds
Tue Mar 27 03:19:01 2012 trans 112 on volume 536889553 is older than 660 seconds
Tue Mar 27 03:19:31 2012 trans 112 on volume 536889553 is older than 690 seconds
Tue Mar 27 03:20:01 2012 trans 112 on volume 536889553 is older than 720 seconds
Tue Mar 27 03:20:31 2012 trans 112 on volume 536889553 is older than 750 seconds
Tue Mar 27 03:21:01 2012 trans 112 on volume 536889553 is older than 780 seconds
Tue Mar 27 03:21:31 2012 trans 112 on volume 536889553 is older than 810 seconds
Tue Mar 27 03:22:01 2012 trans 112 on volume 536889553 is older than 840 seconds
Tue Mar 27 03:22:31 2012 trans 112
[OpenAFS] Re: 'vos dump' destroys volumes?
On Tue, 27 Mar 2012 14:01:04 +0200 Matthias Gerstner wrote:

> The situation with the salvage was as follows: The affected volume
> was a pretty large volume containing about 160 gigabytes of data
> spread across 3.5 million files. During the salvage I saw a *lot* of
> log lines similar to this flying by:
>
> '??/??/SomeFile' deleted.
>
> After half an hour of seeing this the volume was back online with
> less than 10 gigabytes of data remaining. So I figured the top-level
> directory structure got somehow lost. Sorry that I can't provide the
> actual log any more.

Please save the log if it happens again. Just a directory object being
corrupt will not delete its children unless you pass '-orphans remove'
to the salvager. However, the default, '-orphans ignore', will keep
orphaned data around but it will be effectively invisible until you
salvage with '-orphans attach'.

> Seems I forgot to mention 'pre1':
>
> # strings /usr/sbin/vos | grep built
> @(#) OpenAFS 1.6.1pre1 built 2012-01-24
>
> Is it too risky to use the pre-release? I got used to running the
> unstable openafs packages for being able to keep up with recent Linux
> kernel versions.

That version is known to have issues with data corruption/loss, which
are fixed in pre4. I don't know if that's what you're hitting, though.
(You can also run a newer client with older servers just fine.)

I assume the volserver is running the same version? As Kim said,
'rxdebug 7005 -version'

> Now that you say it, it really does look like two things are running
> in parallel. But I can't think of how that could be happening. The
> backup script is supposed to dump one volume after another in a
> serial manner. And on this specific server the backup script is the
> only administrative AFS operation that is scheduled at all. Also when
> I disable the backup job for a night then nothing shows up in the log
> at all.
If you turn on the volser audit log with
'-auditlog /usr/afs/logs/VolserLog.audit' or something, you can see
specifically what operations were run when and by whom. Or turn up the
debug level with '-d 125 -log', and you'll see a bunch more information
in VolserLog interspersed with everything else.

> However, I'm running two pairs of file and volume servers. Each
> machine performs a backup of its volumes and this happens in
> parallel. But this shouldn't affect a single machine's log.

So, you just have two completely separate servers, and each one is
running a fileserver/volserver? Yeah, that shouldn't matter.

> I'm getting continued weird behaviour during my backups. Last night
> for example a dump was aborted with the following error message:
>
> 'consealed data inconsistent'

That's "sealed data inconsistent". You can get this if your tokens
expired sometime during the process (I don't remember / just don't know
what causes that vs an 'expired' message). Do you have the output of
'vos' running with '-verbose' by any chance? How long are the backups
taking, and are you running this under a tool to refresh tokens?

> However the original volume in question remained intact this time.
> I'm attaching the VolserLog of this incident.

Hmm, did you forget to attach this?

--
Andrew Deason
adea...@sinenomine.net
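[Editor's note: one hedged way to apply the flags Andrew mentions when
the volserver runs under the bosserver. The BosConfig path, instance
name, and hostname below are the common defaults used as examples, not
details from the thread.]

```sh
# In /usr/afs/local/BosConfig, extend the fs instance's volserver parm
# line with the audit log and debug flags, e.g.:
#
#   parm /usr/afs/bin/volserver -auditlog /usr/afs/logs/VolserLog.audit -d 125 -log
#
# then restart the instance so the new flags take effect:
bos restart fs1.example.com -instance fs -localauth
```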
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
You're right, but the cloning is fast ... so it avoids most of the
writability issue, esp. for large volumes ...

As for 'vos' ... lol -- maybe I should try _reading_ what's been
written. Sheesh. Thx.

Kim

On 3/26/2012 2:02 PM, Andrew Deason wrote:
> On Mon, 26 Mar 2012 13:17:14 -0600 Kim Kimball wrote:
>
>> Dumping the RW volume makes it "busy" during the dump, which makes
>> the volume unwritable -- and generates "afs: Waiting for busy
>> volume" errors when a write occurs.
>
> Matthias used the -clone option, which should avoid that. The volume
> will not be writeable during the clone operation, though.
>
>> Identifying the software version that is running is better done with
>> "rxdebug" -- it's a nit, but the binaries are not guaranteed to be
>> the same as what's running -- and the "strings | grep" approach only
>> tells you what version the binary is, and not what the running
>> version is ...
>
> Yeah yeah; that can be a tad inconvenient for 'vos', though :)
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
Of course. Didn't look at the numeric IDs. I assume that's what
happened here?

On 3/26/2012 2:19 PM, Derrick Brashear wrote:
> On Mon, Mar 26, 2012 at 3:17 PM, Kim Kimball wrote:
>> Dumping the RW volume makes it "busy" during the dump, which makes
>> the volume unwritable -- and generates "afs: Waiting for busy
>> volume" errors when a write occurs.
>>
>> Dumping the .backup is not just a good practice, in my opinion, it
>> is the only sensible practice if keeping writability is important.
>> Large volumes can take a while to dump --
>>
>> Identifying the software version that is running is better done with
>> "rxdebug" -- it's a nit, but the binaries are not guaranteed to be
>> the same as what's running -- and the "strings | grep" approach only
>> tells you what version the binary is, and not what the running
>> version is ...
>>
>> It does look like more than one operation was in progress -- a
>> volume delete isn't part of a volume dump
>
> the temporary clone gets cleaned up at the end.
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
On Mon, Mar 26, 2012 at 3:17 PM, Kim Kimball wrote:
> Dumping the RW volume makes it "busy" during the dump, which makes
> the volume unwritable -- and generates "afs: Waiting for busy volume"
> errors when a write occurs.
>
> Dumping the .backup is not just a good practice, in my opinion, it is
> the only sensible practice if keeping writability is important. Large
> volumes can take a while to dump --
>
> Identifying the software version that is running is better done with
> "rxdebug" -- it's a nit, but the binaries are not guaranteed to be
> the same as what's running -- and the "strings | grep" approach only
> tells you what version the binary is, and not what the running
> version is ...
>
> It does look like more than one operation was in progress -- a volume
> delete isn't part of a volume dump

the temporary clone gets cleaned up at the end.

--
Derrick
[OpenAFS] Re: 'vos dump' destroys volumes?
On Mon, 26 Mar 2012 13:17:14 -0600 Kim Kimball wrote:

> Dumping the RW volume makes it "busy" during the dump, which makes
> the volume unwritable -- and generates "afs: Waiting for busy volume"
> errors when a write occurs.

Matthias used the -clone option, which should avoid that. The volume
will not be writeable during the clone operation, though.

> Identifying the software version that is running is better done with
> "rxdebug" -- it's a nit, but the binaries are not guaranteed to be
> the same as what's running -- and the "strings | grep" approach only
> tells you what version the binary is, and not what the running
> version is ...

Yeah yeah; that can be a tad inconvenient for 'vos', though :)

--
Andrew Deason
adea...@sinenomine.net
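[Editor's note: the rxdebug check Kim describes, run against the
standard AFS server ports; the hostname is an example.]

```sh
# Ask the running daemons for their build version over the wire,
# instead of trusting the binaries on disk. 7000 is the fileserver
# port, 7005 the volserver port.
rxdebug fs1.example.com 7000 -version   # fileserver
rxdebug fs1.example.com 7005 -version   # volserver
```

This only works for server processes listening on a port, which is why
it cannot answer the question for a client tool like 'vos'.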
Re: [OpenAFS] Re: 'vos dump' destroys volumes?
Dumping the RW volume makes it "busy" during the dump, which makes the
volume unwritable -- and generates "afs: Waiting for busy volume"
errors when a write occurs.

Dumping the .backup is not just a good practice, in my opinion, it is
the only sensible practice if keeping writability is important. Large
volumes can take a while to dump --

Identifying the software version that is running is better done with
"rxdebug" -- it's a nit, but the binaries are not guaranteed to be the
same as what's running -- and the "strings | grep" approach only tells
you what version the binary is, and not what the running version is ...

It does look like more than one operation was in progress -- a volume
delete isn't part of a volume dump

Kim

On 3/26/2012 11:38 AM, Andrew Deason wrote:
> On Mon, 26 Mar 2012 17:25:04 +0200 Matthias Gerstner wrote:
>
>> I'm recently experiencing trouble during my backup of OpenAFS
>> volumes. I perform backups using the
>>
>> 'vos dump -server -partition -clone -id '
>
> I presume is an rw volume?
>
> Just so you know, a more common way of doing this is to use 'vos
> backupsys' and then back up the .backup volumes. Nothing 'wrong' with
> what you're doing, but it's a less common way.
>
>> However some days ago the backup of a specific volume failed with a
>> bad exit code (255). My backup script thus stopped further
>> processing. The concerned volume went offline as a result and did
>> only show up in 'vos listvol' as "couldn't attach volume ...".
>
> What did volserver say in VolserLog when that happened? It should
> give a reason as to why it could not attach.
>
>> After running a salvage on the affected volume it was brought back
>> online but most of the contained data was deleted due to a supposed
>> corruption of the directory structure detected during salvage.
>
> SalvageLog will say specifically why. Or SalsrvLog if you are running
> DAFS; are you running DAFS?
>
>> Attached is the VolserLog from the time when the last of the
>> incidents occurred.
>
> What was the volume id for the volume in question? Possibly 536879790
> or 536879793?
>
>> I'm currently running openafs 1.6.1 on Gentoo Linux with kernel
>> version 3.2.1.
>
> 1.6.1 is not a version that exists yet (or at least, certainly did
> not exist on Friday). What version is the volserver, and what version
> is 'vos'? (Running `strings | grep built` is a sure way to tell.)
>
>> Fri Mar 23 00:10:57 2012 1 Volser: Clone: Cloning volume 536879790 to new volume 536889517
>> Fri Mar 23 00:16:04 2012 1 Volser: Delete: volume 536889517 deleted
>> Fri Mar 23 00:16:04 2012 1 Volser: Clone: Cloning volume 536879793 to new volume 536889518
>> Fri Mar 23 00:16:06 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
>> Fri Mar 23 00:16:06 2012 VPurgeVolume: Error -1 when destroying volume 536889517 header
>> Fri Mar 23 00:16:06 2012 1 Volser: Delete: volume 536889517 deleted
>> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted
>> Fri Mar 23 00:16:09 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
>> Fri Mar 23 00:16:09 2012 VPurgeVolume: Error -1 when destroying volume 536889518 header
>> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted
>> Fri Mar 23 00:21:20 2012 trans 69 on volume 536889518 is older than 300 seconds
>> Fri Mar 23 00:21:20 2012 trans 66 on volume 536889517 is older than 300 seconds
>
> Hmm, are you sure 'vos dump' is the only thing you are running at the
> time? (You're running more than one in parallel... how many do you
> run at once?) This sequence of operations does not seem normal for
> just a 'vos dump'.
[OpenAFS] Re: 'vos dump' destroys volumes?
On Mon, 26 Mar 2012 17:25:04 +0200 Matthias Gerstner wrote:

> I'm recently experiencing trouble during my backup of OpenAFS
> volumes. I perform backups using the
>
> 'vos dump -server -partition -clone -id '

I presume is an rw volume?

Just so you know, a more common way of doing this is to use 'vos
backupsys' and then back up the .backup volumes. Nothing 'wrong' with
what you're doing, but it's a less common way.

> However some days ago the backup of a specific volume failed with a
> bad exit code (255). My backup script thus stopped further
> processing. The concerned volume went offline as a result and did
> only show up in 'vos listvol' as "couldn't attach volume ...".

What did volserver say in VolserLog when that happened? It should give
a reason as to why it could not attach.

> After running a salvage on the affected volume it was brought back
> online but most of the contained data was deleted due to a supposed
> corruption of the directory structure detected during salvage.

SalvageLog will say specifically why. Or SalsrvLog if you are running
DAFS; are you running DAFS?

> Attached is the VolserLog from the time when the last of the
> incidents occurred.

What was the volume id for the volume in question? Possibly 536879790
or 536879793?

> I'm currently running openafs 1.6.1 on Gentoo Linux with kernel
> version 3.2.1.

1.6.1 is not a version that exists yet (or at least, certainly did not
exist on Friday). What version is the volserver, and what version is
'vos'? (Running `strings | grep built` is a sure way to tell.)

> Fri Mar 23 00:10:57 2012 1 Volser: Clone: Cloning volume 536879790 to new volume 536889517
> Fri Mar 23 00:16:04 2012 1 Volser: Delete: volume 536889517 deleted
> Fri Mar 23 00:16:04 2012 1 Volser: Clone: Cloning volume 536879793 to new volume 536889518
> Fri Mar 23 00:16:06 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
> Fri Mar 23 00:16:06 2012 VPurgeVolume: Error -1 when destroying volume 536889517 header
> Fri Mar 23 00:16:06 2012 1 Volser: Delete: volume 536889517 deleted
> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted
> Fri Mar 23 00:16:09 2012 VDestroyVolumeDiskHeader: Couldn't unlink disk header, error = 2
> Fri Mar 23 00:16:09 2012 VPurgeVolume: Error -1 when destroying volume 536889518 header
> Fri Mar 23 00:16:09 2012 1 Volser: Delete: volume 536889518 deleted
> Fri Mar 23 00:21:20 2012 trans 69 on volume 536889518 is older than 300 seconds
> Fri Mar 23 00:21:20 2012 trans 66 on volume 536889517 is older than 300 seconds

Hmm, are you sure 'vos dump' is the only thing you are running at the
time? (You're running more than one in parallel... how many do you run
at once?) This sequence of operations does not seem normal for just a
'vos dump'.

--
Andrew Deason
adea...@sinenomine.net
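[Editor's note: the more common pattern Andrew mentions, sketched with
example names; server, partition, volume prefix, and file paths are
placeholders.]

```sh
# 1) (Re)create the .backup clone for each volume matching a prefix;
#    cloning is cheap, so the RW volume is only briefly unavailable.
vos backupsys -prefix proj -server fs1 -partition a

# 2) Dump the stable .backup clone instead of the live RW volume, so
#    the RW volume stays writable for the whole (possibly long) dump.
vos dump -id proj.data.backup -file /backup/proj.data.dump -verbose
```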