Re: "vaulting run" amdump email report issues -- driver.c "nodump" option

2017-11-21 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 11:45:07 -0500, Jean-Louis Martineau wrote:
> Nathan,
> 
> I was too fast to write the first patch, and you are completely right.
> 
> I committed the attached patch,now amflush use the --no-dump argument 
> and driver merge both 'nodump' and '--no-dump'.

Yeah, looks good.

I applied this patch and ran the test again, and "amdump --no-dump
--no-flush" worked just as well as with the previous patch, but showed
no "dumper lines" in the log.*.0 file.

(Of course the email report did still have the unexpected RESULTS
MISSING line for each DLE.)

I also tried the amflush command with this patch applied, and it too
started up the driver and flushed files to the target storage without
any trouble.


Nathan


Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush and .tmp files in holding directory

2017-11-21 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> Each process should take a lock on the holding directory (a pid file), 

In my original amflush-during-amdump test the specific error I noticed 
was that amflush attempted to flush .tmp files out of the holding
disk (which in turn caused the taper process to crash).

The pid-lock fixes will reduce the chance of that problem, but if (for
example) an amdump process dies and leaves .tmp files in
the holding disk, I think amflush would still try to flush them.

Would it make sense for
Amholding.pm:get_files_for_flush()/$each_file_fn() to check the filename
and skip any that end in ".tmp" (similar to the check that exists in
holding.c:holding_get_walk_fn() )?

Nathan


Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 15:09:25 -0500, Jean-Louis Martineau wrote:
> perl automatically close it when it come out of scope.
> The close should before the 'if (kill' line.

Okay, I moved the close() line up to that spot as I applied the patch.

I can confirm that with your patch applied and my followup patch
applied, if I kick off amflush while amdump is running in the
background, it (amflush) prints out
  Could not find any Amanda directories to flush.
and exits without attempting to flush anything but if I try again
after amdump finished, I'm able to complete the flush successfully.


A few followup points that came up during my testing:

* when amflush completes now, the pid file is left out there in the
  holding directory, thus preventing the directory from getting deleted:

=
# ls -lR /amanda/TestBackup-holding/*
/amanda/TestBackup-holding/20171120172656:
total 4
-rw--- 1 backup backup 5 Nov 21 18:50 pid

/amanda/TestBackup-holding/20171121145009:
total 4
-rw--- 1 backup backup 5 Nov 21 18:50 pid
=

  Off hand I don't see code in amflush or Amflush.pm trying to clean up
  the holding disk directories so I don't know if these "zombie" pid
  files are related to the third issue (i.e. holding_cleanup_dir()),
  but...

* ...as a more general question I was wondering if it made sense
  for programs that take a lock on a holding directory to have a way to
  explicitly release that lock when they are done?  (Probably this
  would be by deleting the pid file, thought it could instead be
  truncating it to zero length, or something.)

  The main reason I was wondering about this is that with the current
  behavior of leaving a particular pid in the pid file forever, it
  seems like there's a good chance of false positives when the lock is
  checked by later jobs (i.e. some other, unrelated process using the
  same pid when the lock-check happens).  If there were some way for a
  process to release the lock, false positives could only happen after a
  process crashed.

  (I don't know how often these false positives would happen in
  practice, but I can imagine it would cause a headache if the pid
  locking a directory containing dumps needing to be flushed to vault
  storage ended up getting reused by some long-runing process...)

* I noticed that  holding.c:holding_cleanup_dir() doesn't include parent
  pid in the pid-file check.  I haven't tracked back to figure out which 
  applications call the holding_cleanup* family of functions so I don't
  know if that will cause actual problems, but I was wondering if that 
  function should instead just call call_take_holding() instead, so the
  logic is all found in one place?


Nathan





Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Tue, Nov 21, 2017 at 12:57:00 -0500, Nathan Stratton Treadway wrote:
> > I do not see how Amanda::Util::setup_application call code in holding.c
> 
> To be precise what I actually saw was activity involving the "pid" files
> in the holding directory showing up in an strace log.  I am not sure how
> it got called, but assume it was coming from holding.c because it had
> the arguments for the kill syscall in the correct order (and thus I
> believe the activity couldn't have been from the then-in-place
> Holding.pm code).  I will investigate further and let you know what I
> find.  
> 

Okay, sorry for the confusion -- now that I have paid careful attention
to the part of amflush prior to the Holding::get_all_datestamps() call,
I see that it's not setup_application() that updates the pid files, but
rather Disklist:add_holding_to_disklist().

>From there the program flow appears to be:
  Amanda::Logfile::search_holding_disk -> find.c:search_holding_disk()
  -> holding.c:holding_get_files() -> holding.c:holding_walk() ->
  holding_dir_stop_if_pid_fn() -> take_holding_pid()  

(Obviously the above flow only applies after the holding-pid-3.5.diff
patch has been applied.)

In any case, the pid file is updated by this call in the early part of
the amflush run (see below), which explains why the patch for the
Holding.pm logic needed to include support for the pid file pointing to
current/parent pid


(In order to confirm the origin of the pid updates, I added 
  system("find /amanda/TestBackup-holding -name pid -ls");
and a print-message line immediately before and after the call to
add_holding_to_disklist(), and here is the resulting output:

=
# su backup -lc "amflush TestBackup"
  5636101  4 -rw---   1 backup   backup  5 Nov 21 18:29 
/amanda/TestBackup-holding/20171121145009/pid
  5636099  4 -rw---   1 backup   backup  5 Nov 21 18:29 
/amanda/TestBackup-holding/20171120172656/pid
 calling Disklist:add_holding_to_disklist 
 returned from Disklist:add_holding_to_disklist 
  5636101  4 -rw---   1 backup   backup  5 Nov 21 18:32 
/amanda/TestBackup-holding/20171121145009/pid
  5636099  4 -rw---   1 backup   backup  5 Nov 21 18:32 
/amanda/TestBackup-holding/20171120172656/pid
Flushing dumps from 20171121145009 using storage "TestBackup", tapechanger "".
To volume TESTBACKUP-06 or a new volume,  (The last dumps were to volume 
TESTBACKUP-03
Flushing dumps from 20171121145009 using storage "TestOffsite", tape changer 
"doxpop_offsite_vtapes".
To volume TESTBACKUP-109 or a new volume,  (The last dumps were to volume 
TESTBACKUP-03

Are you sure you want to do this [yN]? 
=

[I know from my earlier debugging that the value put into  the updated
pid file is actually the parent pid.])
Nathan



Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Tue, Nov 21, 2017 at 11:45:52 -0500, Jean-Louis Martineau wrote:
> There is no patch attached on your email.

Ah, sorry, attached here.

> 
> I do not see how Amanda::Util::setup_application call code in holding.c

To be precise what I actually saw was activity involving the "pid" files
in the holding directory showing up in an strace log.  I am not sure how
it got called, but assume it was coming from holding.c because it had
the arguments for the kill syscall in the correct order (and thus I
believe the activity couldn't have been from the then-in-place
Holding.pm code).  I will investigate further and let you know what I
find.  

(In any case, I am sure that the pid read from the holding-directory pid
file during the get_all_datestamps() call was the parent pid of the
current process [and of course the file had not contained that value
before amflush was invoked].)


Nathan


Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239
--- Holding.pm.orig	2017-11-20 16:51:32.0 -0500
+++ Holding.pm	2017-11-21 02:10:47.992749021 -0500
@@ -25,6 +25,7 @@
 use File::stat;
 use IO::Dir;
 use POSIX qw( :fcntl_h );
+use Errno;
 use Math::BigInt;
 use strict;
 use warnings;
@@ -198,6 +199,36 @@
 return 1;
 }
 
+# check to see if the given pid identifies a process that could
+# validily be holding the Amanda directory lock (thus preventing 
+# us from taking it).  (See also holding.c:can_take_holding() for
+# related logic..)
+sub _valid_locking_pid {
+my $pid_to_check = shift;
+
+if ( $pid_to_check == $$ || $pid_to_check == getppid() ) {
+# this process or the parent already hold the lock, so we
+# can go ahead and use the directory.
+return 0;
+}
+
+if (kill(0, $pid_to_check) == 0 ) {
+if ($! == Errno::EPERM) {
+	# the process exists  (but we don't have permission to kill it) 
+	return 1;
+}
+else {
+	# (if we reach here, presumably the errno was ESRCH)
+	# the process does not exist; we can use the directory
+	return 0;
+} 
+}
+else {
+	# kill succeeded, so process does exist
+	return 1;
+}
+}
+
 sub _walk {
 my ($file_fn, $verbose, $take_pid) = @_;
 
@@ -233,8 +264,10 @@
 	if (open(my $pidh,  $pidfn)) {
 		my $pid = <$pidh>;
 		close($pidh);
-		if (kill($pid, 0) == 0) {
-		# pid is alive, skip this directory
+
+		if ( _valid_locking_pid($pid) ) {
+		# $pid seems valid, so we skip over this directory
+		print $verbose "skipping directory '$dirfn' due to lock by pid $pid. \n" if $verbose;
 		next;
 		}
 	}


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Jean-Louis Martineau
Nathan,

There is no patch attached on your email.

I do not see how Amanda::Util::setup_application call code in holding.c

Jean-Louis

On 21/11/17 02:52 AM, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 19:03:52 -0500, Nathan Stratton Treadway wrote:
> > On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> > > + if (open(my $pidh, $pidfn)) {
> > > + my $pid = <$pidh>;
> > > + if (kill($pid, 0) == 0) {
> > > + # pid is alive, skip this directory
> > > + next;
> > > + }
> >
> > looks like the order of arguments for the kill() function is reversed in
> > Perl, so the above code fails. I will send a patch with a fix shortly.
>
> When I tried using amflush after applying this patch, I found that
> it was never finding any directories it was willing to flush.
>
> When I debugged a little, I discovered that the Perl kill function
> actually has arguments in reverse order:
> kill SIGNAL, LIST
>
> However, even after fixing that function call amflush still didn't find
> any directories it was willing to flush. Debugging further revealed
> that the "Amanda::Util::setup_application()" call at the top of the
> amflush script ends up (via holding.c) taking over the pid file -- but
> using the parent pid rather than the current process pid. In the case
> of amflush, the parent pid is the shell that invoked amflush, so it's
> still running when the program flow reaches the call to
> Amanda::Holding::get_all_datestamps() , and thus the _walk function
> always thought the directory-lock was still validly held.
>
> The attached patch attempts to fix those problems... but the logic to
> deal with all the above conditions was complex enough that I figured it
> would work better to move it into a separate function.
>
> I have not fully tested this patch yet, but figured I'd send it back so
> you can take a look at it and see if it seems liek the right approach.
>
> (For example, let me know if you can think of a better name than
> "_valid_locking_pid".)
>
>
> Nathan
>
> p.s. I based the kill() ==0/$! == Errno::EPERM logic off of the
> discussion found here:
> https://unix.stackexchange.com/a/294990 
> 
> (i.e. the answer by Stephen Harris found in:
> https://unix.stackexchange.com/questions/294984/how-should-i-check-whether-a-given-pid-is-running/
>  
> 
> )
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail