Re: Newbie question - rsync pauses after 500 - 800 files

2000-10-16 Thread Neil Schellenberger
1) Signal handler reentrancy.

   Avoid "classic" signal handlers if at all
   possible (!).  If non-trivial behaviour is required in response to
   a signal, try to arrange the program architecture around a work
   loop which (among doing its usual stuff) tests a flag (set by the
   signal handler as its only function).

   With the advent of "official" Posix semantics for signal service in
   threads, multi-threaded programs can get away with masking all
   signals in all threads, dedicating the "main()" thread to serially
   handle signals using sigpause() or somesuch, and doing the real
   work in other threads (leaving the inter-thread communication as an
   exercise for the student).

   

   But I doubt that this is the actual timeout problem itself, and I
   haven't time to figure out how to re-architect around it so I'll
   just plunk it on the table here and leave it for the experts to
   ruminate upon.

2) Tends to blow chunks under load.

   Running all of the modules at once seems to be a big problem.  I
   can't make any reasonable statements about memory/swap use until I
   can get physical access to the server machine, but I have no
   particular reason from the syslog on the server to suspect
   out-of-memory.

   Nonetheless, the one module that still fails consistently is
   probably the largest in total number of files (around 220k files).
   It may also be the largest in terms of file space (don't have a
   good estimate on hand).

3) Server actually appears to be exiting relatively "cleanly".

   I always seem to see the following kind of neighbouring syslogs:

  rsyncd[750]: wrote 432468 bytes  read 1325 bytes  total size 2021960482
  rsyncd[750]: transfer interrupted (code 11) at main.c(272)

   The fact that we see the "total size" message suggests to me that
   we got at least as far as the log_exit() call at the start of
   report().  The fact that I don't ever see the requested statistics
   at the client end suggests that the second message is generated by
   an io_error sometime before/during the stats transmission.  The fact
   that we made it as far as main:272 suggests a "clean" exit with an
   io_error=1 (remapped to code=RERR_FILEIO by _exit_cleanup) during
   stats transmission.  Unfortunately, I don't see any evidence of the
   rprintf() messages that seem to always be linked with setting
   io_error=1.  Mystifying.

   Note that I do not yet have hard proof that this is what's
   happening.  Until I can get a truss, snoop, and pstack output from
   the server running with -, I'm basically working without a
   net.  In fact, I can't even easily correlate the server and client
   logs because the time bases are not necessarily in sync.  Sigh.

4) Compiler optimiser bug, sparc architecture?

   Some autoconf-based packages are currently being released with
   auto-detection of gcc 2.95 and are forcing off optimisation.  I
   have been unable from a quick investigation to determine exactly
   what bug it is they are trying to work around (or indeed if this is
   simply hysteria caused by the aliasing-related optimisations being
   applied to code that was incorrect to start with.)

   That having been said, there were definitely serious optimiser bugs
   in egcs (the precursor to 2.95), and according to the change
   history these were still being fixed post 2.95.1.

Gotta run.  My ride's waiting (@%$@! broken ankle).  I welcome any and
all criticism.


Regards,
Neil

-- 
Neil Schellenberger | Voice : (613) 599-2300 ext. 8445
CrossKeys Systems Corporation   | Fax   : (613) 599-2330
350 Terry Fox Drive | E-Mail: [EMAIL PROTECTED]
Kanata, Ont., Canada, K2K 2W5   | URL   : http://www.crosskeys.com/
+ Greg Moore (1975-1999), Gentleman racer and great Canadian +




unexpected EOF hypothesis

2000-10-17 Thread Neil Schellenberger
ed error message
go?  Is io_multiplex_write() somehow eating it?  Or is it a red
herring of some kind?

Anyway, I'll be testing this out with a test client/server pair using
a small fileset and a very short timeout as soon as I can get it set
up.  (I don't have the disk or network resources to duplicate the 100G
problem locally and, regrettably, --timeout is the only option
unconditionally overridden at the server end, meaning that I can't just
force it from my end.)

I just wanted to get this posted so that I can start the day tomorrow
with the "overnight" reaction from those in other time zones.  In
particular, is anyone who is using --timeout=0 (the default) on the
server nonetheless seeing unexpected EOF?


Regards,
Neil





unexpected EOF and the preparation and serving of crow

2000-10-18 Thread Neil Schellenberger


Folks,

Just in case there is any doubt remaining: yes, I am an idiot.

As was probably obvious to everyone but me, in the specific case that
I have bored everyone to tears with, I am running afoul of a feature
that I myself enabled (i.e. timeouts).  That is to say, the analysis
was correct, but completely pointless.  My only real problem is that
my timeout is too short.

In my defense, I can only say that it _used_ to be the case that the
client got an error message from the server upon timeout.  This
message seems to have been disabled in the 2.4.3 timeframe because it
caused other problems.  The absence of this message was what I think
misled me.  (It also used to be the case that my timeout was plenty
long enough.  Then a lot more files were added.)

I am sincerely sorry to have wasted everyone's time (including my
own!) without materially moving the investigation ahead.

So, for those interested in the morals I've learned:

 1) if I set a timeout, make darn sure it's long enough
because the client side error messages will be less than
informative in the event of a timeout;
 2) make sure that I'm looking at the right part of the server logs
(time skew notwithstanding) before I send kilobytes of drivel to
the mailing list; and
 3) do not repeatedly cry "The sky is falling! The sky is falling!".

Damn.  I really thought I had it this time.  Back to the drawing
board.  (The signal handling reentrancy issue is still a genuine
problem, though.)

In contrition, I submit the following patch in the hope that it will
help others interpret the log a bit more quickly.

--- log.c.orig  Sat Jan 29 06:35:03 2000
+++ log.c   Wed Oct 18 14:01:26 2000
@@ -25,6 +25,29 @@
 
 static FILE *logfile;
 static int log_error_fd = -1;
+static const struct errdesc { int c; const char *s; const char *d; } errcodes[] = {
+  { RERR_SYNTAX,  "RERR_SYNTAX",  "syntax or usage error" },
+  { RERR_PROTOCOL,"RERR_PROTOCOL","protocol incompatibility" },
+  { RERR_FILESELECT,  "RERR_FILESELECT",  "errors selecting input/output files, dirs" },
+  { RERR_UNSUPPORTED, "RERR_UNSUPPORTED", "requested action not supported" },
+  { RERR_SOCKETIO,"RERR_SOCKETIO","error in socket IO" },
+  { RERR_FILEIO,  "RERR_FILEIO",  "error in file IO" },
+  { RERR_STREAMIO,"RERR_STREAMIO","error in rsync protocol data stream" },
+  { RERR_MESSAGEIO,   "RERR_MESSAGEIO",   "errors with program diagnostics" },
+  { RERR_IPC, "RERR_IPC", "error in IPC code" },
+  { RERR_SIGNAL,  "RERR_SIGNAL",  "status returned when sent SIGUSR1, SIGINT" },
+  { RERR_WAITCHILD,   "RERR_WAITCHILD",   "some error returned by waitpid()" },
+  { RERR_MALLOC,  "RERR_MALLOC",  "error allocating core memory buffers" },
+  { RERR_TIMEOUT, "RERR_TIMEOUT", "timeout in data send/receive" },
+};
+
+static const struct errdesc *geterrdesc(int code)
+{
+int i;
+for ( i = 0; i < sizeof(errcodes)/sizeof(*errcodes); ++i )
+if ( code == errcodes[i].c ) break;
+return ( i < sizeof(errcodes)/sizeof(*errcodes) ? &errcodes[i] : NULL );
+}
 
 static void logit(int priority, char *buf)
 {
@@ -324,12 +347,16 @@
 /* called when the transfer is interrupted for some reason */
 void log_exit(int code, const char *file, int line)
 {
+const struct errdesc *pd = NULL;
if (code == 0) {
extern struct stats stats;  
rprintf(FLOG,"wrote %.0f bytes  read %.0f bytes  total size %.0f\n",
(double)stats.total_written,
(double)stats.total_read,
(double)stats.total_size);
+} else if ( (pd = geterrdesc(code)) != NULL ) {
+    rprintf(FLOG,"transfer interrupted - %s (code %d, %s) at %s(%d)\n",
+pd->d, code, pd->s, file, line);
} else {
rprintf(FLOG,"transfer interrupted (code %d) at %s(%d)\n", 
code, file, line);

Regards,
Neil





Re: I also am getting hang/timeout using rsync 2.4.6 -e ssh

2000-10-19 Thread Neil Schellenberger

>>>>> "ian" == ian stanley <[EMAIL PROTECTED]> writes:

ian> Anybody got any ideas why i could rsync without problems
ian> using 2.4.6 a few weeks ago and not now?

Ian,

Has the total "expected" transfer time increased over the past few
months?  My impression is that most of the EOF/hang problems people
are seeing seem related to "large" transfers.

Regards,
Neil





Re: I also am getting hang/timeout using rsync 2.4.6 --daemon

2000-10-19 Thread Neil Schellenberger

>>>>> "Eric" == Eric Whiting <[EMAIL PROTECTED]> writes:

Eric> Forgot to say that I love rsync. :) I don't mean to sound
Eric> like a complainer or rsync-grump. I'm just trying to help
Eric> out a little bit.  All these reports of problems are not
Eric> show-stoppers and only seem to affect very large transfers.

Even though for me this has been pretty nearly a show-stopper, add me
to the list of rsync lovers.  I'm sticking with it for several reasons
but in particular, the algorithm is superbly suited to the kind of
data we're transferring (huge data set, small deltas) and the
bandwidth/hardware that we have available for this task.

With open-source-ish software, we're only going to get out what we
(well, Andrew and Dave primarily ;-) put in...  If it's broke, it's up
to the users to fix it.

Now if we could just figure out why it keeps stalling randomly...

Regards,
Neil





OpenSSH hang (was Re: I also am getting hang/timeout using rsync 2.4.6 --daemon)

2000-10-19 Thread Neil Schellenberger


Eric,

Since the poll is nfds=0 and timeo=20 (i.e. almost certainly
msleep(20)), and since waitpid is looking for pid 17408, this really
has to be the call of wait_process() at main.c:532 where rsync is
(apparently) waiting for ssh to die.

The reason the --timeout has no effect is that only io_flush() is
being called in this loop; since there is (presumably) nothing more to
be written, the usual I/O loop stuff (including check_timeout()) is
not being called.
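To make the failure mode concrete, here is a hypothetical analog of that
loop (not the actual rsync code): nothing in it ever consults a
deadline, which is exactly why --timeout cannot fire at this stage.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Poll for the child's exit roughly every 20ms, the way the truss
 * output suggests (the nfds=0, timeout=20 poll is just a sleep).
 * Nothing here checks elapsed time, so if the child never exits,
 * neither do we. */
static int wait_without_timeout(pid_t pid)
{
    int status = 0;
    while (waitpid(pid, &status, WNOHANG) != pid) {
        usleep(20 * 1000);
        /* rsync calls io_flush() here; with nothing left to write it
         * is a no-op, so check_timeout() is never reached */
    }
    return status;
}
```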

If you can manage it, both pstack and pfiles output would be useful to
check if my guess of main.c:532 is right and to see if the pipe to ssh
is still open.  If the pipe is still open, we may have our culprit (it
doesn't realise it should be exiting?).  If it's closed, we'll need to
know what ssh is up to...

Regards,
Neil





RE: rsync exit codes

2001-05-30 Thread Neil Schellenberger
( am_server ? 1 : 0 );
+need_local = 1;
+break;
+
+case FLOG:
+f = NULL;
+need_remote = 0;
+need_local = ( am_daemon ? 1 : 0 );
+break;
+
+case FERROR:
+f = stderr;
+need_remote = ( am_server ? 1 : 0 );
+need_local = 1;
+break;
+
+default:
+f = stderr;
+need_remote = ( am_server ? 1 : 0 );
+need_local = 1;
+break;
 
-/* then try to pass it to the other end */
-if (am_server && io_multiplex_write(code, buf, len)) {
-return;
 }
 
-if (am_daemon) {
-static int depth;
-int priority = LOG_INFO;
-if (code == FERROR) priority = LOG_WARNING;
+if ( need_local ) {
 
-if (depth) return;
+if ( am_daemon ) {
 
-depth++;
+int priority = ( code == FERROR ? LOG_WARNING : LOG_INFO );
 
-log_open();
 logit(priority, buf);
 
-depth--;
-return;
+} else {
+
+if (!f) exit_cleanup(RERR_MESSAGEIO);
+
+if (fwrite(buf, len, 1, f) != 1) exit_cleanup(RERR_MESSAGEIO);
+
+if (buf[len-1] == '\r' || buf[len-1] == '\n') fflush(f);
+
 }
 
-if (code == FERROR) {
-f = stderr;
 }
 
-if (code == FINFO) {
-if (am_server)
-f = stderr;
-else
-f = stdout;
+if ( need_remote ) {
+
+if ( depth <= 1 ) {
+
+int ok = 0;
+
+/* first try to pass it off to our sibling */
+if ( ! ok )
+ok = io_error_write(log_error_fd, code, buf, len);
+
+/* then try to pass it to the other end */
+    if ( ! ok )
+ok = io_multiplex_write(code, buf, len);
+
 }
 
-if (!f) exit_cleanup(RERR_MESSAGEIO);
+}
 
-if (fwrite(buf, len, 1, f) != 1) exit_cleanup(RERR_MESSAGEIO);
+--depth;
+
+return;
 
-if (buf[len-1] == '\r' || buf[len-1] == '\n') fflush(f);
 }
 
 



Regards,
Neil

-- 
Neil Schellenberger | Voice : (613) 599-2300 ext. 8445
Orchestream | Fax   : (613) 599-2330
350 Terry Fox Drive | E-Mail: [EMAIL PROTECTED]
Kanata ON, Canada, K2K 2W5  | URL   : http://www.orchestream.com/



Re: rsync exit codes

2001-05-30 Thread Neil Schellenberger

>>>>> "Dave" == Dave Dykstra <[EMAIL PROTECTED]> writes:

Dave> The source currently in rsync's CVS already does this.
Dave> Martin put that in.

Apologies for the duplicate submission - I have several "local"
patches that I use, and I don't keep good records on which I've
already submitted.

>> (e.g. I seem to remember Dave Dykstra mentioning that certain
>> server errors were not logged to the client for security
>> reasons - I may have busted that, sorry).

Dave> Martin & Tridge will need to decide about that patch.  I
Dave> think Tridge will probably not want all the messages sent to
Dave> the client of a daemon.

Sounds totally reasonable to me.  Let me know if I can be of any help.

For the most part, the code behaves exactly as it did - I just tried
to "tidy it up" a bit.  The motivation, though, was trying to debug
the hang problem - I was trying to ensure that I was seeing every last
error message.  I believe that the only functional change comes from
the bit that tries the remote if (and only if) the sibling is toast
(and that bit is easy to disable by itself).

FWIW, I still get hangs w/daemon mode solaris26-to-solaris26 rsyncs of
large filesystems.  Owing to extreme administrative and political
difficulty of getting access to the far end server (sigh), I have been
unable to really pursue it any further.  I may try the recent patch
from Wayne Davison to see if that helps any.  (In my copious spare
time.)  In the meantime, I just do the syncs in smaller "chunks".


Regards,
Neil






Re: Improving the rsync protocol (RE: Rsync dies)

2002-05-22 Thread Neil Schellenberger
  o  Checksumming could be handled by a scanner which
 maintains a persistent data store of inodes, paths, checksums
 etc.  Perhaps a tie-in/integration with Tripwire, AIDE, or mhash?
 This might also be generally useful as a performance boost for
 those with large, relatively static, trees.  (Like me. :-)

  o  Jos's batch ideas could be implemented simply as a capture of the
 generator output to be played back into the sender.  If we could
 fix it so that the file could be played back in parts, we could
 take some of the sting out of protocol errors two hours into a
 big rsync.

  o  Perhaps the existing over-the-wire protocol could be emulated
 using a special "adaptor" process at the end of the sender
 pipeline?  I don't know how feasible it would be, but it would
 allow for backward OTW compatibility with older clients while
 allowing progress in the core.

To address the "Big Bang" problem, perhaps existing code could be
reused to provide initial implementations of each of the components?
Refactoring anyone?  (Have I used enough Software Engineering
buzzwords yet or what?  Ack.)

So, the overall goal would be to increase reliability and to encourage
third party auditing/reimplementation by providing (conceptually)
smaller, simpler, and more focused pipeline stages.

Many small tools!  Many small tools!  Many small tools!

[Thwack.  Ooof.  Sorry.  I feel much better now.]

Basically the less stuff in the core, the better the odds of being
able to get it working properly.


Regards,
Neil

P.S.  JW Schulz's and Bob Bagwill's posts came in while I was writing
this.  Some weird synergy thing going on out there

-- 
Neil Schellenberger  | Voice : (613) 599-2300 ext. 8445
Orchestream Americas Corp.   | Fax   : (613) 599-2330
350 Terry Fox Drive  | E-Mail: [EMAIL PROTECTED]
Kanata ON, Canada, K2K 2W5   | URL   : http://www.orchestream.com/
