On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander <mag...@hagander.net>wrote:
> > On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakan...@vmware.com> > wrote: > > > > On 20.02.2013 02:11, Selena Deckelmann wrote: > >> > >> So, I just ran into a similar issue backing up a 9.2.1 server using > >> pg_basebackup version 9.2.3: > >> > >> pg_basebackup: starting background WAL receiver > >> pg_basebackup: streaming header too small: 25 > >> > >> > >> I've had it happen two times in a row. I'm going to try again... > >> > >> But -- what would be helpful here? I can recompile pg_basebackup with > more > >> debugging... > > > > > > Hmm, 25 bytes would be the size of the WAL data packet, if it contains > just the header and no actual WAL data. I think pg_basebackup should accept > that - it's not unreasonable that the server might send such a packet > sometimes. > > > > Looking at the walsender code, it's not supposed to ever send such a > packet. But I suspect there's one corner-case where it might: if the > current send location is at an xlogid boundary, so that we previously sent > the last byte from the last WAL segment in the previous logical xlog file, > and the WAL flush position points to byte 0 in the beginning of the new WAL > file. Both of those positions are in fact the same thing, but we have two > different ways to represent the same position. For example, if we've > already sent up to WAL position (sentPtr in walsender.c): > > > > xlogid = 4 > > xrecoff = XLogFileSize > > > > and GetFlushRecPtr() returns: > > > > xlogid = 5 > > xrecoff = 0 > > > > Those both point to the same position. But the check in XLogSend that > decides if there is any work to do uses XLByteLE() to check if they are > equal, and XLByteLE() treats the latter to be greater than the former. So, > in that situation, XLogSend() would decide that it has work to do, but > there actually isn't, so it would send 0 bytes of WAL data. > > > > I'm not sure how GetFlushRecPtr() could return such a position, though. > But I'm also not convinced that it can't happen. > > > > It would be fairly easy to fix walsender to not send anything in that > situation. It would also be easy to fix pg_basebackup to not treat it as an > error. We probably should do both. > > > > In 9.3, the XLogRecPtr representation changed so that there is only one > value for a boundary position like that, so this is a 9.2-only issue. > Thanks for the debugging, Heikki! > That does sound like a reasonable explanation and fix. Heck, probably > enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway. > > But I'd really like to confirm this is the actual situation before > considering it fixed, since it's clearly very intermittent. > > Selena, was this reasonably reproducible for you? Would it be possible to > get a network trace of it to show of that's the kind of package coming > across, or by hacking up pg_basebackup to print the exact position it was > at when the problem occurred? > This is happening with a very busy 700 GB system, so I'm going to rule out a network trace out for the moment. The error is occurring "sometime" in the middle of the backup. Last time it was at least 30-40 minutes into a 2 hr backup. I'll see about hacking up pg_basebackup. I'm doing nightly backups so hopefully (?) it will happen again in the next few days. -selena -- http://chesnok.com