On Fri, Aug 12, 2016 at 10:47 AM, Francisco Olarte <fola...@peoplecall.com> wrote:
> CCing to the list...
>

Thanks

> On Fri, Aug 12, 2016 at 4:10 PM, Ioana Danes <ioanada...@gmail.com> wrote:
> >> given 318220 and 318216 are just a bit away ( 4db08/4db0c ), and it
> >> repeats sporadically, have you ruled out ( by having page checksums or
> >> other mechanism ) a potential disk read/write error ?
> >>
> >> > Also the index is correct on db3 as the record in case (with drawid =
> >> > 318216) is retrieved if I filter by drawid = 318220
> >>
> >> Especially if this happens, you may have some slightly bad disks/ram
> >> leading to this kind of problem.
> >>
> >
> > Could be. I also had some issues with an rsync between db3 and drdb a week
> > ago that did not complete for bigger files (> 200MB) and gave me some
> > corruption messages. Then the system was rebooted and everything seemed
> > fine, but apparently it is not.
> > I am planning to drop & create the table from a good backup, and if that
> > does not fix the issue then I will rebuild the server.
>
> I would check whatever logs you can ( syslog or eventlog, smart log,
> etc. ) hunting for disk errors ( sometimes they are reported ). This
> kind of problem, with programs as tested as postgres and rsync, tends
> to indicate controller/RAM/disk going bad ( in your case it could be
> caused by a single bit getting flipped in a sector for the data
> portion of the table, and not being propagated either because it
> happened after your sync of drdb or because it was synced from the WAL
> and not the table, or because it was read from the disk cache ).
>

I agree. Unfortunately, I did not find any clues about corruption or any
anomalies in the logs.
I will work tonight to rebuild that table and see where I go from there.

Thanks,
ioana

> Francisco Olarte.
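
For anyone reading the thread later, the "just a bit away" remark can be checked with a few lines of Python. This is only a sketch for illustration: the helper name differing_bits is made up here, and the two numbers are the drawid values quoted above. It shows that they differ in exactly one bit, which is consistent with (but does not prove) a single flipped bit in storage or memory.

```python
def differing_bits(a, b):
    """Return the positions (0 = least significant) of the bits where a and b differ."""
    diff = a ^ b  # XOR leaves a 1 only where the two values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# The two drawid values from the thread.
print(hex(318220), hex(318216))        # 0x4db0c 0x4db08
print(differing_bits(318220, 318216))  # [2]
```

A single entry in the returned list means the values are one bit flip apart; here it is bit 2, i.e. the value 4.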