2011/6/17 Антон Степаненко <zlobnyni...@yandex.ru>:
> 17.06.2011, 21:24, "Merlin Moncure" <mmonc...@gmail.com>:
>> 2011/6/17 Антон Степаненко <zlobnyni...@yandex.ru>:
>>
>>>  17.06.2011, 20:19, "Merlin Moncure" <mmonc...@gmail.com>:
>>>>  On Fri, Jun 17, 2011 at 10:56 AM, Kevin Grittner
>>>>  <kevin.gritt...@wicourts.gov> wrote:
>>>>>>   I still do not believe that this is a hardware problem.
>>>>>   How would an application cause a bus error?
>>>>  unaligned memory access on risc maybe?  what's this running on?
>>>>
>>>>  merlin
>>>  *****:~$ cat /proc/cpuinfo
>>>  processor       : 0
>>>  ....
>>>  processor       : 23
>>>  vendor_id       : GenuineIntel
>>>  cpu family      : 6
>>>  model           : 44
>>>  model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
>>
>> hm, I'm wondering if this
>> (http://us.generation-nt.com/bug-626451-linux-image-mremap-returns-useless-pages-moving-anonymous-shared-mmap-access-causes-sigbus-help-203302832.html)
>> has anything to do with your problem.
>>
>> merlin
>
> Thank you very much, that is a very interesting link. I've compiled it under
> my Ubuntu Lucid - it really does cause SIGBUS. But when compiled under CentOS
> (kernel 2.6.18) it does the same, so I am not sure that this is a bug.
> And even if it is - why does it occur only when buffers are set to 12GB and
> filled...
> I've read some of the PostgreSQL sources, e.g. /src/backend/storage/smgr/md.c:
> void
> mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>           char *buffer)
> {
> ..
>        if (nbytes != BLCKSZ)
>        {
>                if (nbytes < 0)
>                        ereport(ERROR,
>                                        (errcode_for_file_access(),
>                                         errmsg("could not read block %u in file \"%s\": %m",
>                                                        blocknum, FilePathName(v->mdfd_vfd))));
>
>                /*
>                 * Short read: we are at or past EOF, or we read a partial block at
>                 * EOF.  Normally this is an error; upper levels should never try to
>                 * read a nonexistent block.  However, if zero_damaged_pages is ON or
>                 * we are InRecovery, we should instead return zeroes without
>                 * complaining.  This allows, for example, the case of trying to
>                 * update a block that was later truncated away.
>                 */
>                if (zero_damaged_pages || InRecovery)
>                        MemSet(buffer, 0, BLCKSZ);
>                else
>                        ereport(ERROR,
>                                        (errcode(ERRCODE_DATA_CORRUPTED),
>                                         errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
>                                                        blocknum, FilePathName(v->mdfd_vfd),
>                                                        nbytes, BLCKSZ)));
>        }
> }
>
> This is the only place that reports errors like 'could not read block in file'.
> Then I looked at /src/backend/storage/file/fd.c:
> int
> FileRead(File file, char *buffer, int amount)
> {
> ..
> retry:
>        returnCode = read(VfdCache[file].fd, buffer, amount);
>
>        if (returnCode >= 0)
>                VfdCache[file].seekPos += returnCode;
>        else
>        {
>                /*
>                 * Windows may run out of kernel buffers and return "Insufficient
>                 * system resources" error.  Wait a bit and retry to solve it.
>                 *
>                 * It is rumored that EINTR is also possible on some Unix filesystems,
>                 * in which case immediate retry is indicated.
>                 */
> #ifdef WIN32
>                ...
> #endif
>                /* OK to retry if interrupted */
>                if (errno == EINTR)
>                        goto retry;
>
>                /* Trouble, so assume we don't know the file position anymore */
>                VfdCache[file].seekPos = FileUnknownPos;
>        }
>
>        return returnCode;
> }
>
> First, the comment starting with 'It is rumored' looks suspicious =) But I am
> not a kernel developer, I am not even a C++ developer, so I trust the authors.
> I've read 'man read' and 'man 7 signal', and they say that syscalls can be
> interrupted by some signals, including SIGBUS, but when that happens, they
> should return to normal behaviour.
> "the call will be automatically restarted after the signal handler returns if
> the SA_RESTART flag was used; otherwise the call will fail with the error
> EINTR" - from man 7 signal
> So as far as I understand, even if postgresql gets signal 7 it should
> see EINTR and retry immediately. What I am trying to say is that I do
> not know why I am getting SIGBUS, but no matter where it comes from,
> according to the sources postgresql should just try to read one more time,
> and one more, and so on until the read succeeds. But I'm not quite sure what
> happens first - the SIGBUS or the 'could not read block' error.
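
A note on the EINTR reasoning above, with a small sketch. EINTR is only
reported when a signal handler actually ran and returned while read() was
blocked. A SIGBUS raised by a faulting memory access is a different thing:
assuming no handler is installed for it, the default action terminates the
process, so a retry loop never gets the chance to run. The helper below is a
hypothetical stand-alone version of the retry pattern in the quoted FileRead();
the name read_retry_eintr is made up for illustration and is not PostgreSQL
code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the PostgreSQL sources): read up to
 * `amount` bytes from `fd`, retrying only when read() fails with EINTR,
 * i.e. a signal handler ran and returned before any data was transferred.
 *
 * Returns the number of bytes read (0 means EOF), or -1 with errno set
 * to the real error.  A fatal signal such as an unhandled SIGBUS never
 * shows up here as EINTR; the process is simply terminated.
 */
static ssize_t
read_retry_eintr(int fd, void *buffer, size_t amount)
{
    ssize_t rc;

    assert(buffer != NULL);

    do {
        rc = read(fd, buffer, amount);
    } while (rc < 0 && errno == EINTR);

    return rc;
}
```

So the loop only covers interrupted calls. Any other failure (EIO, EFAULT,
EBADF, ...) comes back as -1 on the first attempt and would be reported by the
caller, which is consistent with seeing a 'could not read block' error rather
than an endless retry.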

I wonder if you are oversubscribing your memory, and are getting weird
errors when reading data into memory because the pages can't be
reserved to do that.  What happens when you enable overcommit and
attempt to start the server?
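
If it helps, here is a rough sketch for checking the current overcommit
state on Linux before changing anything. It reads
/proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always, 2 = strict) and
compares CommitLimit with Committed_AS from /proc/meminfo. The function names
are made up for this example, and it assumes the usual "Key:   value kB"
layout of /proc/meminfo.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find a "Key:   12345 kB" line in a meminfo-style
 * stream and store the number in *out_kb.
 * Returns 0 on success, -1 if the key is absent or malformed.
 */
static int
meminfo_value_kb(FILE *fp, const char *key, long *out_kb)
{
    char   line[256];
    size_t keylen;

    assert(fp != NULL && key != NULL && out_kb != NULL);
    keylen = strlen(key);

    rewind(fp);                 /* allow repeated lookups on one stream */
    while (fgets(line, sizeof line, fp) != NULL)
    {
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
            return (sscanf(line + keylen + 1, "%ld", out_kb) == 1) ? 0 : -1;
    }
    return -1;
}

/*
 * Hypothetical helper: print the overcommit mode and commit accounting.
 * Returns 0 on success, -1 if the /proc files cannot be read or parsed.
 */
static int
report_overcommit(void)
{
    FILE *fp;
    long  limit_kb, committed_kb;
    int   mode, ok;

    fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", &mode) == 1);
    fclose(fp);
    if (!ok)
        return -1;

    fp = fopen("/proc/meminfo", "r");
    if (fp == NULL)
        return -1;
    ok = meminfo_value_kb(fp, "CommitLimit", &limit_kb) == 0 &&
         meminfo_value_kb(fp, "Committed_AS", &committed_kb) == 0;
    fclose(fp);
    if (!ok)
        return -1;

    printf("overcommit_memory = %d\n", mode);
    printf("CommitLimit  = %ld kB\n", limit_kb);
    printf("Committed_AS = %ld kB\n", committed_kb);
    return 0;
}
```

Calling report_overcommit() once before and once after starting the server
with the 12GB setting should show whether Committed_AS is getting close to
CommitLimit. Note the limit is only enforced in mode 2.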

merlin

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs