[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034904#comment-15034904
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
----------------------------------------------

I got another recreate of this and this time got a core file. And I was wrong 
originally. It's not a short read that is causing this. Instead the read is 
failing with a return code of -1 and errno is set to EBADF. The manpage for 
read(2) indicates this can only happen when:

{code}
       EBADF  fd is not a valid file descriptor or is not open for reading.
{code}

But we specifically opened it 2 lines of code above that and checked to ensure 
it wasn't -1. 

In the core file I also see that the fd is valid:

{code}
#4  0x00007f9ff8e4070a in setup_random () at src/zookeeper.c:476
476     in src/zookeeper.c
(gdb) print errno
$3 = 9
(gdb) print fd
$4 = 140
(gdb) print seed
$5 = 32671
{code}

It's odd that seed has something in it. That could mean we read _something_, 
but it could also be because this code never initialized seed to zero and it's 
got whatever garbage was on the stack.

The only other thing that's very curious here is that I think when this happens 
it coincides with a call to zookeeper_close. But this is a local stack variable 
so I can't fathom how that could cause this failure scenario.

I'll keep digging.

> assert in setup_random
> ----------------------
>
>                 Key: ZOOKEEPER-2311
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>            Reporter: Marshall McMullen
>         Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32          // TODO: better seed
>  531     int seed;
>  532     int fd = open("/dev/urandom", O_RDONLY);
>  533     if (fd == -1) {
>  534         seed = getpid();
>  535     } else {
>  536         int rc = read(fd, &seed, sizeof(seed));
>  537         assert(rc == sizeof(seed));
>  538         close(fd);
>  539     }
>  540     srandom(seed);
>  541     srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x00007f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x00007f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x00007f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x00007f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x00007f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x00007f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x00007f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x00007f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x00007f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to