[ https://issues.apache.org/jira/browse/TS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669893#comment-13669893 ]

John Plevyak commented on TS-1648:
----------------------------------

Rather than long, we should be using int64, as "long" is not well defined (it
is platform dependent). Are those 10TB RAIDs? If so, you are better off using
them as JBOD, since ATS assumes that there is a single disk arm (or an equal
fraction of one) for each "disk" in storage.config. Because of the size of
your "disk", it is possible that you have more than 2^31 directory entries,
which would account for the overflow. Also, given the size, the "clear" may
take a long time. Your trace is not long enough for me to see if it repeats.
However, if it does repeat, it is possible that it is because dir_in_bucket
also takes an int which is then multiplied to get a directory number. The
other possibility is (of course) memory corruption: the directory is the
single largest memory user, and it contains a linked list which can be
circularized by corruption, but let's concentrate on the other issues first.
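
To make the overflow concrete, here is a minimal standalone sketch (not the
actual ATS code; SIZEOF_DIR and the offset arithmetic here are assumptions
for illustration): a 32-bit index multiplied by the 10-byte Dir size wraps
right around the faulting index seen in the gdb session quoted below.

    /* Illustrative only -- not the actual ATS macros.  Assumes the directory
     * offset is formed by multiplying a 32-bit index by the 10-byte Dir size
     * (five 16-bit words), which wraps once the index passes 2^31 / 10,
     * i.e. right around the faulting index in the trace quoted below. */
    #include <cstdint>
    #include <cstdio>

    static const int SIZEOF_DIR = 10;   /* bytes per directory entry */

    int main() {
      int i = 214748365;  /* value of i at the crash site in the trace below */

      /* Unsigned arithmetic so the wraparound is well defined in this demo;
       * the same product in signed 32-bit arithmetic is undefined behaviour. */
      int32_t wrapped = (int32_t)((uint32_t)i * (uint32_t)SIZEOF_DIR);
      int64_t widened = (int64_t)i * SIZEOF_DIR;

      printf("32-bit offset: %d\n", (int)wrapped);          /* -2147483646 */
      printf("64-bit offset: %lld\n", (long long)widened);  /* 2147483650 */
      return 0;
    }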

I would suggest that we change all the bucket/entry/etc. offsets to int64 (I
can build a patch, but I would appreciate a review). Second, I would suggest
(after testing to confirm that the patch fixes your problem) that you move to
JBOD rather than RAID-0, or to multiple NAS volumes corresponding
approximately to the number of underlying disks, since ATS will only have one
outstanding write (although multiple reads) for each "disk" in storage.config.
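
For illustration only (the device names are hypothetical), the JBOD layout
amounts to listing each raw device on its own line in storage.config instead
of a single RAID-0 device:

    # storage.config, JBOD: one raw device per line, so ATS sees a separate
    # disk arm for each "disk" and can keep one write outstanding per spindle.
    /dev/sdb
    /dev/sdc

    # A RAID-0 alternative (a single striped device such as /dev/md0) would be
    # one line, i.e. one "disk" and only one outstanding write for all spindles.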
                
> Segmentation fault in dir_clear_range()
> ---------------------------------------
>
>                 Key: TS-1648
>                 URL: https://issues.apache.org/jira/browse/TS-1648
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>    Affects Versions: 3.3.0, 3.2.0
>         Environment: reverse proxy
>            Reporter: Tomasz Kuzemko
>            Assignee: weijin
>              Labels: A
>             Fix For: 3.3.3
>
>         Attachments: 
> 0001-Fix-for-TS-1648-Segmentation-fault-in-dir_clear_rang.patch
>
>
> I use ATS as a reverse proxy. I have a fairly large disk cache consisting of 
> 2x 10TB raw disks. I do not use cache compression. After a few days of 
> running (this is a dev machine - not handling any traffic) ATS begins to 
> crash with a segfault shortly after start:
> [Jan 11 16:11:00.690] Server {0x7ffff2bb8700} DEBUG: (rusage) took rusage 
> snap 1357917060690487000
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff20ad700 (LWP 17292)]
> 0x0000000000696a71 in dir_clear_range (start=640, end=17024, vol=0x16057d0) 
> at CacheDir.cc:382
> 382   CacheDir.cc: No such file or directory.
>       in CacheDir.cc
> (gdb) p i
> $1 = 214748365
> (gdb) l
> 377   in CacheDir.cc
> (gdb) p dir_index(vol, i)
> $2 = (Dir *) 0x7ff997a04002
> (gdb) p dir_index(vol, i-1)
> $3 = (Dir *) 0x7ffa97a03ff8
> (gdb) p *dir_index(vol, i-1)
> $4 = {w = {0, 0, 0, 0, 0}}
> (gdb) p *dir_index(vol, i-2)
> $5 = {w = {0, 0, 52431, 52423, 0}}
> (gdb) p *dir_index(vol, i)
> Cannot access memory at address 0x7ff997a04002
> (gdb) p *dir_index(vol, i+2)
> Cannot access memory at address 0x7ff997a04016
> (gdb) p *dir_index(vol, i+1)
> Cannot access memory at address 0x7ff997a0400c
> (gdb) p vol->buckets * DIR_DEPTH * vol->segments
> $6 = 1246953472
> (gdb) bt
> #0  0x0000000000696a71 in dir_clear_range (start=640, end=17024, 
> vol=0x16057d0) at CacheDir.cc:382
> #1  0x000000000068aba2 in Vol::handle_recover_from_data (this=0x16057d0, 
> event=3900, data=0x16058a0) at Cache.cc:1384
> #2  0x00000000004e8e1c in Continuation::handleEvent (this=0x16057d0, 
> event=3900, data=0x16058a0) at ../iocore/eventsystem/I_Continuation.h:146
> #3  0x0000000000692385 in AIOCallbackInternal::io_complete (this=0x16058a0, 
> event=1, data=0x135afc0) at ../../iocore/aio/P_AIO.h:80
> #4  0x00000000004e8e1c in Continuation::handleEvent (this=0x16058a0, event=1, 
> data=0x135afc0) at ../iocore/eventsystem/I_Continuation.h:146
> #5  0x0000000000700fec in EThread::process_event (this=0x7ffff36c4010, 
> e=0x135afc0, calling_code=1) at UnixEThread.cc:142
> #6  0x00000000007011ff in EThread::execute (this=0x7ffff36c4010) at 
> UnixEThread.cc:191
> #7  0x00000000006ff8c2 in spawn_thread_internal (a=0x1356040) at Thread.cc:88
> #8  0x00007ffff797e8ca in start_thread () from /lib/libpthread.so.0
> #9  0x00007ffff55c6b6d in clone () from /lib/libc.so.6
> #10 0x0000000000000000 in ?? ()
> This is fixed by running "traffic_server -Kk" to clear the cache. But after a 
> few days the issue reappears.
> I will keep the current faulty setup as-is in case you need me to provide 
> more data. I tried to make a core dump, but it takes up a couple of GB even 
> after gzip (I can, however, provide it on request).
> *Edit*
> OS is Debian GNU/Linux 6.0.6 with custom built kernel 
> 3.2.13-grsec-xxxx-grs-ipv6-64

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
