Hi

Resin license #1016608

Recently we've migrated to a ceph-based storage setup where we use NFS to allow 
our multiple Resin instances to write their logs directly to Ceph over NFS.

However, on 2 of our client-machines we see NFS stalls and retransmissions a 
few minutes past midnight for some files.

So far I've managed to write off posibility after posibility and right now I'm 
stuck a few last possibilities, and the next one in line is the way Resin does 
logrotation.

A poll of open files each second (via lsof output) shows this where out stdout 
logfile gets rotated:

COMMAND     PID          USER   FD      TYPE               DEVICE   SIZE/OFF    
       NODE NAME
Sun Feb  7 23:59:58 CET 2016
java      30865       results  238w      REG               0,25 1215099201 
1073774597 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log
Sun Feb  7 23:59:59 CET 2016
java      30865       results  238w      REG               0,25 1215100254 
1073774597 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log
Mon Feb  8 00:00:00 CET 2016
java      30865       results  107w      REG               0,25  233381888 
1073774632 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log.20160207
java      30865       results  118r      REG               0,25 1215103872 
1073774597 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log
Mon Feb  8 00:00:04 CET 2016
java      30865       results  107w      REG               0,25  929890304 
1073774632 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log.20160207
java      30865       results  118r      REG               0,25 1215103872 
1073774597 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log
Mon Feb  8 00:00:55 CET 2016
java      30865       results  161r      REG               0,25    1439432 
1073774597 
/nfsdata-web11nfs/results/logs/web9_results_backend1/resin-web-stdout.log

All day on the 7th the logfile is opened on filehandle 238 with write access 
(238w in the FD column)
Then around midnight a new file is created named resin-web-stdout.log.20160207 
on filehandle 107 with write permission (107w in the FD column).
At the same time it seems resin-web-stdout.log is opened with a new filehandle 
and only with read permission (118r).

Then over time data is copied from the file with handle 118r to 107w - the 
size/offset numbers increase between 00:00:00 and 00:00:04.

The stall I experience happens between 00:00:04 and 00:00:55 where lsof output 
stalls (the man-page mentions that lstat(2), readlink(2), and stat(2) calls are 
blocked if the NFS server is unresponsive).

Other open files in other Resin instances running on the same server does not 
see these stalls - nor does other client-machines, so I am confident that our 
NFS-server is alive and answering requests.


Now for the real questions:

Since we are running an ancient version of Resin (3.1.13 and 3.1.14), have the 
way of rotating logfiles changed in Resin 4.0 or later?

Are the stalls I'm seeing (I assume it could just as well happen locally if a 
disksubsystem was slow enough) a known issue that has been fixed in Resin 4.0 
or later?


Right now the best way of solving this is to change to syslog logging, but 
since we already have a roadmap for Resin 4 planned for later this year (yeay - 
finally!) this might just be enough for us to move it up to the top of our list.

Regards,
Jens Dueholm Christensen
Survey IT

P +45 5161 7879
j...@ramboll.com<mailto:j...@ramboll.com>
________________________________________

Rambøll
Olof Palmes Allé 20
DK-8200 Aarhus N
Denmark
www.ramboll.dk<http://www.ramboll-management.dk/>

_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest

Reply via email to