On 09-Jul-2015 06:48, Stuart Barkley wrote:
> Even though I doubt it is your problem, this smells similar to the
> zone_reclaim_mode issues we saw last year.

> You might check 'sar -B' output.  Specifically the 'pgscand/s' column.

It stays at 0, but see the caveat below.

> Check the setting of /proc/sys/vm/zone_reclaim_mode (it should be 0).

It is.
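
For anyone repeating these two checks from a script rather than by hand, here is a minimal Python sketch. It rests on an assumption: that the pgscand/s column of 'sar -B' is derived from the pgscan_direct* counters in /proc/vmstat, whose exact names vary with kernel version. The function names are invented for this example.

```python
from pathlib import Path


def direct_scan_pages(vmstat_text):
    """Sum the pgscan_direct* counters found in /proc/vmstat-style text.

    Assumption: these counters are what sar reports as pgscand/s.
    pgscan_direct_throttle is skipped because it counts throttling
    events, not pages scanned.
    """
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if name.startswith("pgscan_direct") and name != "pgscan_direct_throttle":
            total += int(value)
    return total


def direct_scan_rate(before_text, after_text, seconds):
    """Pages scanned directly per second between two vmstat snapshots."""
    return (direct_scan_pages(after_text) - direct_scan_pages(before_text)) / seconds


def zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current zone_reclaim_mode setting (0 means disabled)."""
    return int(Path(path).read_text().strip())
```

Reading /proc/vmstat once before and once after a test run, and passing both texts to direct_scan_rate() with the elapsed time, gives a rough average for the run. A value that stays at 0 would match what sar showed here.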

The caveat: this morning I cannot make the tests go slow! Same account, same command, same input file. Apparently the issue depends on how the system was used previously, and it eventually sorts itself out on an idle system. Before this problem was noticed, 40 of the 48 nodes had each been used to generate and write one of these huge files (17.45 GB). My testing of the read speed went on for about four hours after that, and it was uniformly slow for test files over the "just below 2^34 byte" limit for my account. The system then sat idle for about 15 hours, and now the performance issue isn't happening, not even on a test file twice the size of the largest one attempted yesterday.

Interestingly, the "taskset" isn't needed now either. When the test program is run without it, it runs nicely and no "migration/#" process ever pops up.

It seems the earlier processing left the system in some sort of state that made the OS short of who knows what, triggering all of these issues when a lot of memory was needed on one CPU (or in one process).
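
One hedged way to look for that kind of leftover state is to snapshot /proc/meminfo while the system is slow and again after it has recovered, then compare a few fields. The sketch below is only an illustration: the choice of fields is a guess at what might matter (page cache, dirty pages, slab), not a diagnosis, and the function names are invented for this example.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text.

    Values are the integers as reported (kB for most fields).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        if tokens and tokens[0].isdigit():
            fields[name.strip()] = int(tokens[0])
    return fields


def meminfo_delta(before, after,
                  keys=("MemFree", "Cached", "Dirty", "Writeback", "Slab")):
    """Change (after minus before) for each chosen field present in both."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}
```

Since the symptoms point at memory on one CPU, the per-node files under /sys/devices/system/node/ may be more telling than the system-wide totals. Their lines carry a "Node N" prefix, so this parser would need a small adjustment to read them.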

I will re-abuse the system and see if that reintroduces the problem.

Thanks,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
