Every dog has its day! ;)
Prentice
On 07/09/2015 05:59 PM, James Cuff wrote:
Awesome!!!
With my job title most folks think I'm essentially technically
neutered these days.
Good to see there is still some life in this old dog :-)
Best,
J.
On Thursday, July 9, 2015, mathog <[email protected]> wrote:
On 09-Jul-2015 11:54, James Cuff wrote:
http://blog.jcuff.net/2015/04/of-huge-pages-and-huge-performance-hits.html
Well, that seems to be it, but not quite with the same symptoms
you observed. khugepaged never showed up, and "perf top" never
revealed _spin_lock_irqsave. Instead this is what "perf top"
shows in my tests:
(hugepage=always, when migration/# process observed)
89.97% [kernel] [k] compaction_alloc
1.21% [kernel] [k] compact_zone
1.18% [kernel] [k] get_pageblock_flags_group
0.75% [kernel] [k] __reset_isolation_suitable
0.57% [kernel] [k] clear_page_c_e
(hugepage=always, when events/# process observed)
85.97% [kernel] [k] compaction_alloc
0.84% [kernel] [k] compact_zone
0.65% [kernel] [k] get_pageblock_flags_group
0.64% perf [.] 0x000000000005cff7
(hugepage=never)
29.86% [kernel] [k] clear_page_c_e
21.88% [kernel] [k] copy_user_generic_string
12.46% [kernel] [k] __alloc_pages_nodemask
5.70% [kernel] [k] page_fault
This is good, because "perf top" shows that the underlying issue
is compaction_alloc and compact_zone in both cases, even though
plain top shows migration/# in one case and, when the copy is
locked to a cpu, events/#.
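Roughly, the observation can be repeated with something like the
following (a sketch only; the exact perf invocation and the choice
of cpu 0 are illustrative, and /some/bigfile stands in for any
large file):

  # watch kernel hotspots while the slow copy runs:
  perf top
  # pin the copy to one cpu to see events/# instead of migration/#:
  taskset -c 0 cat /some/bigfile > /dev/null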
Switching hugepage always->never seems to fix things right
away. Switching hugepage never->always seems to take a while to
break: to get it to start failing again, many of the big files
involved must be copied to /dev/null once more, even though they
were presumably already in the file cache.
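A sketch of that never->always reproduction (the /data/bigfile*
path is just an example; use whatever large files are handy):

  echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
  # re-read the big files until the slowdown reappears:
  for f in /data/bigfile*; do cat "$f" > /dev/null; done
  # then watch top for migration/# or events/# pegging a cpu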
Searched for "compaction_alloc" and "compact_zone" and found a
suggestion here
https://structureddata.github.io/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
to do:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
(transparent_hugepage is a link to redhat_transparent_hugepage).
Re-enabled hugepage and reproduced the painfully slow IO, then
set defrag to "never" and the IO was fast again, even though
hugepage was still enabled.
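For the record, the state can be checked and the fix applied like
so (persisting it via /etc/rc.local is a suggestion, not something
tested here):

  # the value shown in brackets is the active one:
  cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
  cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
  # disable only defrag, leaving hugepage itself on:
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
  # the setting does not survive a reboot; one option is to put
  # the echo line in /etc/rc.local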
So on my machine the problem seems to be with hugepage defrag
specifically. Disabling just that is sufficient to resolve the
issue; it isn't necessary to turn off all of hugepage. Will let
it run that way for a while and see if anything else shows up.
For future reference:
CentOS release 6.6 (Final)
kernel 2.6.32-504.23.4.el6.x86_64
Dell Inc. PowerEdge T620/03GCPM, BIOS 2.2.2 01/16/2014
48 Intel Xeon CPU E5-2695 v2 @ 2.40GHz (in /proc/cpuinfo)
RAM 529231456 kB (in /proc/meminfo)
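For example, those last two counts can be read with:

  grep -c ^processor /proc/cpuinfo   # 48
  grep MemTotal /proc/meminfo        # MemTotal: 529231456 kB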
Thanks all!
David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
--
(Via iPhone)
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf