Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Fri, 02 Sep 2005 13:37:16 +0900 (JST)
Message-ID: <[EMAIL PROTECTED]>

> From: Andrew Morton <[EMAIL PROTECTED]>
> > Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > >
> > > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
> > > +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c 2005-09-01 17:09:41.0 +0900
> >
> > Really. Please redo and retest the patch against a current kernel.
>
> Does it mean 2.6.13? I'll do it.
>
> Regards,
> Hiro

Hi,

The following is the patch against 2.6.13.

Hiro

diff -ur linux-2.6.13/Makefile linux-2.6.13.nt/Makefile
--- linux-2.6.13/Makefile 2005-08-29 08:41:01.0 +0900
+++ linux-2.6.13.nt/Makefile 2005-09-03 14:11:27.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 13
-EXTRAVERSION =
+EXTRAVERSION = .nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.13/arch/i386/lib/usercopy.c linux-2.6.13.nt/arch/i386/lib/usercopy.c
--- linux-2.6.13/arch/i386/lib/usercopy.c 2005-08-29 08:41:01.0 +0900
+++ linux-2.6.13.nt/arch/i386/lib/usercopy.c 2005-09-03 14:09:18.0 +0900
@@ -425,6 +425,107 @@
 		: "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware. */
+/* [EMAIL PROTECTED] */
+static unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+	int d0, d1;
+
+	__asm__ __volatile__(
+		"       .align 2,0x90\n"
+		"0:     movl 32(%4), %%eax\n"
+		"       cmpl $67, %0\n"
+		"       jbe 2f\n"
+		"1:     movl 64(%4), %%eax\n"
+		"       .align 2,0x90\n"
+		"2:     movl 0(%4), %%eax\n"
+		"21:    movl 4(%4), %%edx\n"
+		"       movnti %%eax, 0(%3)\n"
+		"       movnti %%edx, 4(%3)\n"
+		"3:     movl 8(%4), %%eax\n"
+		"31:    movl 12(%4),%%edx\n"
+		"       movnti %%eax, 8(%3)\n"
+		"       movnti %%edx, 12(%3)\n"
+		"4:     movl 16(%4), %%eax\n"
+		"41:    movl 20(%4), %%edx\n"
+		"       movnti %%eax, 16(%3)\n"
+		"       movnti %%edx, 20(%3)\n"
+		"10:    movl 24(%4), %%eax\n"
+		"51:    movl 28(%4), %%edx\n"
+		"       movnti %%eax, 24(%3)\n"
+		"       movnti %%edx, 28(%3)\n"
+		"11:    movl 32(%4), %%eax\n"
+		"61:    movl 36(%4), %%edx\n"
+		"       movnti %%eax, 32(%3)\n"
+		"       movnti %%edx, 36(%3)\n"
+		"12:    movl 40(%4), %%eax\n"
+		"71:    movl 44(%4), %%edx\n"
+		"       movnti %%eax, 40(%3)\n"
+		"       movnti %%edx, 44(%3)\n"
+		"13:    movl 48(%4), %%eax\n"
+		"81:    movl 52(%4), %%edx\n"
+		"       movnti %%eax, 48(%3)\n"
+		"       movnti %%edx, 52(%3)\n"
+		"14:    movl 56(%4), %%eax\n"
+		"91:    movl 60(%4), %%edx\n"
+		"       movnti %%eax, 56(%3)\n"
+		"       movnti %%edx, 60(%3)\n"
+		"       addl $-64, %0\n"
+		"       addl $64, %4\n"
+		"       addl $64, %3\n"
+		"       cmpl $63, %0\n"
+		"       ja 0b\n"
+		"       sfence \n"
+		"5:     movl %0, %%eax\n"
+		"       shrl $2, %0\n"
+		"       andl $3, %%eax\n"
+		"       cld\n"
+		"6:     rep; movsl\n"
+
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Andrew Morton <[EMAIL PROTECTED]>
> Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> >
> > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
> > +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c 2005-09-01 17:09:41.0 +0900
>
> Really. Please redo and retest the patch against a current kernel.

Does it mean 2.6.13? I'll do it.

Regards,
Hiro
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
>
> --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
> +++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c 2005-09-01 17:09:41.0 +0900

Really. Please redo and retest the patch against a current kernel.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Andrew,

From: Andrew Morton <[EMAIL PROTECTED]>
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> >
> > On Friday 02 September 2005 04:08, Andrew Morton wrote:
> >
> > > I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> > > about the whole idea... We'll gain some and we'll lose some - how do we
> > > know it's a net gain?
> >
> > I suspect it'll gain more than it loses. The only case where it might
> > not gain is immediately someone reading the data from the page cache again
> > after the write.
>
> That's a pretty common case - temporary files.
>
> > But I suppose that's far less frequent than writing the data.
>
> yup.
>
> Hiro, could you please send through a summary of the performance testing
> results sometime? Runtimes rather than oprofile output?

The iozone results are:

original 2.6.12.4   CPU time = 207.768 sec
cache aware         CPU time = 184.783 sec
(each figure is the total over three runs)

184.783/207.768=88.94% (11.06% reduction)

original:
pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %

cache aware:
pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %

Regards,
Hiro
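The totals and the 11.06% figure quoted from the iozone runs can be re-derived from the six per-run CPU times. The following is only a small sketch of that arithmetic; the function and array names are made up for illustration and are not part of the patch or of iozone.

```c
#include <stddef.h>

/* Sum the per-run CPU times (in seconds) reported by iozone. */
static double total_cpu_time(const double *runs, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum;
}

/* Percentage by which `patched` is lower than `baseline`. */
static double reduction_percent(double baseline, double patched)
{
	return (1.0 - patched / baseline) * 100.0;
}

/* CPU times copied from the three runs of each kernel in the mail. */
static const double orig_runs[]  = { 64.527, 71.933, 71.308 };
static const double aware_runs[] = { 62.465, 59.273, 63.045 };
```

Summing each array gives the two totals from the mail (207.768 and 184.783 seconds), and passing those totals to `reduction_percent()` gives roughly 11.06.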
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Andi Kleen <[EMAIL PROTECTED]> wrote:
>
> On Friday 02 September 2005 04:08, Andrew Morton wrote:
>
> > I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> > about the whole idea... We'll gain some and we'll lose some - how do we
> > know it's a net gain?
>
> I suspect it'll gain more than it loses. The only case where it might
> not gain is immediately someone reading the data from the page cache again
> after the write.

That's a pretty common case - temporary files.

> But I suppose that's far less frequent than writing the data.

yup.

Hiro, could you please send through a summary of the performance testing
results sometime? Runtimes rather than oprofile output?
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Friday 02 September 2005 04:08, Andrew Morton wrote:

> I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
> about the whole idea... We'll gain some and we'll lose some - how do we
> know it's a net gain?

I suspect it'll gain more than it loses. The only case where it might
not gain is immediately someone reading the data from the page cache again
after the write.

But I suppose that's far less frequent than writing the data.

-Andi
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
>
> From: Andi Kleen <[EMAIL PROTECTED]>
> > On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> >
> > > The following is the almost final version of the
> > > cache pollution aware __copy_from_user_ll() patch.
> >
> > Looks good to me.
> >
> > Once the filemap.c hunk is in I'll probably do something
> > similar for x86-64.
>
> Thank you very much. What else should I do? Shall I just
> be waiting to check in the patch?
>

I suppose I'll queue it up in -mm for a while, although I'm a bit dubious
about the whole idea... We'll gain some and we'll lose some - how do we
know it's a net gain?
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Friday 02 September 2005 03:43, Hiro Yoshioka wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
>
> > On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
> > > The following is the almost final version of the
> > > cache pollution aware __copy_from_user_ll() patch.
> >
> > Looks good to me.
> >
> > Once the filemap.c hunk is in I'll probably do something
> > similar for x86-64.
>
> Thank you very much. What else should I do? Shall I just
> be waiting to check in the patch?

I suppose Andrew will take care of it, unless someone else objects.

-Andi
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Andi Kleen <[EMAIL PROTECTED]>
> On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:
>
> > The following is the almost final version of the
> > cache pollution aware __copy_from_user_ll() patch.
>
> Looks good to me.
>
> Once the filemap.c hunk is in I'll probably do something
> similar for x86-64.

Thank you very much. What else should I do? Shall I just
be waiting to check in the patch?

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Thursday 01 September 2005 11:07, Hiro Yoshioka wrote:

> The following is the almost final version of the
> cache pollution aware __copy_from_user_ll() patch.

Looks good to me.

Once the filemap.c hunk is in I'll probably do something
similar for x86-64.

-Andi
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

> From: Andi Kleen <[EMAIL PROTECTED]>
> > > Hi,
> > >
> > > The following patch does not use MMX registers so that we don't have
> > > to worry about save/restore the FPU/MMX states.
> > >
> > > What do you think?
> >
> > Performance will probably be bad on K7 Athlons - those have a microcoded
> > movnti which is quite slow.
> >
> > Also BTW I don't see any code anywhere that tests the CPUID bits,
> > so your code will fail spectacularly on a PII that didn't do SSE
> > (intel user copy used to be enabled on those)
> >
> > One way to solve this might be to use different code using
> > alternative()
> >
> > -Andi

The following is the almost final version of the
cache pollution aware __copy_from_user_ll() patch.

1) use the sfence instruction to serialize all preceding
   store-to-memory instructions.
2) check if the cpu has the xmm2 extensions (movnti).

I think it is good enough to be considered for the mainline.

What do you think?

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig  1921587
2.6.12.4.nt    1599424
1599424/1921587=83.23% (16.77% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)

2.6.12.4.orig  57427
2.6.12.4.nt    20858
20858/57427=36.32% (63.7% reduction)

L3 cache miss reduction of __copy_from_user_ll

samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
23       0.1103   vmlinux  __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)

Top 5 of 2.6.12.4.nt

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
128392   8.0274   vmlinux   __copy_user_zeroing_intel_nocache
64206    4.0143   vmlinux   journal_add_journal_head
59746    3.7355   vmlinux   do_get_write_access
47674    2.9807   vmlinux   journal_put_journal_head
46021    2.8774   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
69755    4.2861   vmlinux   __copy_user_zeroing_intel_nocache
55685    3.4215   vmlinux   journal_add_journal_head
52371    3.2179   vmlinux   __find_get_block
45504    2.7960   vmlinux   journal_put_journal_head
36005    2.2123   vmlinux   journal_stop
pattern9-0-cpu4-0-09011744/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
1147     5.4994   vmlinux   journal_add_journal_head
881      4.2240   vmlinux   journal_dirty_data
872      4.1809   vmlinux   blk_rq_map_sg
734      3.5192   vmlinux   journal_commit_transaction
617      2.9582   vmlinux   radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nt/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.nt/Makefile 2005-08-24 17:23:57.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.nt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nt/arch/i386/lib/usercopy.c
--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4.nt/arch/i386/lib/usercopy.c 2005-09-01 17:09:41.0 +0900
@@ -421,6 +421,107 @@
 		: "eax", "edx", "memory");
 	return size;
 }
+
+/* Non Temporal Hint version of __copy_user_zeroing_intel */
+/* It is cache aware. */
+/* [EMAIL PROTECTED] */
+static unsigned long
+__copy_user_zeroing_intel_nocache(void *to, const void __user *from, unsigned long size)
+{
+	int d0, d1;
+
+	__asm__ __volatile__(
+		"       .align 2,0x90\n"
+		"0:     movl 32(%4), %%eax\n"
+		"       cmpl $67, %0\n"
+		"       jbe 2f\n"
+		"1:     movl 64(%4), %%eax\n"
+		"       .align 2,0x90\n"
+		"2:     movl 0(%4), %%eax\n"
+		"21:    movl 4(%4), %%edx\n"
+		"       movnti %%eax, 0(%3)\n"
+		"       movnti %%edx, 4(%3)\n"
+		"3:     movl 8(%4), %%eax\n"
+		"31:    movl 12(%4),%%edx\n"
+		"       movnti %%eax, 8(%3)\n"
+		"
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Andi Kleen <[EMAIL PROTECTED]>
> > Hi,
> >
> > The following patch does not use MMX registers so that we don't have
> > to worry about save/restore the FPU/MMX states.
> >
> > What do you think?
>
> Performance will probably be bad on K7 Athlons - those have a microcoded
> movnti which is quite slow.
>
> Also BTW I don't see any code anywhere that tests the CPUID bits,
> so your code will fail spectacularly on a PII that didn't do SSE
> (intel user copy used to be enabled on those)
>
> One way to solve this might be to use different code using
> alternative()
>
> -Andi

Thanks for your comments. I'll consider it.

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Hirokazu Takahashi <[EMAIL PROTECTED]>
> > The following patch does not use MMX registers so that we don't have
> > to worry about save/restore the FPU/MMX states.
> >
> > What do you think?
>
> I think __copy_user_zeroing_intel_nocache() should be followed by sfence
> or mfence instruction to flush the data.

Thanks. I'll implement it.

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

> The following patch does not use MMX registers so that we don't have
> to worry about save/restore the FPU/MMX states.
>
> What do you think?

I think __copy_user_zeroing_intel_nocache() should be followed by sfence
or mfence instruction to flush the data.
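The suggestion to follow the non-temporal copy with a fence can be illustrated in user space with compiler intrinsics. This is a sketch under stated assumptions, not the kernel patch: `copy_words_nocache` is a hypothetical helper, it uses `_mm_stream_si32()` (which compilers emit as `movnti`) and `_mm_sfence()` from `<emmintrin.h>`, and it falls back to a plain `memcpy()` when the build does not target SSE2.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32 (movnti), _mm_sfence */
#endif

/*
 * Hypothetical helper: copy n_words 32-bit words with non-temporal
 * stores, then issue a store fence. Non-temporal stores are weakly
 * ordered, so the fence is what makes them ordered ahead of any
 * stores that follow the copy.
 */
static void copy_words_nocache(int *dst, const int *src, size_t n_words)
{
#if defined(__SSE2__)
	for (size_t i = 0; i < n_words; i++)
		_mm_stream_si32(&dst[i], src[i]);	/* one movnti per word */
	_mm_sfence();
#else
	/* No SSE2 at build time: ordinary cached copy. */
	memcpy(dst, src, n_words * sizeof(*dst));
#endif
}
```

A single fence after the loop, rather than one per store, keeps the cost of the ordering guarantee off the copy's inner path.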
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hiro Yoshioka <[EMAIL PROTECTED]> writes:

> Hi,
>
> The following patch does not use MMX registers so that we don't have
> to worry about save/restore the FPU/MMX states.
>
> What do you think?

Performance will probably be bad on K7 Athlons - those have a microcoded
movnti which is quite slow.

Also BTW I don't see any code anywhere that tests the CPUID bits,
so your code will fail spectacularly on a PII that didn't do SSE
(intel user copy used to be enabled on those)

One way to solve this might be to use different code using
alternative()

-Andi
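The missing CPUID test mentioned above can be sketched in user space. This is not the kernel's own feature-flag machinery and not the `alternative()` mechanism. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid()`. It relies on `movnti` being part of SSE2, which is reported in CPUID leaf 1, EDX bit 26. The names `edx_has_sse2` and `cpu_has_sse2` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>	/* GCC/Clang helper: __get_cpuid() */
#endif

#define CPUID_EDX_SSE2 (UINT32_C(1) << 26)	/* CPUID.01H:EDX bit 26 */

/* Decode the SSE2 feature flag from a CPUID leaf 1 EDX value. */
static bool edx_has_sse2(uint32_t edx)
{
	return (edx & CPUID_EDX_SSE2) != 0;
}

/* Query the running CPU; reports false on non-x86 builds. */
static bool cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;	/* leaf 1 not supported */
	return edx_has_sse2(edx);
#else
	return false;
#endif
}
```

A caller would pick the `movnti` copy path only when `cpu_has_sse2()` returns true and use the ordinary cached copy otherwise.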
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Wed, 2005-08-24 at 23:11 +0900, Hiro Yoshioka wrote:
> Hi,
>
> The following patch does not use MMX registers so that we don't have
> to worry about save/restore the FPU/MMX states.
>
> What do you think?

excellent!
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

The following patch does not use MMX registers so that we don't have
to worry about save/restore the FPU/MMX states.

What do you think?

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig  1921587
2.6.12.4.nt    1688900
1688900/1921587=87.89% (12.1% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)

2.6.12.4.orig  57427
2.6.12.4.nt    17122
17122/57427=29.81% (70.18% reduction)

L3 cache miss reduction of __copy_from_user_ll

samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
24       0.1402   vmlinux  __copy_user_zeroing_intel_nocache
24/37408=0.064% (99.93% reduction)

> Top 5 2.6.12.4.orig
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
> samples  %        app name  symbol name
> 287643   14.9692  vmlinux   __copy_from_user_ll
> 72660    3.7813   vmlinux   journal_add_journal_head
> 65011    3.3832   vmlinux   do_get_write_access
> 50618    2.6342   vmlinux   journal_put_journal_head
> 48068    2.5015   vmlinux   journal_dirty_metadata
> pattern9-0-cpu4-0-08191743/summary.out
>
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> samples  %        app name  symbol name
> 134756   7.9364   vmlinux   __copy_from_user_ll
> 57735    3.4003   vmlinux   journal_add_journal_head
> 50653    2.9832   vmlinux   __find_get_block
> 44522    2.6221   vmlinux   journal_put_journal_head
> 38928    2.2927   vmlinux   journal_dirty_metadata
> pattern9-0-cpu4-0-08191741/summary.out
>
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
> samples  %        app name  symbol name
> 37408    65.1412  vmlinux   __copy_from_user_ll
> 953      1.6595   vmlinux   blk_rq_map_sg
> 886      1.5429   vmlinux   sub_preempt_count
> 680      1.1841   vmlinux   journal_add_journal_head
> 598      1.0413   vmlinux   journal_commit_transaction
> pattern9-0-cpu4-0-08191720/summary.out
>

The following data is from the implementation without the MMX registers.

Top 5 2.6.12.4.nt

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
137744   8.1560   vmlinux   __copy_user_zeroing_intel_nocache
68723    4.0692   vmlinux   do_get_write_access
65808    3.8966   vmlinux   journal_add_journal_head
50373    2.9826   vmlinux   journal_dirty_metadata
49038    2.9036   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08242225/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
62165    3.7913   vmlinux   __copy_user_zeroing_intel_nocache
57862    3.5289   vmlinux   journal_add_journal_head
54230    3.3073   vmlinux   __find_get_block
48335    2.9478   vmlinux   journal_put_journal_head
35737    2.1795   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08242152/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
867      5.0637   vmlinux   blk_rq_map_sg
694      4.0533   vmlinux   journal_add_journal_head
629      3.6736   vmlinux   journal_commit_transaction
624      3.6444   vmlinux   radix_tree_delete
525      3.0662   vmlinux   release_pages
pattern9-0-cpu4-0-08242147/summary.out

The following is the MMX version of the cache aware implementation.

> Top 5 2.6.12.4.preempt
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
> samples  %        app name  symbol name
> 123531   7.5582   vmlinux   __copy_user_zeroing_inatomic_nocache
> 64820    3.9660   vmlinux   journal_add_journal_head
> 60460    3.6992   vmlinux   do_get_write_access
> 47172    2.8862   vmlinux   journal_put_journal_head
> 46753    2.8606   vmlinux   journal_dirty_metadata
> pattern9-0-cpu4-0-08190838/summary.out
>
> Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
> sample
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

It seems to me this mail did not go out, so I am resending it.

> On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > 1) using stack to save/restore MMX registers
>
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache aware copy_from_user_ll.

1) using kernel_fpu_begin()/kernel_fpu_end()
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs, for callers who want to use them.
   (There is no change in the current APIs.)

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig     1921587
2.6.12.4.preempt  1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)

2.6.12.4.orig     57427
2.6.12.4.preempt  17398

samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
51       0.2931   vmlinux  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)

Top 5 2.6.12.4.orig

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
287643   14.9692  vmlinux   __copy_from_user_ll
72660    3.7813   vmlinux   journal_add_journal_head
65011    3.3832   vmlinux   do_get_write_access
50618    2.6342   vmlinux   journal_put_journal_head
48068    2.5015   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
134756   7.9364   vmlinux   __copy_from_user_ll
57735    3.4003   vmlinux   journal_add_journal_head
50653    2.9832   vmlinux   __find_get_block
44522    2.6221   vmlinux   journal_put_journal_head
38928    2.2927   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
37408    65.1412  vmlinux   __copy_from_user_ll
953      1.6595   vmlinux   blk_rq_map_sg
886      1.5429   vmlinux   sub_preempt_count
680      1.1841   vmlinux   journal_add_journal_head
598      1.0413   vmlinux   journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
123531   7.5582   vmlinux   __copy_user_zeroing_inatomic_nocache
64820    3.9660   vmlinux   journal_add_journal_head
60460    3.6992   vmlinux   do_get_write_access
47172    2.8862   vmlinux   journal_put_journal_head
46753    2.8606   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
126762   6.7993   vmlinux   __copy_user_zeroing_inatomic_nocache
79803    4.2805   vmlinux   journal_add_journal_head
70271    3.7692   vmlinux   journal_dirty_metadata
66146    3.5480   vmlinux   __find_get_block
58082    3.1154   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
901      5.1788   vmlinux   blk_rq_map_sg
675      3.8798   vmlinux   journal_commit_transaction
637      3.6613   vmlinux   radix_tree_delete
605      3.4774   vmlinux   journal_add_journal_head
580      3.3337   vmlinux   release_pages
...
51       0.2931   vmlinux   __copy_user_zeroing_inatomic_nocache
...
1        0.0057   vmlinux   __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.preempt/Makefile 2005-08-18 18:47:07.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/us
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
> On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> > 1) using stack to save/restore MMX registers
>
> It seems to me that it has some regression.
> I'd like to rollback it and use kernel_fpu_begin() and kernel_fpu_end().

The following is the current version of the cache aware copy_from_user_ll.

1) using kernel_fpu_begin()/kernel_fpu_end()
2) low latency version of cache aware copy
3) __copy_user*_nocache APIs, for callers who want to use them.
   (There is no change in the current APIs.)

Some performance data are

Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

2.6.12.4.orig     1921587
2.6.12.4.preempt  1634411
1634411/1921587=85.06% (15% reduction)

BSQ_CACHE_REFERENCE (L3 cache miss)

2.6.12.4.orig     57427
2.6.12.4.preempt  17398

samples  %
37408    65.1412  vmlinux  __copy_from_user_ll
51       0.2931   vmlinux  __copy_user_zeroing_inatomic_nocache
51/37408=0.136% (99.86% reduction)

Top 5 2.6.12.4.orig

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
287643   14.9692  vmlinux   __copy_from_user_ll
72660    3.7813   vmlinux   journal_add_journal_head
65011    3.3832   vmlinux   do_get_write_access
50618    2.6342   vmlinux   journal_put_journal_head
48068    2.5015   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191743/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
134756   7.9364   vmlinux   __copy_from_user_ll
57735    3.4003   vmlinux   journal_add_journal_head
50653    2.9832   vmlinux   __find_get_block
44522    2.6221   vmlinux   journal_put_journal_head
38928    2.2927   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08191741/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
37408    65.1412  vmlinux   __copy_from_user_ll
953      1.6595   vmlinux   blk_rq_map_sg
886      1.5429   vmlinux   sub_preempt_count
680      1.1841   vmlinux   journal_add_journal_head
598      1.0413   vmlinux   journal_commit_transaction
pattern9-0-cpu4-0-08191720/summary.out

Top 5 2.6.12.4.preempt

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
123531   7.5582   vmlinux   __copy_user_zeroing_inatomic_nocache
64820    3.9660   vmlinux   journal_add_journal_head
60460    3.6992   vmlinux   do_get_write_access
47172    2.8862   vmlinux   journal_put_journal_head
46753    2.8606   vmlinux   journal_dirty_metadata
pattern9-0-cpu4-0-08190838/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples  %        app name  symbol name
126762   6.7993   vmlinux   __copy_user_zeroing_inatomic_nocache
79803    4.2805   vmlinux   journal_add_journal_head
70271    3.7692   vmlinux   journal_dirty_metadata
66146    3.5480   vmlinux   __find_get_block
58082    3.1154   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08190855/summary.out

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        app name  symbol name
901      5.1788   vmlinux   blk_rq_map_sg
675      3.8798   vmlinux   journal_commit_transaction
637      3.6613   vmlinux   radix_tree_delete
605      3.4774   vmlinux   journal_add_journal_head
580      3.3337   vmlinux   release_pages
...
51       0.2931   vmlinux   __copy_user_zeroing_inatomic_nocache
...
1        0.0057   vmlinux   __copy_from_user_ll_inatomic_nocache
pattern9-0-cpu4-0-08190859/summary.out

2.6.12.4-usercopy.c.patch.050819

diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile
--- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900
+++ linux-2.6.12.4.preempt/Makefile 2005-08-18 18:47:07.0 +0900
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION = .4.orig
+EXTRAVERSION = .4.preempt
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c
--- lin
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
> 2) low latency version of cache aware copy

Having a low latency version that is only active with CONFIG_PREEMPT
is bad - non preempt kernels need good latency too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

On 8/18/05, Hiro Yoshioka <[EMAIL PROTECTED]> wrote:
> 1) using stack to save/restore MMX registers

It seems to me that it has some regression.
I'd like to roll it back and use kernel_fpu_begin() and kernel_fpu_end().

Regards,
Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Thu, 2005-08-18 at 00:27 +0900, Akira Tsukamoto wrote:
> My computer with Athlon K7 was faster with manually prefetching,
> but I did not know it is already becoming obsolete.
>

Don't listen to people who tell you $FOO hardware is obsolete, they have
a very narrow view. "Obsolete" is meaningless except in reference to
some specific application.

The 386 is obsolete on the desktop but still common on the embedded
market.

Lee
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
> So I make two APIs. > __copy_user_zeroing_nocache() > __copy_user_zeroing_inatomic_nocache() > > The former is a low latency version and the other is a throughput version. 1) using stack to save/restore MMX registers 2) low latency version of cache aware copy 3) __copy_user*_nocache APIs so if you want to use it. diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.preempt/Makefile --- linux-2.6.12.4.orig/Makefile2005-08-12 14:37:59.0 +0900 +++ linux-2.6.12.4.preempt/Makefile 2005-08-18 18:47:07.0 +0900 @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 12 -EXTRAVERSION = .4.orig +EXTRAVERSION = .4.preempt NAME=Woozy Numbat # *DOCUMENTATION* diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c2005-08-05 16:04:37.0 +0900 +++ linux-2.6.12.4.preempt/arch/i386/lib/usercopy.c 2005-08-18 19:07:49.0 +0900 @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -511,6 +512,254 @@ : "memory");\ } while (0) +#define MMX_SAVE do { \ +preempt_disable(); \ +__asm__ __volatile__ ( \ +"movl %%cr0,%0 ;\n\t" \ +"clts ;\n\t" \ +"movq %%mm0,(%1) ;\n\t" \ +"movq %%mm1,8(%1) ;\n\t" \ +"movq %%mm2,16(%1) ;\n\t" \ +"movq %%mm3,24(%1) ;\n\t" \ +: "=&r" (cr0) \ +: "r" (mmx_save)\ +: "memory");\ +} while(0) + +#define MMX_RESTORE do { \ +__asm__ __volatile__ ( \ +"sfence ;\n\t" \ +"movq (%1),%%mm0 ;\n\t" \ +"movq 8(%1),%%mm1 ;\n\t" \ +"movq 16(%1),%%mm2 ;\n\t" \ +"movq 24(%1),%%mm3 ;\n\t" \ +"movl %0,%%cr0;\n\t" \ +: \ +: "r" (cr0), "r" (mmx_save) \ +: "memory");\ +preempt_enable(); \ +} while(0) + +#define ALIGN8 __attribute__((aligned(8))) + +/* Non Temporal Hint version of mmx_memcpy */ +/* It is cache aware */ +/* [EMAIL PROTECTED] */ +static unsigned long +__copy_user_zeroing_nocache(void *to, const void *from, size_t len) +{ +/* Note! gcc doesn't seem to align stack variables properly, so we + * need to make use of unaligned loads and stores. 
+ */ + void *p; + int i; +char mmx_save[8*4] ALIGN8; +int cr0; + + if (unlikely(in_interrupt())){ + __copy_user_zeroing(to, from, len); + return len; + } + + p = to; + i = len >> 6; /* len/64 */ + + /*kernel_fpu_begin();*/ + MMX_SAVE; + + __asm__ __volatile__ ( + "1: prefetchnta (%0)\n" /* This set is 28 bytes */ + " prefetchnta 64(%0)\n" + " prefetchnta 128(%0)\n" + " prefetchnta 192(%0)\n" + " prefetchnta 256(%0)\n" + "2: \n" + ".section .fixup, \"ax\"\n" + "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */ + " jmp 2b\n" + ".previous\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b, 3b\n" + ".previous" + : : "r" (from) ); + + for(; i>5; i--) + { + __asm__ __volatile__ ( + "1: prefetchnta 320(%0)\n" +"2: movq (%0), %%mm0\n" +" movq 8(%0), %%mm1\n" +" movq 16(%0), %%mm2\n" +" movq 24(%0), %%mm3\n" +" movntq %%mm0, (%1)\n" +" movntq %%mm1, 8(%1)\n" +" movntq %%mm2, 16(%1)\n" +" movntq %%mm3, 24(%1)\n" +" movq 32(%0), %%mm0\n" +" movq 40(%0), %%mm1\n" +" movq 48(%0), %%mm2\n" +" movq 56(%0), %%mm3\n" +" movntq %%mm0, 32(%1)\n" +" movntq %%mm1, 40(%1)\n" +" movntq %%mm2, 48(%1)\n" +" movntq %%mm3, 56(%1)\n" + ".section .fixup, \"ax\"\n" + "3: movw $0x05EB, 1b\n" /* jmp on 5 bytes */ + " jmp 2b\n" + ".previous\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b, 3b\n" + ".previous" + : : "r" (from), "r" (to) : "memory");
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On 16 Aug 2005 15:15:35 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:
> However it disables preemption, which especially for bigger
> copies will probably make the low latency people unhappy.

In the copy loop,

+#ifdef CONFIG_PREEMPT
+               if ( (i%64)==0 ) {
+                       MMX_RESTORE;
+                       MMX_SAVE;
+               };
+#endif

It costs several hundred clocks (wow) every 4KB copy. It kills
throughput but it makes the low latency people smile.

So I make two APIs.
__copy_user_zeroing_nocache()
__copy_user_zeroing_inatomic_nocache()

The former is a low latency version and the other is a throughput version.

What do you think?

Regards,
Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
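
The restore/save pair every 64 iterations, with 64 bytes moved per iteration, is what gives the 4KB interval mentioned above. A minimal user-space sketch of that loop shape follows. It is a model only, not the kernel code: chunked_copy(), struct copy_stats, window_open() and window_close() are hypothetical names, and memcpy() stands in for the movq/movntq sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES       64  /* bytes moved per loop iteration in the patch */
#define BLOCKS_PER_WINDOW 64  /* iterations between RESTORE/SAVE pairs */

/* Hypothetical counter; stands in for the cost of MMX_SAVE/MMX_RESTORE. */
struct copy_stats {
    unsigned long windows;    /* number of non-preemptible windows opened */
};

/* Stand-in for MMX_SAVE: in the patch this calls preempt_disable(). */
static void window_open(struct copy_stats *st)  { st->windows++; }

/* Stand-in for MMX_RESTORE: in the patch this calls preempt_enable(). */
static void window_close(struct copy_stats *st) { (void)st; }

/* Copy len bytes, reopening the window every BLOCKS_PER_WINDOW blocks. */
size_t chunked_copy(void *to, const void *from, size_t len,
                    struct copy_stats *st)
{
    unsigned char *d = to;
    const unsigned char *s = from;
    size_t blocks = len / BLOCK_BYTES;
    size_t i;

    assert(st != NULL);

    if (blocks > 0)
        window_open(st);
    for (i = 0; i < blocks; i++) {
        if (i != 0 && i % BLOCKS_PER_WINDOW == 0) {
            window_close(st);   /* a pending reschedule could run here */
            window_open(st);
        }
        memcpy(d, s, BLOCK_BYTES);
        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
    }
    if (blocks > 0)
        window_close(st);

    memcpy(d, s, len % BLOCK_BYTES);   /* tail shorter than one block */
    return len;
}
```

With these constants a 16KB copy opens four windows, so the longest stretch with preemption disabled covers 4KB regardless of the total length.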
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Chuck, On 8/18/05, Chuck Ebbert <[EMAIL PROTECTED]> wrote: > On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote: > > > 3) page faults/exceptions/... > > 3-1 TS flag is set by the CPU (Am I right?) > > TS will _not_ be set if a trap/fault or interrupt occurs. The only > way that could happen automatically would be to use a separate hardware > task with its own TSS to handle those. OK. > And since the kernel does not have any state information of its own > (no task_struct) any attempt to save the kernel-mode FPU state would > overwrite the current user-mode state anyway. > > Interrupt and fault handlers will not use FP instructions anyway. > The only thing you have to worry about is getting scheduled away > while your code is running, and I guess that's why you have to worry > about page faults. And as Arjan pointed out, if you are doing > __copy_from_user_inatomic you cannot sleep (==switch to another task.) > > So I would try the code from include/asm-i386/xor.h, modify it to > save as many registers as you plan to use and see what happens. It will > do all the right things. See the xor_sse_2() for how to save and restore > properly -- you will need to put your xmm_save area on the stack. My hack is the following. I just change from using kernel_fpu_begin() and kernel_fpu_end() to using a stack. My test does not find any regressions. 
--- usercopy.c.orig 2005-08-05 16:04:37.0 +0900 +++ usercopy.c 2005-08-18 16:53:37.0 +0900 @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -511,6 +512,144 @@ : "memory");\ } while (0) +#define MMX_SAVE do { \ +preempt_disable(); \ +__asm__ __volatile__ ( \ +"movl %%cr0,%0 ;\n\t" \ +"clts ;\n\t" \ +"movq %%mm0,(%1) ;\n\t" \ +"movq %%mm1,8(%1) ;\n\t" \ +"movq %%mm2,16(%1) ;\n\t" \ +"movq %%mm3,24(%1) ;\n\t" \ +: "=&r" (cr0) \ +: "r" (mmx_save)\ +: "memory");\ +} while(0) + +#define MMX_RESTORE do { \ +__asm__ __volatile__ ( \ +"sfence ;\n\t" \ +"movq (%1),%%mm0 ;\n\t" \ +"movq 8(%1),%%mm1 ;\n\t" \ +"movq 16(%1),%%mm2 ;\n\t" \ +"movq 24(%1),%%mm3 ;\n\t" \ +"movl %0,%%cr0;\n\t" \ +: \ +: "r" (cr0), "r" (mmx_save) \ +: "memory");\ +preempt_enable(); \ +} while(0) + +#define ALIGN8 __attribute__((aligned(8))) + +/* Non Temporal Hint version of mmx_memcpy */ +/* It is cache aware */ +/* [EMAIL PROTECTED] */ +static unsigned long +__copy_user_zeroing_nocache(void *to, const void *from, size_t len) +{ +/* Note! gcc doesn't seem to align stack variables properly, so we + * need to make use of unaligned loads and stores. 
+ */ + void *p; + int i; +char mmx_save[8*4] ALIGN8; +int cr0; + + if (unlikely(in_interrupt())){ + __copy_user_zeroing(to, from, len); + return len; + } + + p = to; + i = len >> 6; /* len/64 */ + + /*kernel_fpu_begin();*/ + MMX_SAVE; + + __asm__ __volatile__ ( + "1: prefetchnta (%0)\n" /* This set is 28 bytes */ + " prefetchnta 64(%0)\n" + " prefetchnta 128(%0)\n" + " prefetchnta 192(%0)\n" + " prefetchnta 256(%0)\n" + "2: \n" + ".section .fixup, \"ax\"\n" + "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */ + " jmp 2b\n" + ".previous\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b, 3b\n" + ".previous" + : : "r" (from) ); + + for(; i>5; i--) + { + __asm__ __volatile__ ( + "1: prefetchnta 320(%0)\n" +"2: movq (%0), %%mm0\n" +" movq 8(%0), %%mm1\n" +" movq 16(%0), %%mm2\n" +" movq 24(%0), %%mm3\n" +" movntq %%mm0, (%1)\n" +" movntq %%mm1, 8(%1)\n" +" movntq %%mm2, 16(%1)\n" +" movntq %%mm3, 24(%1)\n" +" movq 32(%0), %%mm0\n" +" movq 40(%0), %%mm1\n" +" movq 48(%0), %%mm2\n" +" movq 56(%0), %%mm3\n" +" movntq %%mm0, 32(%1)\n
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Wed, 17 Aug 2005 23:30:13 +0900
Akira Tsukamoto <[EMAIL PROTECTED]> mentioned:

> > I'm trying to understand this mechanism but I don't
> > understand very well.
>
> My explanation was a bit ambiguous, see the code below.
> Where is the fp register saved? It saves the fp register *inside* task_struct,

More clarification: to make fp_save generic, note that after an exception
such as a page fault, the copy function might get nested during page
allocation.
At first the registers hold user space fp content, but the nested copy
needs to save the kernel space fp content which came from the first copy
function.
So saving into task_struct is a bit of a problem.

XMM_SAVE/XMM_RESTORE uses the stack for it. Surrounding the copy loop with
XMM_SAVE/XMM_RESTORE should work.
Some might claim that saving/restoring every time might be a big overhead,
but I think it is better than having a lot of cache misses.

Isn't there some way to avoid disabling preemption for a long time?

--
Akira Tsukamoto <[EMAIL PROTECTED], [EMAIL PROTECTED]>
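
The nesting concern above can be illustrated with a toy user-space model. This is only a sketch of the argument, not kernel code: fake_reg stands in for one MMX/XMM register, task_slot for a single per-task save area, and all function names are hypothetical.

```c
#include <assert.h>

static int fake_reg;   /* stands in for one MMX/XMM register */
static int task_slot;  /* stands in for a single per-task save area */

/* Save into the shared per-task slot, clobber the register, maybe nest. */
static void use_reg_shared_slot(int depth)
{
    task_slot = fake_reg;        /* overwrites what an outer level saved */
    fake_reg = 100 + depth;      /* the copy loop clobbers the register */
    if (depth > 0)
        use_reg_shared_slot(depth - 1);  /* nested copy, e.g. after a fault */
    fake_reg = task_slot;        /* restore from the shared slot */
}

/* Same sequence, but each level keeps its saved value on its own stack. */
static void use_reg_stack_save(int depth)
{
    int saved = fake_reg;        /* private to this nesting level */
    fake_reg = 100 + depth;
    if (depth > 0)
        use_reg_stack_save(depth - 1);
    fake_reg = saved;
}

/* Returns the register value the "user" sees after the copy finishes. */
int run_shared_slot(int user_value, int depth)
{
    assert(depth >= 0);
    fake_reg = user_value;
    use_reg_shared_slot(depth);
    return fake_reg;
}

int run_stack_save(int user_value, int depth)
{
    assert(depth >= 0);
    fake_reg = user_value;
    use_reg_stack_save(depth);
    return fake_reg;
}
```

Without nesting both variants hand the original value back. With one level of nesting the shared slot is overwritten by the inner save, so the outer restore returns the kernel's scratch value instead of the user's, while the stack variant still restores correctly.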
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Wed, 17 Aug 2005 at 13:50:22 +0900 (JST), Hiro Yoshioka wrote:

> 3) page faults/exceptions/...
> 3-1 TS flag is set by the CPU (Am I right?)

TS will _not_ be set if a trap/fault or interrupt occurs. The only
way that could happen automatically would be to use a separate hardware
task with its own TSS to handle those.

And since the kernel does not have any state information of its own
(no task_struct) any attempt to save the kernel-mode FPU state would
overwrite the current user-mode state anyway.

Interrupt and fault handlers will not use FP instructions anyway.
The only thing you have to worry about is getting scheduled away
while your code is running, and I guess that's why you have to worry
about page faults. And as Arjan pointed out, if you are doing
__copy_from_user_inatomic you cannot sleep (==switch to another task.)

So I would try the code from include/asm-i386/xor.h, modify it to
save as many registers as you plan to use and see what happens. It will
do all the right things. See the xor_sse_2() for how to save and restore
properly -- you will need to put your xmm_save area on the stack.

__
Chuck
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
I am resubmitting this because it seems to have been lost when I posted it
the day before yesterday.

Arjan van de Ven mentioned:
> The only comment/question I have is about the use of prefetchnta; that
> might have cache-evicting properties as well (eg evict the cache of the
> original of the copy, eg the userspace memory). Is that really the right
> approach?

> In addition, my measurements show that removing the prefetch from the
> main copy loop is a gain because the modern cpus have an autoprefetcher
> already in the hardware.

My computer with Athlon K7 was faster with manual prefetching,
but I did not know it is already becoming obsolete.

It was quite a while ago, but I also made a similar copy_user function;
http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/K7-copy-47.patch
I added comments on each item in the copy function. It was basically
inspired by Takahashi's faster Intel copy function.

I also have some explanation about the speedup for pipelined cpus.
http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/copy_for_highlypipelined_cpu.txt

It was originally discussed in this thread,
http://marc.theaimsgroup.com/?l=linux-kernel&m=103742983924070&w=2
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Wed, 17 Aug 2005 14:10:34 +0900
Hiro Yoshioka <[EMAIL PROTECTED]> mentioned:

> On 8/17/05, Akira Tsukamoto <[EMAIL PROTECTED]> wrote:
> > Anyway, going back to copy_user topic,
> > big remaining issues are
> > 1)store/restore floating point register (80/64bytes) twice every time by
> > surrounding with kernel_fpu_begin()/kernel_fpu_end() is big penalty
>
> I don't know. If nobody uses MMX/XMM, then there is no need
> to save and restore.

I think you are confusing
1)lazy fpu save handling for user space task
2)kernel_fpu_begin()/kernel_fpu_end() inside the kernel

> > 2)after pagefault not always come back to copy function and corrupts fp
> > register
>
> I'm trying to understand this mechanism but I don't
> understand very well.

My explanation was a bit ambiguous, see the code below.
Where is the fp register saved? It saves the fp register *inside* task_struct,

static inline void kernel_fpu_begin(void)
+       if (tsk->flags & PF_USEDFPU) {
+               asm volatile("rex64 ; fxsave %0 ; fnclex"
+                             : "=m" (tsk->thread.i387.fxsave));

static inline void save_init_fpu( struct task_struct *tsk )
+       if ( cpu_has_fxsr ) {
+               asm volatile( "fxrstor %0"
+                             : : "m" (tsk->thread.i387.fxsave) );

What happens if, during your copy function, memory is not allocated, a
page fault is generated, the kernel goes to reclaim memory, and a task
switch changes to another task?

--
Akira Tsukamoto
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Akira,

Thanks for your suggestions.

On 8/17/05, Akira Tsukamoto <[EMAIL PROTECTED]> wrote:
> Anyway, going back to copy_user topic,
> big remaining issues are
> 1)store/restore floating point register (80/64bytes) twice every time by
> surrounding with kernel_fpu_begin()/kernel_fpu_end() is big penalty

I don't know. If nobody uses MMX/XMM, then there is no need
to save and restore.

> 2)after pagefault not always come back to copy function and corrupts fp
> register

I'm trying to understand this mechanism but I don't
understand it very well.

> 3)disabling long preemption
> Please correct me if I am wrong.
>
> I tried to implement fpsave inside pagefault handler once and here is my junk;
> http://www.suna-asobi.com/~akira-t/linux/k7-copy-user/K7-copy_47_with_fpusave_not_finished.patch
> never had a time to finish it. Hiro, does it help you?

Thanks. I'm reading your patch but could not understand it very well.
I'll ask you.

Regards,
Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Wed, 17 Aug 2005 08:21:53 +0900 (JST)
Message-ID: <[EMAIL PROTECTED]>

> Chuck,
>
> From: Chuck Ebbert <[EMAIL PROTECTED]>
> > On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > > oh, really? Does the linux kernel take care of
> > > SSE save/restore on a task switch?
> >
> > Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h
>
> Thanks for your suggestion. But it seems to me it won't help
> when we have a page fault or other exceptions.

Hi,

Let me understand how the kernel saves/restores FPU/MMX/XMM registers.
Please let me know if I'm wrong.

1) kernel_fpu_begin()
     preempt_disable()
     if TS_USEDFPU then
         __save_init_fpu()
         ... save to tsk->thread.i387.f*save
             clear TS_USEDFPU flag of tsk->thread_info->status
     else
         clts() --- clear TS flag of CR0

2) copy
     MMX/XMM registers are used.

3) page faults/exceptions/...
   3-1 TS flag is set by the CPU (Am I right?)

     if nobody uses MMX/XMM
   3-2 it's fine. we don't need save/restore
     else
   3-3 MMX/XMM is used
       When TS flag is set, the CPU monitors the instruction
       stream of X87 FPU/MMX/SSE/SSE2 instructions.
       When the CPU detects one of these instructions, it raises
       a device-not-available exception (#NM) prior to executing
       the instruction. (IA32 Software Developer's Manual, Vol. 3, 12.5.1)

       math_state_restore() is the device-not-available exception handler
          clts()
          if (!tsk_used_math(tsk)) init_fpu(tsk);
          restore_fpu(tsk);
          set TS_USEDFPU;

4) kernel_fpu_end()
     stts();  set TS flag of CR0
     preempt_enable();

It seems to me that the kernel automatically saves/restores FPU/MMX/XMM
registers. What's wrong with it?
Do I misunderstand it?

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Chuck,

From: Chuck Ebbert <[EMAIL PROTECTED]>
> On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
>
> Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h

Thanks for your suggestion. But it seems to me it won't help
when we have a page fault or other exceptions.

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Tue, 16 Aug 2005 at 19:16:17 +0900 (JST), Hiro Yoshioka wrote:

> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

Check out XMMS_SAVE and XMMS_RESTORE in include/asm-i386/xor.h

__
Chuck
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Arjan van de Ven <[EMAIL PROTECTED]> writes:
>
> not on kernel entry afaik.
> However just save the register on the stack and put it back at the
> end...

You need to do more than that, like disabling lazy FPU mode. That is
what kernel_fpu_begin/end takes care of.

However it disables preemption, which especially for bigger
copies will probably make the low latency people unhappy.

Without disabling preemption there is no way to use SSE right now.

Note that there is also an integer NT store in SSE1, however at least
in Athlon K7 it is microcoded and very slow.

-Andi
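
For readers who want to try non-temporal stores outside the kernel, here is a minimal user-space sketch. It makes several assumptions: it uses the SSE2 intrinsic _mm_stream_si32 (which compiles to movnti) rather than the movntq sequence from the MMX version of the patch, copy_nocache() is a hypothetical name, and on a target without SSE2 it simply falls back to memcpy(). It says nothing about the speed of any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Copy len bytes to a 4-byte aligned destination, bypassing the cache
 * for the stores where the target supports it. */
void copy_nocache(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    /* This sketch only handles a 4-byte aligned destination. */
    assert(((uintptr_t)to & 3) == 0);

#if defined(__SSE2__)
    while (len >= 4) {
        int word;

        memcpy(&word, s, sizeof word);     /* ordinary (cached) load */
        _mm_stream_si32((int *)d, word);   /* store with non-temporal hint */
        d += 4;
        s += 4;
        len -= 4;
    }
    /* Non-temporal stores are weakly ordered; fence before others read. */
    _mm_sfence();
#endif

    memcpy(d, s, len);   /* remaining tail, or everything without SSE2 */
}
```

The integer form needs no FPU/MMX state at all, which is why it sidesteps the save/restore and preemption questions raised in this thread. Whether it is fast enough on a given processor has to be measured.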
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

> > > > My code does nothing about it.
> > > >
> > > > I need a volunteer to implement it.
> > >
> > > it's actually not too hard; all you need is to use SSE and not MMX; and
> > > then just store sse register you're overwriting on the stack or so...
> >
> > oh, really? Does the linux kernel take care of
> > SSE save/restore on a task switch?
>
> not on kernel entry afaik.
> However just save the register on the stack and put it back at the
> end...

I think this has to be done in the pagefault handlers.

Thanks,
Hirokazu Takahashi.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Tue, 2005-08-16 at 19:16 +0900, Hiro Yoshioka wrote:
> From: Arjan van de Ven <[EMAIL PROTECTED]>
> > > My code does nothing about it.
> > >
> > > I need a volunteer to implement it.
> >
> > it's actually not too hard; all you need is to use SSE and not MMX; and
> > then just store sse register you're overwriting on the stack or so...
>
> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

not on kernel entry afaik.
However just save the register on the stack and put it back at the
end...
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi

> > > My code does nothing about it.
> > >
> > > I need a volunteer to implement it.
> >
> > it's actually not too hard; all you need is to use SSE and not MMX; and
> > then just store sse register you're overwriting on the stack or so...
>
> oh, really? Does the linux kernel take care of
> SSE save/restore on a task switch?

noop!
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Arjan van de Ven <[EMAIL PROTECTED]>
> > My code does nothing about it.
> >
> > I need a volunteer to implement it.
>
> it's actually not too hard; all you need is to use SSE and not MMX; and
> then just store sse register you're overwriting on the stack or so...

oh, really? Does the linux kernel take care of
SSE save/restore on a task switch?

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Tue, 2005-08-16 at 12:30 +0900, Hiro Yoshioka wrote:

> The following example shows the L3 cache miss is reduced from 37410 to 107.

most impressive; it seems the approach to do this selectively is paying
off very well!

The only comment/question I have is about the use of prefetchnta; that
might have cache-evicting properties as well (eg evict the cache of the
original of the copy, eg the userspace memory). Is that really the right
approach?

In addition, my measurements show that removing the prefetch from the
main copy loop is a gain because the modern cpus have an autoprefetcher
already in the hardware.

        "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+       " prefetchnta 64(%0)\n"
+       " prefetchnta 128(%0)\n"
+       " prefetchnta 192(%0)\n"
+       " prefetchnta 256(%0)\n"
+       "2: \n"
+       ".section .fixup, \"ax\"\n"
+       "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+       " jmp 2b\n"
+       ".previous\n"

oh and prefetch(nta) is a non-faulting instruction so no need for the
fixup handling...

But overall this is starting to look really interesting!

Greetings,
Arjan van de Ven
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Tue, 2005-08-16 at 13:17 +0900, Hirokazu Takahashi wrote:
> Hi,
>
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.

__copy_from_user_inatomic .. that implies it won't sleep actually ;)
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Tue, 2005-08-16 at 13:54 +0900, Hiro Yoshioka wrote:
> Takahashi san,
>
> I appreciate your comments.
>
> > Hi,
> >
> > BTW, what are you going to do with the page-faults which may happen
> > during __copy_user_zeroing_nocache()? The current process may be blocked
> > in the handler for a while and get FPU registers polluted.
> > kernel_fpu_begin() won't help the case. This is another issue, though.
>
> My code does nothing about it.
>
> I need a volunteer to implement it.

it's actually not too hard; all you need is to use SSE and not MMX; and
then just store sse register you're overwriting on the stack or so...
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Takahashi san,

I appreciate your comments.

> Hi,
>
> BTW, what are you going to do with the page-faults which may happen
> during __copy_user_zeroing_nocache()? The current process may be blocked
> in the handler for a while and get FPU registers polluted.
> kernel_fpu_begin() won't help the case. This is another issue, though.

My code does nothing about it.

I need a volunteer to implement it.

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi, BTW, what are you going to do with the page-faults which may happen during __copy_user_zeroing_nocache()? The current process may be blocked in the handler for a while and get FPU registers polluted. kernel_fpu_begin() won't help the case. This is another issue, though. > > Thanks. > > > > filemap_copy_from_user() calls __copy_from_user_inatomic() calls > > __copy_from_user_ll(). > > > > I'll look at the code. > > The following is a quick hack of cache aware implementation > of __copy_from_user_ll() and __copy_from_user_inatomic() > > __copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache() > > filemap_copy_from_user() calles __copy_from_user_inatomic_nocache() > instead of __copy_from_user_inatomic() and reduced cashe miss. > > The first column is the cache reference (memory access) and the > third column is the 3rd level cache miss. > > The following example shows the L3 cache miss is reduced from 37410 to 107. > > 2.6.12.4 nocache version > Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) > with a unit mask of 0x3f (multiple flags) count 3000 > Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) > with a unit mask of 0x200 (read 3rd level cache miss) count 3000 > samples %samples % app name symbol name > 1204426.4106 1070.5620 vmlinux__copy_user_zeroing_nocache > 80049 4.2606 5783.0357 vmlinuxjournal_add_journal_head > 69194 3.6829 1540.8088 vmlinuxjournal_dirty_metadata > 67059 3.5692 78 0.4097 vmlinux__find_get_block > 64145 3.4141 32 0.1681 vmlinuxjournal_put_journal_head > pattern9-0-cpu4-0-08161154/summary.out > > The 2.6.12.4 original version is > Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) > with a unit mask of 0x3f (multiple flags) count 3000 > Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) > with a unit mask of 0x200 (read 3rd level cache miss) count 3000 > samples %samples % app name symbol name > 1206467.4680 37410 62.3355 
vmlinux__copy_from_user_ll > 79508 4.9215 9031.5046 vmlinux_spin_lock > 65526 4.0561 8731.4547 vmlinuxjournal_add_journal_head > 59296 3.6704 1290.2149 vmlinux__find_get_block > 58647 3.6302 2150.3582 vmlinuxjournal_dirty_metadata > > What do you think? > > Hiro > > diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile > --- linux-2.6.12.4.orig/Makefile 2005-08-12 14:37:59.0 +0900 > +++ linux-2.6.12.4.nocache/Makefile 2005-08-16 10:22:31.0 +0900 > @@ -1,7 +1,7 @@ > VERSION = 2 > PATCHLEVEL = 6 > SUBLEVEL = 12 > -EXTRAVERSION = .4.orig > +EXTRAVERSION = .4.nocache > NAME=Woozy Numbat > > # *DOCUMENTATION* > diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c > linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c > --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 > 16:04:37.0 +0900 > +++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c 2005-08-16 > 10:49:59.0 +0900 > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -511,6 +512,110 @@ > : "memory");\ > } while (0) > > +/* Non Temporal Hint version of mmx_memcpy */ > +/* It is cache aware */ > +/* [EMAIL PROTECTED] */ > +static unsigned long > +__copy_user_zeroing_nocache(void *to, const void *from, size_t len) > +{ > +/* Note! gcc doesn't seem to align stack variables properly, so we > + * need to make use of unaligned loads and stores. 
> + */ > + void *p; > + int i; > + > + if (unlikely(in_interrupt())){ > + __copy_user_zeroing(to, from, len); > + return len; > + } > + > + p = to; > + i = len >> 6; /* len/64 */ > + > +kernel_fpu_begin(); > + > + __asm__ __volatile__ ( > + "1: prefetchnta (%0)\n" /* This set is 28 bytes */ > + " prefetchnta 64(%0)\n" > + " prefetchnta 128(%0)\n" > + " prefetchnta 192(%0)\n" > + " prefetchnta 256(%0)\n" > + "2: \n" > + ".section .fixup, \"ax\"\n" > + "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */ > + " jmp 2b\n" > + ".previous\n" > + ".section __ex_table,\"a\"\n" > + " .align 4\n" > + " .long 1b, 3b\n" > + ".previous" > + : : "r" (from) ); > + > + for(; i>5; i--) > + { > + __asm__ __volatile__ ( > + "1: prefetchnta 320(%0)\n" > + "2: movq (%0), %%mm0\n" > + " movq 8(%0), %%mm1\n" > + " movq 16(%0), %%mm2\n" > + " movq 24(%0), %%m
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
From: Hiro Yoshioka <[EMAIL PROTECTED]>
Date: Tue, 16 Aug 2005 08:33:59 +0900

> Thanks.
>
> filemap_copy_from_user() calls __copy_from_user_inatomic() calls
> __copy_from_user_ll().
>
> I'll look at the code.

The following is a quick hack of a cache-aware implementation
of __copy_from_user_ll() and __copy_from_user_inatomic():

__copy_from_user_ll_nocache() and __copy_from_user_inatomic_nocache()

filemap_copy_from_user() now calls __copy_from_user_inatomic_nocache()
instead of __copy_from_user_inatomic(), which reduces the cache misses.

The first column is the cache references (memory accesses) and the
third column is the 3rd level cache misses.

The following example shows that the L3 cache misses are reduced from 37410 to 107.

2.6.12.4 nocache version
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120442   6.4106   107      0.5620   vmlinux   __copy_user_zeroing_nocache
80049    4.2606   578      3.0357   vmlinux   journal_add_journal_head
69194    3.6829   154      0.8088   vmlinux   journal_dirty_metadata
67059    3.5692   78       0.4097   vmlinux   __find_get_block
64145    3.4141   32       0.1681   vmlinux   journal_put_journal_head
pattern9-0-cpu4-0-08161154/summary.out

The 2.6.12.4 original version is
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120646   7.4680   37410    62.3355  vmlinux   __copy_from_user_ll
79508    4.9215   903      1.5046   vmlinux   _spin_lock
65526    4.0561   873      1.4547   vmlinux   journal_add_journal_head
59296    3.6704   129      0.2149   vmlinux   __find_get_block
58647    3.6302   215      0.3582   vmlinux   journal_dirty_metadata

What do you think?
Hiro diff -ur linux-2.6.12.4.orig/Makefile linux-2.6.12.4.nocache/Makefile --- linux-2.6.12.4.orig/Makefile2005-08-12 14:37:59.0 +0900 +++ linux-2.6.12.4.nocache/Makefile 2005-08-16 10:22:31.0 +0900 @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 12 -EXTRAVERSION = .4.orig +EXTRAVERSION = .4.nocache NAME=Woozy Numbat # *DOCUMENTATION* diff -ur linux-2.6.12.4.orig/arch/i386/lib/usercopy.c linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c --- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c2005-08-05 16:04:37.0 +0900 +++ linux-2.6.12.4.nocache/arch/i386/lib/usercopy.c 2005-08-16 10:49:59.0 +0900 @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -511,6 +512,110 @@ : "memory");\ } while (0) +/* Non Temporal Hint version of mmx_memcpy */ +/* It is cache aware */ +/* [EMAIL PROTECTED] */ +static unsigned long +__copy_user_zeroing_nocache(void *to, const void *from, size_t len) +{ +/* Note! gcc doesn't seem to align stack variables properly, so we + * need to make use of unaligned loads and stores. 
+ */ + void *p; + int i; + + if (unlikely(in_interrupt())){ + __copy_user_zeroing(to, from, len); + return len; + } + + p = to; + i = len >> 6; /* len/64 */ + +kernel_fpu_begin(); + + __asm__ __volatile__ ( + "1: prefetchnta (%0)\n" /* This set is 28 bytes */ + " prefetchnta 64(%0)\n" + " prefetchnta 128(%0)\n" + " prefetchnta 192(%0)\n" + " prefetchnta 256(%0)\n" + "2: \n" + ".section .fixup, \"ax\"\n" + "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */ + " jmp 2b\n" + ".previous\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b, 3b\n" + ".previous" + : : "r" (from) ); + + for(; i>5; i--) + { + __asm__ __volatile__ ( + "1: prefetchnta 320(%0)\n" + "2: movq (%0), %%mm0\n" + " movq 8(%0), %%mm1\n" + " movq 16(%0), %%mm2\n" + " movq 24(%0), %%mm3\n" + " movntq %%mm0, (%1)\n" + " movntq %%mm1, 8(%1)\n" + " movntq %%mm2, 16(%1)\n" + " movntq %%mm3, 24(%1)\n" + " movq 32(%0), %%mm0\n" + " movq 40(%0), %%mm1\n" + " movq 48(%0), %%mm2\n" + " movq 56(%0), %%mm3\n" + " movntq %%mm0, 32
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On 8/15/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > copy_from_user_nocache() is fine.
> >
> > But I don't know where I can use it. (I'm not so
> > familiar with the linux kernel file system yet.)
>
> I suspect the few cases where it will make the most difference will be
> in the VFS for the write() system call, and the AIO variants thereof.
>
> generic_file_buffered_write() will be a good candidate to try first...

Thanks.

filemap_copy_from_user() calls __copy_from_user_inatomic() calls
__copy_from_user_ll().

I'll look at the code.

Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Mon, Aug 15, 2005 at 05:09:12PM +0200, Arjan van de Ven wrote:
> On Mon, 2005-08-15 at 17:02 +0200, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> >
> > > On Mon, 2005-08-15 at 08:15 -0400, [EMAIL PROTECTED] wrote:
> > > > Actually, is there any place *other* than write() to the page cache that
> > > > warrants a non-temporal store? Network sockets with scatter/gather and
> > > > hardware checksum, maybe?
> > >
> > > afaik those use zero copy already, eg straight pagecache copy.
> >
> > Only if you use sendfile(). And the normal write path uses csum_copy_*
>
> but do those use s/g ?

sendfile yes. sendmsg also when the MTU of the device is larger than a page.

> and hw csum?

sendmsg normally not.

-Andi
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Mon, 2005-08-15 at 17:02 +0200, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
>
> > On Mon, 2005-08-15 at 08:15 -0400, [EMAIL PROTECTED] wrote:
> > > Actually, is there any place *other* than write() to the page cache that
> > > warrants a non-temporal store? Network sockets with scatter/gather and
> > > hardware checksum, maybe?
> >
> > afaik those use zero copy already, eg straight pagecache copy.
>
> Only if you use sendfile(). And the normal write path uses csum_copy_*

but do those use s/g ? and hw csum?
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Arjan van de Ven <[EMAIL PROTECTED]> writes:

> On Mon, 2005-08-15 at 08:15 -0400, [EMAIL PROTECTED] wrote:
> > Actually, is there any place *other* than write() to the page cache that
> > warrants a non-temporal store? Network sockets with scatter/gather and
> > hardware checksum, maybe?
>
> afaik those use zero copy already, eg straight pagecache copy.

Only if you use sendfile(). And the normal write path uses csum_copy_*

-Andi
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Mon, 2005-08-15 at 09:21 +0200, Arjan van de Ven wrote:
> On Sun, 2005-08-14 at 23:24 +0200, Ian Kumlien wrote:
> > Hi, all
> >
> > I might be misunderstanding things but...
> >
> > First of all, machines with long pipelines will suffer from cache misses
> > (p4 in this case).
> >
> > Depending on the size copied, (i don't know how large they are so..)
> > can't one run out of cachelines and/or evict more useful cache data?
>
> CPU caches are really big nowadays

Yes, but (is copy to/from user size limited?) what's the cache size
compared to the copy operation performed, compared to the lost useful
cachelines? =)

> > Ie, if it's cached from beginning to end, we generally only need 'some
> > of' the beginning, the cpu's prefetch should manage the rest.
>
> cpu prefetch isn't going to be fast enough. It helps some, but in the
> end the cpu prefetch also has to wait for the ram, it doesn't make the
> ram faster or free, it just takes a jumpstart on getting to it.

Yeah i know, but i was thinking more of a compromise, then it might be
better...

> > I might, as i said, not know all about things like this and i also
> > suffer from a fever but i still find Hiro's data interesting.
>
> It is. It's good proof that you can make a big gain already by
> converting a few key places to his excellent code. And neither me nor
> Christoph are suggesting to ditch his effort! Instead we suggest that
> what he is doing is useful for some cases and harmful for others, and
> that it is quite easy to identify those cases and separate them from
> each other, and that thus as a result it is more optimal to have 2 apis,
> one for each of the cases.

That's good to know, since i have wondered for a while why block io seems
so oddly slow...

I just thought that there might be some good compromise between the two
that would make it automatic.

Oh well, guess i'm back to coughing and waiting for patches to be
implemented =)

--
Ian Kumlien -- http://pomac.netswarm.net
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Mon, 2005-08-15 at 08:15 -0400, [EMAIL PROTECTED] wrote:
> Actually, is there any place *other* than write() to the page cache that
> warrants a non-temporal store? Network sockets with scatter/gather and
> hardware checksum, maybe?

afaik those use zero copy already, eg straight pagecache copy. Eg that's
the only case where s/g is used right now, and that case doesn't copy
already.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Actually, is there any place *other* than write() to the page cache that
warrants a non-temporal store? Network sockets with scatter/gather and
hardware checksum, maybe?

This is pretty much synonymous with what is allowed to go into high
memory, no?

While we're on the subject, for the copy_from_user source, prefetchnta is
probably indicated. If user space hasn't caused it to be cached already
(admittedly, the common case), we *know* the kernel isn't going to look at
that data again.
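A minimal userspace sketch of the source-side prefetch hint suggested above, assuming GCC or Clang. `__builtin_prefetch(addr, 0, 0)` asks for a read prefetch with no temporal locality, which these compilers typically emit as prefetchnta on x86. The function name copy_prefetch_nta(), the 64-byte line size and the 320-byte lookahead (borrowed from the `prefetchnta 320(%0)` in the posted patch) are illustrative assumptions, not a kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE     64    /* assumed cache line size */
#define PREFETCH_AHEAD 320   /* assumed lookahead, as in the posted patch */

/*
 * Hypothetical helper: copy len bytes, hinting that the *source* lines
 * will not be read again, so they need not displace useful cache data.
 */
static void copy_prefetch_nta(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	while (len >= CACHE_LINE) {
		/* Only prefetch addresses that are still inside the source. */
		if (len > PREFETCH_AHEAD)
			__builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
		memcpy(d, s, CACHE_LINE);
		d += CACHE_LINE;
		s += CACHE_LINE;
		len -= CACHE_LINE;
	}
	memcpy(d, s, len);   /* tail shorter than one line */
}
```

The hint only changes cache placement, never the copied bytes, so the result is identical to a plain memcpy(); whether it helps has to be measured on the target CPU.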
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Mon, 2005-08-15 at 17:44 +0900, Hiro Yoshioka wrote:
> Hi,
>
> I appreciate your suggestion.
>
> On 8/15/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> >
> > > Anyway we could not find the cache aware version of __copy_from_user_ll
> > > has a big regression yet.
> >
> > that is because you spread the cache misses out from one place to all
> > over the place, so that no one single point sticks out anymore.
> >
> > Do you agree that your copy is less optimal for the case where the
> > kernel will (almost) immediately use the data?
>
> Yes, I do.
>
> My server has 8KB of L1 cache. (512KB of L2/2MB of L3)
>
> If you move more than 4KB of data using __copy_from_user_ll(), the
> data will be spilled over L1 cache but in L2 (or L3)

L2 access time isn't too bad. your code evicts the data even from L2 and
L3 though (even if it was in there before)..

> When you move huge data (> 1MB), even L3 cache will not help you.
> (This is known as a cache pollution.)

yes.

> copy_from_user_nocache() is fine.
>
> But I don't know where I can use it. (I'm not so
> familiar with the linux kernel file system yet.)

I suspect the few cases where it will make the most difference will be
in the VFS for the write() system call, and the AIO variants thereof.

generic_file_buffered_write() will be a good candidate to try first...
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

I appreciate your suggestion.

On 8/15/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
>
> > Anyway we could not find the cache aware version of __copy_from_user_ll
> > has a big regression yet.
>
> that is because you spread the cache misses out from one place to all
> over the place, so that no one single point sticks out anymore.
>
> Do you agree that your copy is less optimal for the case where the
> kernel will (almost) immediately use the data?

Yes, I do.

My server has 8KB of L1 cache. (512KB of L2/2MB of L3)

If you move more than 4KB of data using __copy_from_user_ll(), the
data will spill out of the L1 cache but stay in L2 (or L3).

When you move huge data (> 1MB), even the L3 cache will not help you.
(This is known as cache pollution.)

> I agree that your copy is really nice for places where the kernel will
> NOT use the data in the cpu, say for big write() system calls.
>
> My suggestion is to realize there are basically 2 different use cases,
> and that in the code the first one is very common, while in your
> profiles the second one is very common. Based on that I suggest to make
> a special copy_from_user_nocache() API for the cases where the kernel
> will not use the data (and ignore software raid5 here) and use your
> excellent version for that API, while leaving the code for the cases
> where the kernel WILL use the data alone. Code wise the "will use" case
> is the vast majority, so only changing the few places that know they
> don't use the data will be very efficient, and will give immediate big
> improvement in your profile data, since those few places tend to get
> used a lot in the cases you benchmark.

copy_from_user_nocache() is fine.

But I don't know where I can use it. (I'm not so
familiar with the linux kernel file system yet.)

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Sun, 2005-08-14 at 23:24 +0200, Ian Kumlien wrote:
> Hi, all
>
> I might be misunderstanding things but...
>
> First of all, machines with long pipelines will suffer from cache misses
> (p4 in this case).
>
> Depending on the size copied, (i don't know how large they are so..)
> can't one run out of cachelines and/or evict more useful cache data?

CPU caches are really big nowadays

> Ie, if it's cached from beginning to end, we generally only need 'some
> of' the beginning, the cpu's prefetch should manage the rest.

cpu prefetch isn't going to be fast enough. It helps some, but in the
end the cpu prefetch also has to wait for the ram, it doesn't make the
ram faster or free, it just takes a jumpstart on getting to it.

> I might, as i said, not know all about things like this and i also
> suffer from a fever but i still find Hiro's data interesting.

It is. It's good proof that you can make a big gain already by
converting a few key places to his excellent code. And neither me nor
Christoph are suggesting to ditch his effort! Instead we suggest that
what he is doing is useful for some cases and harmful for others, and
that it is quite easy to identify those cases and separate them from
each other, and that thus as a result it is more optimal to have 2 apis,
one for each of the cases.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
> Anyway we could not find the cache aware version of __copy_from_user_ll
> has a big regression yet.

that is because you spread the cache misses out from one place to all
over the place, so that no one single point sticks out anymore.

Do you agree that your copy is less optimal for the case where the
kernel will (almost) immediately use the data?

I agree that your copy is really nice for places where the kernel will
NOT use the data in the cpu, say for big write() system calls.

My suggestion is to realize there are basically 2 different use cases,
and that in the code the first one is very common, while in your
profiles the second one is very common. Based on that I suggest to make
a special copy_from_user_nocache() API for the cases where the kernel
will not use the data (and ignore software raid5 here) and use your
excellent version for that API, while leaving the code for the cases
where the kernel WILL use the data alone. Code wise the "will use" case
is the vast majority, so only changing the few places that know they
don't use the data will be very efficient, and will give immediate big
improvement in your profile data, since those few places tend to get
used a lot in the cases you benchmark.
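The two-API split proposed above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel implementation: copy_cached() and copy_nocache() are hypothetical names, the non-temporal path uses the SSE2 intrinsic `_mm_stream_si32` (the movnti instruction that the later version of the patch also uses), and builds without SSE2 simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Ordinary copy: the destination ends up in the cache, which is what a
 * caller wants when the CPU is about to read the copied data. */
static void copy_cached(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
}

/* Hypothetical "nocache" variant for callers that know the CPU will not
 * touch the data again soon (for example, data that is only DMA'd out). */
static void copy_nocache(void *to, const void *from, size_t len)
{
#if defined(__SSE2__)
	unsigned char *d = to;
	const unsigned char *s = from;

	/* Copy a short head normally until the destination is 4-byte aligned. */
	while (len > 0 && ((uintptr_t)d & 3) != 0) {
		*d++ = *s++;
		len--;
	}
	/* Stream whole words past the cache with non-temporal stores. */
	while (len >= 4) {
		int word;

		memcpy(&word, s, sizeof(word));    /* source may be unaligned */
		_mm_stream_si32((int *)d, word);   /* movnti */
		d += 4;
		s += 4;
		len -= 4;
	}
	/* Make the streamed stores globally visible before returning. */
	_mm_sfence();
	/* Copy the remaining tail bytes normally. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
#else
	memcpy(to, from, len);   /* no SSE2: plain copy, same result */
#endif
}
```

Both functions produce the same bytes. The point of keeping them separate is that the caller, not the copy routine, states whether the data will be reused, which is the explicit-rather-than-implicit choice argued for in this thread.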
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

From: Arjan van de Ven <[EMAIL PROTECTED]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Sun, 14 Aug 2005 12:35:43 +0200
Message-ID: <[EMAIL PROTECTED]>

> On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> > Thanks for your comments.
> >
> > On 8/14/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > > Hi,
> > > >
> > > > The following is a patch to reduce a cache pollution
> > > > of __copy_from_user_ll().
> > > >
> > > > When I run simple iozone benchmark to find a performance bottleneck of
> > > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > > most and it did many cache misses.
> > >
> > > however... you copy something from userspace... aren't you going to USE
> > > it? The non-temporal versions actually throw the data out of the
> > > cache... so while this part might be nice, you pay BIG elsewhere
> >
> > The oprofile data does not give an evidence that we pay BIG elsewhere.
>
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally
>
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data. And if that's the case your change
> is a loss (just harder to see because the cost is spread out)

I understand that iozone is not a good benchmark and does not represent
any useful application, so I did a kernel build as a simple benchmark.
What I did is

cd /test/f1
tar xjf ${baseDir}/src/linux-2.6.12.4.tar.bz2
cd linux-2.6.12.4
cp -p ${baseDir}/src/config .config
make oldconfig
time make -j $CPUS

The following is the Top 5 of CPU cycles

Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 10
samples   %        app name         symbol name
7347544   72.8296  cc1              (no symbols)
532307    5.2763   libbz2.so.1.0.2  (no symbols)
241853    2.3973   vmlinux          buffered_rmqueue
128552    1.2742   libc-2.3.4.so    _int_malloc
107784    1.0684   vmlinux          page_fault
...
10749     0.1065   vmlinux          __copy_from_user_ll
pattern12-0-cpu4-0-08150920/summary.out

Since __copy_from_user_ll is not a hot spot here, we didn't see any big
performance difference. (The numbers are times (sec) of 5 runs.)

original 2.6.12.4         real    user     system
No profiling              532.27  1797.02  194.9
BSQ 0x200+0x3f            620.15  2094.21  212.38
GLOBAL_POWER_EVENTS:10:   586.01  1984.92  215.97

cache aware 2.6.12.4      real    user     system
No profiling              526.65  1792.22  190.05
BSQ 0x200+0x3f            615.51  2090.74  206.58
GLOBAL_POWER_EVENTS:10:   587.69  1978.66  209.18

Now Top 5 of Memory Access (2.6.12.4)

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples   %        samples  %        app name         symbol name
11439689  82.2135  33906    27.9328  cc1              (no symbols)
277177    1.9920   347      0.2859   libc-2.3.4.so    _int_malloc
229593    1.6500   12946    10.6653  libbz2.so.1.0.2  (no symbols)
84348     0.6062   116      0.0956   libc-2.3.4.so    _int_free
83653     0.6012   438      0.3608   libc-2.3.4.so    calloc
...
8527      0.0613   1648     1.3577   vmlinux          __copy_from_user_ll

Top 5 of Cache miss

33906  27.9328  cc1              (no symbols)
30849  25.4144  vmlinux          buffered_rmqueue
12946  10.6653  libbz2.so.1.0.2  (no symbols)
9178   7.5611   vmlinux          __copy_to_user_ll
2934   2.4171   oprofiled        (no symbols)
...
1648   1.3577   vmlinux          __copy_from_user_ll
pattern12-0-cpu4-0-08150917

Cache aware 2.6.12.4, Top 5 of Memory Access

samples   %        samples  %        app name         symbol name
11448487  82.8100  32786    28.1051  cc1              (no symbols)
276812    2.0023   256      0.2195   libc-2.3.4.so    _int_malloc
230177    1.6649   12371    10.6048  libbz2.so.1.0.2  (no symbols)
84485     0.6111   120      0.1029   libc-2.3.4.so    _int_free
84043     0.6079   473      0.4055   libc-2.3.4.so    calloc
...
18282     0.1322   9060     7.7665
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi, all

I might be misunderstanding things but...

First of all, machines with long pipelines will suffer from cache misses
(p4 in this case).

Depending on the size copied, (i don't know how large they are so..)
can't one run out of cachelines and/or evict more useful cache data?

Ie, if it's cached from beginning to end, we generally only need 'some
of' the beginning, the cpu's prefetch should manage the rest.

I might, as i said, not know all about things like this and i also
suffer from a fever but i still find Hiro's data interesting.

Isn't there some way to run the same test for the same time and measure
the differences in all-round data, to see if we really are punished as
badly when accessing the data after the copy? (Could it be size
dependent?)

--
Ian Kumlien -- http://pomac.netswarm.net
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally
>
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data.

Most of the callers probably want the normal one, but most of the copied
data (buffered filesystem I/O) will want the non cache polluting one.
So yes, doing this explicit makes a lot of sense.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> Thanks for your comments.
>
> On 8/14/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > Hi,
> > >
> > > The following is a patch to reduce a cache pollution
> > > of __copy_from_user_ll().
> > >
> > > When I run simple iozone benchmark to find a performance bottleneck of
> > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > most and it did many cache misses.
> >
> > however... you copy something from userspace... aren't you going to USE
> > it? The non-temporal versions actually throw the data out of the
> > cache... so while this part might be nice, you pay BIG elsewhere
>
> The oprofile data does not give an evidence that we pay BIG elsewhere.

the problem is that the pay elsewhere is far more spread out, but not
less. At least generally

I can see the point of a copy_from_user_nocache() or something, for
those cases where we *know* we are not going to use the copied data in
the cpu (but say, only do DMA).
But that should be explicit, not implicit, since the general case will
be that the kernel WILL use the data. And if that's the case your change
is a loss (just harder to see because the cost is spread out)
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Thanks for your comments.

On 8/14/05, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > Hi,
> >
> > The following is a patch to reduce a cache pollution
> > of __copy_from_user_ll().
> >
> > When I run simple iozone benchmark to find a performance bottleneck of
> > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > most and it did many cache misses.
>
> however... you copy something from userspace... aren't you going to USE
> it? The non-temporal versions actually throw the data out of the
> cache... so while this part might be nice, you pay BIG elsewhere

The oprofile data does not give evidence that we pay BIG elsewhere.

For example, the original 2.6.12.4 Top 5 cache misses are the following:

37017  63.4603  vmlinux  __copy_from_user_ll
1049   1.7984   vmlinux  _spin_lock_irqsave
940    1.6115   vmlinux  blk_rq_map_sg
896    1.5361   vmlinux  generic_file_buffered_write
885    1.5172   vmlinux  _spin_lock
pattern9-0-cpu4-0-08141702

The cache aware version Top 5 cache misses are

899    5.7305   vmlinux  blk_rq_map_sg
569    3.6270   vmlinux  journal_commit_transaction
531    3.3848   vmlinux  radix_tree_delete
514    3.2764   vmlinux  journal_add_journal_head
505    3.2190   vmlinux  release_pages
...
89     0.5673   vmlinux  _mmx_memcpy_nt
pattern9-0-cpu4-0-08141625

What do you think?

Regards,
Hiro
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> Hi,
>
> The following is a patch to reduce a cache pollution
> of __copy_from_user_ll().
>
> When I run simple iozone benchmark to find a performance bottleneck of
> the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> most and it did many cache misses.

however... you copy something from userspace... aren't you going to USE
it? The non-temporal versions actually throw the data out of the
cache... so while this part might be nice, you pay BIG elsewhere
[RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Hi,

The following is a patch to reduce the cache pollution
of __copy_from_user_ll().

When I ran a simple iozone benchmark to find a performance bottleneck of
the linux kernel, I found that __copy_from_user_ll() spent the most CPU
cycles and caused many cache misses.

The following is profiled by oprofile.

Top 5 CPU cycle

CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
281538   15.2083  vmlinux   __copy_from_user_ll
81069    4.3792   vmlinux   _spin_lock
75523    4.0796   vmlinux   journal_add_journal_head
63674    3.4396   vmlinux   do_get_write_access
52634    2.8432   vmlinux   journal_put_journal_head
(pattern9-0-cpu4-0-08141700/summary.out)

Top 5 Memory Access and Cache miss

CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
120801   7.4379   37017    63.4603  vmlinux   __copy_from_user_ll
84139    5.1806   885      1.5172   vmlinux   _spin_lock
66027    4.0654   656      1.1246   vmlinux   journal_add_journal_head
60400    3.7189   250      0.4286   vmlinux   __find_get_block
60032    3.6963   120      0.2057   vmlinux   journal_dirty_metadata

__copy_from_user_ll accounts for 63.4603% of the L3 cache misses though it
accounts for only 7.4379% of the memory accesses.

In order to reduce the cache misses in __copy_from_user_ll, I made the
following patch and confirmed the reduction of the misses.
Top 5 CPU cycle

CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 10
samples  %        app name  symbol name
120717   8.3454   vmlinux   _mmx_memcpy_nt
65955    4.5596   vmlinux   do_get_write_access
56088    3.8775   vmlinux   journal_put_journal_head
52550    3.6329   vmlinux   journal_dirty_metadata
38886    2.6883   vmlinux   journal_add_journal_head
pattern9-0-cpu4-0-08141627/summary.out

_mmx_memcpy_nt is the new function which is called from
__copy_from_user_ll, and it took only 42.88% of the samples of the
original implementation. (120717/281538 == 42.88%)

Top 5 Memory Access

CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name  symbol name
90918    6.3079   89       0.5673   vmlinux   _mmx_memcpy_nt
83654    5.8039   177      1.1283   vmlinux   journal_dirty_metadata
57836    4.0127   348      2.2183   vmlinux   journal_put_journal_head
48236    3.3466   165      1.0518   vmlinux   do_get_write_access
44546    3.0906   21       0.1339   vmlinux   __getblk

The cache misses were reduced from 37017 (63.4603%) to 89 (0.5673%).
That is 0.24% of the original implementation.

The actual elapsed times over five runs were 229.76 (sec) and
222.94 (sec). (229.76/222.94 = 1.0306, a 3.06% gain)

iozone -CMR -i 0 -+n -+u -s 8000MB -t 4

What do you think?

--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.0 +0900
+++ linux-2.6.12.4/arch/i386/lib/usercopy.c 2005-08-12 13:18:14.106916200 +0900
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -511,6 +512,108 @@
 : "memory");\
 } while (0)

+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware */
+/* [EMAIL PROTECTED] */
+static unsigned long _mmx_memcpy_nt(void *to, const void *from, size_t len)
+{
+/* Note! gcc doesn't seem to align stack variables properly, so we
+ * need to make use of unaligned loads and stores.
+ */
+	void *p;
+	int i;
+
+	if (unlikely(in_interrupt())){
+		__copy_user_zeroing(to, from, len);
+		return len;
+	}
+
+	p = to;
+	i = len >> 6; /* len/64 */
+
+kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+	"1: prefetchnta (%0)\n" /* This set is 28 bytes */
+	" prefetchnta 64(%0)\n"
+	" prefetchnta 128(%0)\n"
+