[dpdk-dev] some questions about rte_memcpy
On 2015/1/22 23:21, Bruce Richardson wrote:
> This (size_c) is a run-time constant, not a compile-time constant. To trigger
> the memcpy optimizations inside the compiler, the size value must be constant
> at compile time.

Hi, Bruce

You are right. With a compile-time constant size, memcpy is faster. Thank you all.

Here is my test result:

rte_memcpy(constant) size:8 time:876
rte_memcpy(variable) size:8 time:2824
rte_memcpy(constant) size:16 time:868
rte_memcpy(variable) size:16 time:4436
rte_memcpy(constant) size:32 time:856
rte_memcpy(variable) size:32 time:3264
rte_memcpy(constant) size:48 time:872
rte_memcpy(variable) size:48 time:3972
rte_memcpy(constant) size:64 time:856
rte_memcpy(variable) size:64 time:3644
rte_memcpy(constant) size:128 time:868
rte_memcpy(variable) size:128 time:4720
rte_memcpy(constant) size:256 time:868
rte_memcpy(variable) size:256 time:9624

Here is my test program (does anyone know how to use a loop to test the 'constant memcpy' case?):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <rte_cycles.h>
#include <rte_memcpy.h>

int main(int narg, char** args)
{
    int i, t;
    char buf[256];
    int tests[7] = {8, 16, 32, 48, 64, 128, 256};
    char buf8[8], buf16[16], buf32[32], buf48[48], buf64[64], buf128[128], buf256[256];
    uint64_t start, end;
    int times = 999;
    uint64_t result_c[7];

    if (narg < 2) {
        printf("usage:./rte_memcpy_test times\n");
        return -1;
    }

    times = atoi(args[1]);

    /* Constant-size copies: each size is a sizeof expression, known at compile time. */
    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf8, buf8, sizeof buf8);
    }
    end = rte_rdtsc();
    result_c[0] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf16, buf16, sizeof buf16);
    }
    end = rte_rdtsc();
    result_c[1] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf32, buf32, sizeof buf32);
    }
    end = rte_rdtsc();
    result_c[2] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf48, buf48, sizeof buf48);
    }
    end = rte_rdtsc();
    result_c[3] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf64, buf64, sizeof buf64);
    }
    end = rte_rdtsc();
    result_c[4] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf128, buf128, sizeof buf128);
    }
    end = rte_rdtsc();
    result_c[5] = end - start;

    start = rte_rdtsc();
    for (t = 0; t < times; t++) {
        rte_memcpy(buf256, buf256, sizeof buf256);
    }
    end = rte_rdtsc();
    result_c[6] = end - start;

    /* Variable-size copies: the size is read from an array at run time. */
    for (i = 0; i < (int)(sizeof tests / sizeof tests[0]); i++) {
        start = rte_rdtsc();
        for (t = 0; t < times; t++) {
            rte_memcpy(buf, buf, tests[i]);
        }
        end = rte_rdtsc();
        printf("rte_memcpy(constant) size:%d time:%llu\n", tests[i],
               (unsigned long long)result_c[i]);
        printf("rte_memcpy(variable) size:%d time:%llu\n", tests[i],
               (unsigned long long)(end - start));
    }

    return 0;
}

--
Regards,
Haifeng
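On the question of looping over the constant-size case: an ordinary for loop turns the size into a run-time value, but a preprocessor list can stamp out one timed function per literal size. The following is only a sketch of that idea. To build without DPDK it uses memcpy and clock() as stand-ins for rte_memcpy and rte_rdtsc, it copies between separate source and destination buffers, and the names COPY_SIZES, DEFINE_TIMED_COPY and run_constant_tests are made up for illustration. The empty asm statement is a GCC/Clang idiom.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* One list of sizes; each X(...) entry is expanded by the preprocessor. */
#define COPY_SIZES(X) X(8) X(16) X(32) X(48) X(64) X(128) X(256)

/* Defines time_copy_<sz>(), in which the copy size is the literal <sz>. */
#define DEFINE_TIMED_COPY(sz) \
    static double time_copy_##sz(int times) \
    { \
        static char src[sz], dst[sz]; \
        clock_t start = clock(); \
        for (int t = 0; t < times; t++) { \
            memcpy(dst, src, sz); \
            /* barrier so the compiler cannot delete the copy */ \
            __asm__ volatile("" : : "r"(dst) : "memory"); \
        } \
        return (double)(clock() - start) / CLOCKS_PER_SEC; \
    }
COPY_SIZES(DEFINE_TIMED_COPY)

/* Times every size in the list and returns how many sizes were measured. */
static int
run_constant_tests(int times)
{
    int measured = 0;

    assert(times > 0);
#define REPORT(sz) \
    printf("memcpy(constant) size:%d time:%f s\n", sz, \
           time_copy_##sz(times)); \
    measured++;
    COPY_SIZES(REPORT)
#undef REPORT
    return measured;
}
```

Calling run_constant_tests(times) from main() prints one line per size. In a DPDK build, memcpy and clock() would presumably be swapped back for rte_memcpy and rte_rdtsc.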
[dpdk-dev] some questions about rte_memcpy
On 2015/1/22 19:34, Bruce Richardson wrote:
> On Thu, Jan 22, 2015 at 07:23:49PM +0900, Tetsuya Mukawa wrote:
>> On 2015/01/22 16:35, Matthew Hall wrote:
>>> On Thu, Jan 22, 2015 at 01:32:04PM +0800, Linhaifeng wrote:
>>>> Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?
>>> No guarantee. But a theory. It might use some things from the EAL init to
>>> figure out which version of the accelerated algorithm to use.
>>
>> This selection is done at compile-time.
>> And if the size is constant, I guess DPDK assumes memcpy is replaced by
>> inline __builtin_memcpy.
>> I haven't checked the performance of builtin memcpy, but probably much
>> faster.
>>
>
> Yes, that assumption is correct. A couple of years ago we discovered that for
> constant size values, the compiler would generate much faster code for us
> using a regular memcpy than rte_memcpy, hence the macro.
>
> /Bruce
>
>> Tetsuya
>>
>>> Matthew.
>>
>>
>

Hi, Bruce

I tested it. Most results are as you said: using a constant may be faster, but sometimes it is not.

linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
rte_memcpy(constant) used:279893712 @@ not faster
rte_memcpy(variable) used:277818600
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
rte_memcpy(constant) used:279264328 @@ not faster
rte_memcpy(variable) used:277667116
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
rte_memcpy(constant) used:279491832 @@ not faster
rte_memcpy(variable) used:277622772
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
rte_memcpy(constant) used:279402156 @@ not faster
rte_memcpy(variable) used:277738464
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
rte_memcpy(constant) used:279305172 @@ not faster
rte_memcpy(variable) used:277483004
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
rte_memcpy(constant) used:279784124 @@ not faster
rte_memcpy(variable) used:277605332
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
rte_memcpy(constant) used:322817260
rte_memcpy(variable) used:350333864
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
rte_memcpy(constant) used:322840748
rte_memcpy(variable) used:350297868
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
rte_memcpy(constant) used:322488240
rte_memcpy(variable) used:350348652
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
rte_memcpy(constant) used:322021428
rte_memcpy(variable) used:350416440
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
rte_memcpy(constant) used:321370900
rte_memcpy(variable) used:350355796
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
rte_memcpy(constant) used:322704552
rte_memcpy(variable) used:349900832
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
rte_memcpy(constant) used:422705828
rte_memcpy(variable) used:425493328
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
rte_memcpy(constant) used:422421840 @@ not faster
rte_memcpy(variable) used:413691412
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
rte_memcpy(constant) used:425233088 @@ not faster
rte_memcpy(variable) used:421136724
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
rte_memcpy(constant) used:901014608 @@ not faster
rte_memcpy(variable) used:900997388
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
rte_memcpy(constant) used:900803308 @@ not faster
rte_memcpy(variable) used:900794076
linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
rte_memcpy(constant) used:901842436 @@ not faster
rte_memcpy(variable) used:901218984
linux-mnSyvH:/mnt/sdb/linhf/test #

Here is my test code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <rte_cycles.h>
#include <rte_memcpy.h>

int main(int narg, char** args)
{
    int i;
    char buf[1024];
    uint64_t start, end;

    if (narg < 3) {
        printf("usage:./rte_memcpy_test size times\n");
        return 0;
    }

    size_t size_v = atoi(args[1]);
    const size_t size_c = atoi(args[1]);
    int times = atoi(args[2]);

    /* Copy with the const-qualified size. */
    start = rte_rdtsc();
    for (i = 0; i < times; i++) {
        rte_memcpy(buf, buf, size_c);
    }
    end = rte_rdtsc();
    printf("rte_memcpy(constant) used:%llu\n", (unsigned long long)(end - start));

    /* Copy with the plain variable size. */
    start = rte_rdtsc();
    for (i = 0; i < times; i++) {
        rte_memcpy(buf, buf, size_v);
    }
    end = rte_rdtsc();
    printf("rte_memcpy(variable) used:%llu\n", (unsigned long long)(end - start));

    return 0;
}

--
Regards,
Haifeng
[dpdk-dev] some questions about rte_memcpy
On 2015/01/22 16:35, Matthew Hall wrote:
> On Thu, Jan 22, 2015 at 01:32:04PM +0800, Linhaifeng wrote:
>> Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?
> No guarantee. But a theory. It might use some things from the EAL init to
> figure out which version of the accelerated algorithm to use.

This selection is done at compile-time.
And if the size is constant, I guess DPDK assumes memcpy is replaced by
inline __builtin_memcpy.
I haven't checked the performance of builtin memcpy, but probably much
faster.

Tetsuya

> Matthew.
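To illustrate what "selection at compile time" can look like, here is a minimal sketch. It is not DPDK's source: COPY_VARIANT, COPY_BLOCK, copy_variant and block_copy are hypothetical names, and the only real identifiers are the __AVX2__ and __SSE2__ macros that GCC and Clang predefine according to the target flags. The point is that the choice is fixed when the file is compiled, so nothing has to be initialised at run time before the copy routine is usable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The preprocessor picks one variant; the others are never compiled. */
#if defined(__AVX2__)
#define COPY_VARIANT "avx2"
#define COPY_BLOCK 32u
#elif defined(__SSE2__)
#define COPY_VARIANT "sse2"
#define COPY_BLOCK 16u
#else
#define COPY_VARIANT "generic"
#define COPY_BLOCK 8u
#endif

/* Name of the variant that was compiled in. */
static const char *
copy_variant(void)
{
    return COPY_VARIANT;
}

/* Copies n bytes in COPY_BLOCK-sized chunks, then the remaining tail. */
static void *
block_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    while (n >= COPY_BLOCK) {
        memcpy(d, s, COPY_BLOCK); /* constant size, so the compiler can inline it */
        d += COPY_BLOCK;
        s += COPY_BLOCK;
        n -= COPY_BLOCK;
    }
    if (n > 0)
        memcpy(d, s, n);
    return dst;
}
```

Building the same file with different target flags (for example with or without -mavx2) changes what copy_variant() reports, without any run-time detection step.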
[dpdk-dev] some questions about rte_memcpy
On Thu, Jan 22, 2015 at 08:53:13PM +0800, Linhaifeng wrote:
>
>
> On 2015/1/22 19:34, Bruce Richardson wrote:
> > On Thu, Jan 22, 2015 at 07:23:49PM +0900, Tetsuya Mukawa wrote:
> >> On 2015/01/22 16:35, Matthew Hall wrote:
> >>> On Thu, Jan 22, 2015 at 01:32:04PM +0800, Linhaifeng wrote:
> >>>> Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?
> >>> No guarantee. But a theory. It might use some things from the EAL init to
> >>> figure out which version of the accelerated algorithm to use.
> >>
> >> This selection is done at compile-time.
> >> And if the size is constant, I guess DPDK assumes memcpy is replaced by
> >> inline __builtin_memcpy.
> >> I haven't checked the performance of builtin memcpy, but probably much
> >> faster.
> >>
> >
> > Yes, that assumption is correct. A couple of years ago we discovered that
> > for constant size values, the compiler would generate much faster code for
> > us using a regular memcpy than rte_memcpy, hence the macro.
> >
> > /Bruce
> >
> >> Tetsuya
> >>
> >>> Matthew.
> >>
> >>
> >
> >
>
> Hi, Bruce
>
> I tested it. Most results are as you said: using a constant may be faster,
> but sometimes it is not.
>
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
> rte_memcpy(constant) used:279893712 @@ not faster
> rte_memcpy(variable) used:277818600
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
> rte_memcpy(constant) used:279264328 @@ not faster
> rte_memcpy(variable) used:277667116
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 16 999
> rte_memcpy(constant) used:279491832 @@ not faster
> rte_memcpy(variable) used:277622772
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
> rte_memcpy(constant) used:279402156 @@ not faster
> rte_memcpy(variable) used:277738464
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
> rte_memcpy(constant) used:279305172 @@ not faster
> rte_memcpy(variable) used:277483004
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 32 999
> rte_memcpy(constant) used:279784124 @@ not faster
> rte_memcpy(variable) used:277605332
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
> rte_memcpy(constant) used:322817260
> rte_memcpy(variable) used:350333864
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
> rte_memcpy(constant) used:322840748
> rte_memcpy(variable) used:350297868
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 48 999
> rte_memcpy(constant) used:322488240
> rte_memcpy(variable) used:350348652
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
> rte_memcpy(constant) used:322021428
> rte_memcpy(variable) used:350416440
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
> rte_memcpy(constant) used:321370900
> rte_memcpy(variable) used:350355796
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 64 999
> rte_memcpy(constant) used:322704552
> rte_memcpy(variable) used:349900832
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
> rte_memcpy(constant) used:422705828
> rte_memcpy(variable) used:425493328
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
> rte_memcpy(constant) used:422421840 @@ not faster
> rte_memcpy(variable) used:413691412
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 128 999
> rte_memcpy(constant) used:425233088 @@ not faster
> rte_memcpy(variable) used:421136724
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
> rte_memcpy(constant) used:901014608 @@ not faster
> rte_memcpy(variable) used:900997388
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
> rte_memcpy(constant) used:900803308 @@ not faster
> rte_memcpy(variable) used:900794076
> linux-mnSyvH:/mnt/sdb/linhf/test # ./rte_memcpy_test 256 999
> rte_memcpy(constant) used:901842436 @@ not faster
> rte_memcpy(variable) used:901218984
> linux-mnSyvH:/mnt/sdb/linhf/test #
>
>
>
> Here is my test code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <stdint.h>
> #include <rte_cycles.h>
> #include <rte_memcpy.h>
>
>
> int main(int narg, char** args)
> {
>     int i;
>     char buf[1024];
>     uint64_t start, end;
>
>     if (narg < 3) {
>         printf("usage:./rte_memcpy_test size times\n");
>         return 0;
>     }
>
>     size_t size_v = atoi(args[1]);
>     const size_t size_c = atoi(args[1]);

This (size_c) is a run-time constant, not a compile-time constant. To trigger
the memcpy optimizations inside the compiler, the size value must be constant
at compile time.

Regards,
/Bruce

>     int times = atoi(args[2]);
>
>     start = rte_rdtsc();
>     for (i = 0; i < times; i++) {
>         rte_memcpy(buf, buf, size_c);
>     }
>     end = rte_rdtsc();
>     printf("rte_memcpy(constant) used:%llu\n", (unsigned long long)(end - start));
>
>     start = rte_rdtsc();
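The distinction Bruce draws can be checked directly with GCC/Clang's __builtin_constant_p. The sketch below is an illustration only: the function names and COPY_SIZE are hypothetical, and a volatile variable stands in for a value such as atoi(args[1]) that the compiler cannot know in advance. A const qualifier only forbids later modification; it does not make the value known at compile time.

```c
#include <assert.h>
#include <stddef.h>

enum { COPY_SIZE = 16 }; /* a true compile-time constant */

/* An enumeration constant (or a literal) is known to the compiler. */
static int
literal_is_constant(void)
{
    return __builtin_constant_p(COPY_SIZE);
}

/* sizeof of a fixed-size array is also a constant expression. */
static int
sizeof_is_constant(void)
{
    char buf[64];

    return __builtin_constant_p(sizeof buf);
}

/* A const object initialised from a run-time value is not a constant. */
static int
runtime_const_is_constant(void)
{
    volatile size_t input = 16;  /* stands in for atoi(args[1]) */
    const size_t size_c = input; /* read-only, but its value is unknown at compile time */

    return __builtin_constant_p(size_c);
}

/* Self-check of the three cases. */
static void
check_classification(void)
{
    assert(literal_is_constant());
    assert(sizeof_is_constant());
    assert(!runtime_const_is_constant());
}
```

In terms of the test program, passing sizeof buf or a #define'd number to rte_memcpy would exercise the constant branch, while size_c and size_v both take the variable branch, which fits the near-identical timings reported.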
[dpdk-dev] some questions about rte_memcpy
On 2015/1/22 12:45, Matthew Hall wrote:
> One theory. Many DPDK functions crash if they are called before
> rte_eal_init() is called. So perhaps this could be a cause, since that won't
> have been called when working on a constant

Hi, Matthew

Thank you for your response.
Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?

--
Regards,
Haifeng
[dpdk-dev] some questions about rte_memcpy
#define rte_memcpy(dst, src, n) \
    ((__builtin_constant_p(n)) ? \
    memcpy((dst), (src), (n)) : \
    rte_memcpy_func((dst), (src), (n)))

Why does the macro call memcpy when n is a constant?
Can I change it to the following code?

#define rte_memcpy(dst, src, n) \
{ \
    int num = n; \
    rte_memcpy_func((dst), (src), (num)); \
}

--
Regards,
Haifeng
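For readers who want to see which branch of such a macro is taken, here is a small sketch under stated assumptions. my_memcpy_func is a hypothetical stand-in for rte_memcpy_func that simply counts its calls and forwards to memcpy; it says nothing about how DPDK implements the real function. The macro has the same shape as the one quoted in the question, and a volatile variable models a size that is only known at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts how often the function branch is taken. */
static unsigned long func_calls;

/* Hypothetical stand-in for rte_memcpy_func(); not DPDK's implementation. */
static void *
my_memcpy_func(void *dst, const void *src, size_t n)
{
    func_calls++;
    return memcpy(dst, src, n);
}

/* Same shape as the macro in the question, with the stand-in substituted. */
#define my_memcpy(dst, src, n) \
    ((__builtin_constant_p(n)) ? \
     memcpy((dst), (src), (n)) : \
     my_memcpy_func((dst), (src), (n)))

/* Returns how many of the two copies went through my_memcpy_func(). */
static unsigned long
dispatch_demo(void)
{
    char src[64] = "payload", dst[64];
    volatile size_t input = 32; /* models a size known only at run time */
    size_t n = input;

    assert(n <= sizeof dst);
    func_calls = 0;
    my_memcpy(dst, src, sizeof dst); /* constant size: plain memcpy branch */
    my_memcpy(dst, src, n);          /* run-time size: function branch */
    return func_calls;
}
```

With the proposed replacement macro, the copy through `int num = n;` would always reach the function, so the compiler would lose the chance to expand the constant-size memcpy inline. The braces form is also a statement rather than an expression, so it could no longer be used where the returned pointer is needed.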
[dpdk-dev] some questions about rte_memcpy
On Thu, Jan 22, 2015 at 07:23:49PM +0900, Tetsuya Mukawa wrote:
> On 2015/01/22 16:35, Matthew Hall wrote:
> > On Thu, Jan 22, 2015 at 01:32:04PM +0800, Linhaifeng wrote:
> >> Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?
> > No guarantee. But a theory. It might use some things from the EAL init to
> > figure out which version of the accelerated algorithm to use.
>
> This selection is done at compile-time.
> And if the size is constant, I guess DPDK assumes memcpy is replaced by
> inline __builtin_memcpy.
> I haven't checked the performance of builtin memcpy, but probably much
> faster.
>

Yes, that assumption is correct. A couple of years ago we discovered that for
constant size values, the compiler would generate much faster code for us
using a regular memcpy than rte_memcpy, hence the macro.

/Bruce

> Tetsuya
>
> > Matthew.
> >
[dpdk-dev] some questions about rte_memcpy
On Thu, Jan 22, 2015 at 01:32:04PM +0800, Linhaifeng wrote:
> Do you mean that calling rte_memcpy before rte_eal_init() would crash? Why?

No guarantee. But a theory. It might use some things from the EAL init to
figure out which version of the accelerated algorithm to use.

Matthew.