Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-04-11 Thread Ling Ma
Is the performance improvement acceptable, or are there further comments on this patch?

Thanks
Ling

2016-04-05 11:44 GMT+08:00 Ling Ma :
> Hi Longman,
>
>> with some modest increase in performance. That can be hard to justify. Maybe
>> you should find other use cases that involve fewer changes, but still have
>> noticeable performance improvement. That will make it easier to be accepted.
>
> The attachment is for another use case with the new lock optimization.
> It includes two files: main.c (a user-space workload) and
> fcntl-lock-opt.patch (a kernel patch against 4.3.0-rc4).
> (The hardware platform is an Intel E5-2699 v3: 72 threads,
> 18 cores x 2 sockets x 2 HT.)
>
> 1. When we run a.out from main.c on the original 4.3.0-rc4 kernel,
> the average throughput from a.out is 1887592 (98% CPU cost from perf top -d1).
>
> 2. When we run a.out from main.c with fcntl-lock-opt.patch applied,
> the average throughput from a.out is 5277281 (91% CPU cost from perf top -d1).
>
> So the new mechanism gives us about a 2.79x (5277281 / 1887592)
> improvement.
>
> Appreciate your comments.
>
> Thanks
> Ling


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-04-04 Thread Ling Ma
Hi Longman,

> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve fewer changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.

The attachment is for another use case with the new lock optimization.
It includes two files: main.c (a user-space workload) and
fcntl-lock-opt.patch (a kernel patch against 4.3.0-rc4).
(The hardware platform is an Intel E5-2699 v3: 72 threads, 18 cores x 2 sockets x 2 HT.)

1. When we run a.out from main.c on the original 4.3.0-rc4 kernel,
the average throughput from a.out is 1887592 (98% CPU cost from perf top -d1).

2. When we run a.out from main.c with fcntl-lock-opt.patch applied,
the average throughput from a.out is 5277281 (91% CPU cost from perf top -d1).

So the new mechanism gives us about a 2.79x (5277281 / 1887592) improvement.
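
main.c itself is in the attached tarball; for readers without the
attachment, the workload has roughly this shape (a sketch only -- the
thread count, file name and counting scheme below are guesses, not the
attached code): many threads hammer the same file's POSIX record lock
so the kernel file-lock path stays hot.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 72

static volatile unsigned long count[NTHREADS];

/* Each thread repeatedly takes and drops a write record lock on the
   same file, so fcntl(F_SETLK/F_SETLKW) stays contended. */
static void *worker(void *arg)
{
	long id = (long)arg;
	int fd = open("/tmp/fcntl-test", O_RDWR | O_CREAT, 0600);
	struct flock fl = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 0 };

	if (fd < 0)
		exit(1);
	for (;;) {
		fl.l_type = F_WRLCK;
		fcntl(fd, F_SETLKW, &fl);	/* take the record lock */
		fl.l_type = F_UNLCK;
		fcntl(fd, F_SETLK, &fl);	/* and release it at once */
		count[id]++;
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	long i;
	unsigned long sum, last = 0;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t, NULL, worker, (void *)i);

	for (;;) {	/* print locks/sec, like the throughput number above */
		sleep(1);
		for (sum = 0, i = 0; i < NTHREADS; i++)
			sum += count[i];
		printf("%lu\n", sum - last);
		last = sum;
	}
}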

Appreciate your comments.

Thanks
Ling


test-lock.tar
Description: Unix tar archive


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-03 Thread Ling Ma
> I have 2 major comments here. First of all, you should break up your patch
> into smaller ones. A large patch like the one in the tarball is hard to
> review.

Ok, we will do it.

> Secondly, you are modifying over 1000 lines of code in mm/slab.c
> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve fewer changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.

To justify it, the attachment in this letter includes 3 files:

1. user-space code (thread.c), which creates heavy kernel spinlock contention
from __kmalloc and kfree on a multi-core platform

2. ali_work_queue.patch, the kernel patch for 4.3.0-rc4;
when we run the user-space code (thread.c) with this patch,
the synchronous operation consumption from __kmalloc and kfree is
about 15% on an Intel E5-2699 v3

3. org_spin_lock.patch, which applies on top of the above ali_work_queue.patch;
when we run the user-space code (thread.c) with this patch,
the synchronous operation consumption from __kmalloc and kfree is
about 25% on an Intel E5-2699 v3


The main difference between ali_work_queue.patch and
org_spin_lock.patch is as below:

diff --git a/mm/slab.h b/mm/slab.h
...
-   ali_spinlock_t list_lock;
+   spinlock_t list_lock;
...

diff --git a/mm/slab.c b/mm/slab.c
...
-   alispinlock(lock, &ali);  /* the "&..." argument was lost in the archive; the name is assumed */
+   spin_lock((spinlock_t *)lock);
+   fn(para);
+   spin_unlock((spinlock_t *)lock);
...

These minimal changes rule out any performance noise from unrelated program modification.

We ran the user-space code thread.c with ali_work_queue.patch and
org_spin_lock.patch respectively;
the output from thread.c is as below:

ORG NEW
38923684 43380604
38100464 44163011
37769241 43354266
37908638 43554022
37900994 43457066
38495073 43421394
37340217 43146352
38083979 43506951
37713263 43775215
37749871 43487289
37843224 43366055
38173823 43270225
38303612 43214675
37886717 44083950
37736455 43060728
37529307 44607597
38862690 43541484
37992824 44749925
38013454 43572225
37783135 45240502
37745372 44712540
38721413 43584658
38097842 43235392

TOTAL 874675292 1005486126

So the data tell us the new mechanism improves performance by about 15%
(1005486126 / 874675292 = 1.15),
and the change can be fairly justified.

Thanks
Ling

2016-02-04 5:42 GMT+08:00 Waiman Long :
> On 02/02/2016 11:40 PM, Ling Ma wrote:
>>
>> Longman,
>>
>> The attachment includes user-space code (thread.c) and a kernel
>> patch (ali_work_queue.patch) based on 4.3.0-rc4;
>> we replaced the original spinlock (list_lock) in slab.h/c with the
>> new mechanism.
>>
>> The thread.c in user space causes lots of hot kernel spinlock traffic from
>> __kmalloc and kfree;
>> perf top -d1 shows ~25% before ali_work_queue.patch. After applying
>> this patch,
>> the synchronous operation consumption from __kmalloc and kfree is
>> reduced from ~25% to ~15% on an Intel E5-2699 v3
>> (we also observed that the output from the user-space code (thread.c)
>> improved clearly).
>
>
> I have 2 major comments here. First of all, you should break up your patch
> into smaller ones. A large patch like the one in the tarball is hard to
> review. Secondly, you are modifying over 1000 lines of code in mm/slab.c
> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve fewer changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.
>
> Cheers,
> Longman
>
>


ali_work_queue.tar.bz2
Description: BZip2 compressed data


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-03 Thread Waiman Long

On 02/02/2016 11:40 PM, Ling Ma wrote:

Longman,

The attachment includes user-space code (thread.c) and a kernel
patch (ali_work_queue.patch) based on 4.3.0-rc4;
we replaced the original spinlock (list_lock) in slab.h/c with the
new mechanism.

The thread.c in user space causes lots of hot kernel spinlock traffic from
__kmalloc and kfree;
perf top -d1 shows ~25% before ali_work_queue.patch. After applying
this patch,
the synchronous operation consumption from __kmalloc and kfree is
reduced from ~25% to ~15% on an Intel E5-2699 v3
(we also observed that the output from the user-space code (thread.c)
improved clearly).


I have 2 major comments here. First of all, you should break up your
patch into smaller ones. A large patch like the one in the tarball is
hard to review. Secondly, you are modifying over 1000 lines of code in
mm/slab.c with some modest increase in performance. That can be hard to
justify. Maybe you should find other use cases that involve fewer
changes, but still have noticeable performance improvement. That will
make it easier to be accepted.


Cheers,
Longman




Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-02 Thread Ling Ma
The attached thread.c shows that the new mechanism improves the output
of the user-space code by 1.14x (1174810406 / 1026910602;
kernel spinlock consumption is reduced from ~25% to ~15%), as below:

  ORG NEW
38186815 43644156
38340186 43121265
38383155 44087753
38567102 43532586
38027878 43622700
38011581 43396376
37861959 43322857
37963215 43375528
38039247 43618315
37989106 43406187
37916912 44163029
39053184 43138581
37928359 43247866
37967417 43390352
37909796 43218250
37727531 43256009
38032818 43460496
38001860 43536100
38019929 44231331
37846621 43550597
37823231 44229887
38108158 43142689
37771900 43228168
37652536 43901042
37649114 43172690
37591314 43380004
38539678 43435592

Total 1026910602 1174810406

Thanks
Ling

2016-02-03 12:40 GMT+08:00 Ling Ma :
> Longman,
>
> The attachment includes user-space code (thread.c) and a kernel
> patch (ali_work_queue.patch) based on 4.3.0-rc4;
> we replaced the original spinlock (list_lock) in slab.h/c with the
> new mechanism.
>
> The thread.c in user space causes lots of hot kernel spinlock traffic from
> __kmalloc and kfree;
> perf top -d1 shows ~25% before ali_work_queue.patch. After applying
> this patch,
> the synchronous operation consumption from __kmalloc and kfree is
> reduced from ~25% to ~15% on an Intel E5-2699 v3
> (we also observed that the output from the user-space code (thread.c)
> improved clearly).
>
> Peter, we will send an updated version according to your comments.
>
> Thanks
> Ling
>
>
> 2016-01-19 23:36 GMT+08:00 Waiman Long :
>> On 01/19/2016 03:52 AM, Ling Ma wrote:
>>>
>>> Is the performance improvement acceptable, or are there further
>>> comments on this patch?
>>>
>>> Thanks
>>> Ling
>>>
>>>
>>
>> Your alispinlock patchset should also include a use case where the lock is
>> used by some code within the kernel, with a test that can show a performance
>> improvement so that the reviewers can independently try it out and play
>> around with it. The kernel community will not accept any patch without a use
>> case in the kernel.
>>
>> Your lock_test.tar file is not good enough as it is not a performance test
>> of the patch that you sent out.
>>
>> Cheers,
>> Longman
/**
	Test Case:
		OpenDir, Get status and close it.
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <dirent.h>
#include <sys/stat.h>
#include <pthread.h>

#define TEST_DIR "/tmp/thread"
#define MAX_TEST_THREAD (80)
#define MAX_TEST_FILE 5000

static unsigned long *result[MAX_TEST_THREAD];
static int stop = 0;

static void* case_function(void *para)
{
	int id = (int)(long)para;
	DIR *pDir;
	struct stat f_stat;
	struct dirent *entry=NULL;
	char path[256];
	char cmd[512];
	
	int filecnt   = 0;
	int dircnt= 0;
	int filetotalsize = 0;
	unsigned long myresult = 0;
	int f = 0;
	
	result[id] = &myresult;

	/* Go to my path and create empty files */
	sprintf(path, "%s/%d", TEST_DIR, id);
	printf("Creating temp file at %s\n", path);

	sprintf(cmd, "mkdir %s", path);
	system(cmd);
	chdir(path);
	for (f = 0; f < MAX_TEST_FILE; f++)
	{
		char name[256];

		sprintf(name, "%s/%d", path, f);
		int t = open(name,  O_RDWR | O_CREAT | O_TRUNC, S_IRWXU);
		if (t != -1)
			close(t);
		else
		{
			printf("Errno = %d.\n", errno);
			exit(errno);
		}		
	}

again:
	if ((pDir = opendir(path)) == NULL)
	{
		printf("打开 %s 错误:没有那个文件或目录\n", TEST_DIR);
		goto err;
	}
	
	while ((entry = readdir(pDir)) != NULL)
	{
		struct stat buf;
		if (entry->d_name[0] == '.')
			continue;
		
		//f = open(entry->d_name, 0);
		f = stat(entry->d_name, &buf);

		/* stat() returns 0 on success; the close() below only fires
		   on failure and is a harmless leftover of the open() variant */
		if (f)
			close(f);
		myresult++;
		
		
		//printf("Filename %s, size %10d",entry->d_name, f_stat.st_size);
	}

	closedir(pDir);
	

	/* Need to stop */
	if (!stop)
		goto again;
	return NULL;

err:
	return NULL;
}

int main(void)
{
	int i;
	pthread_t thread;

	system("mkdir "TEST_DIR);
		
	for (i = 0; i < MAX_TEST_THREAD; i++)
	{
		pthread_create(&thread, NULL, case_function, (void *)(long)i);
	}

	while (1)
	{
		sleep(1);
		unsigned long times = 0;
		//printf("Statistics:\n");

		for (i = 0; i < MAX_TEST_THREAD; i++)
		{
			//printf("%d\t", *result[i]);
			times = times + *result[i];
		}
		printf("%ld\t\n", times);
		for (i = 0; i < MAX_TEST_THREAD; i++)
			*result[i] = 0;
	}
}


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-02 Thread Ling Ma
Longman,

The attachment includes user-space code (thread.c) and a kernel
patch (ali_work_queue.patch) based on 4.3.0-rc4;
we replaced the original spinlock (list_lock) in slab.h/c with the
new mechanism.

The thread.c in user space causes lots of hot kernel spinlock traffic from
__kmalloc and kfree;
perf top -d1 shows ~25% before ali_work_queue.patch. After applying
this patch,
the synchronous operation consumption from __kmalloc and kfree is
reduced from ~25% to ~15% on an Intel E5-2699 v3
(we also observed that the output from the user-space code (thread.c)
improved clearly).

Peter, we will send an updated version according to your comments.

Thanks
Ling


2016-01-19 23:36 GMT+08:00 Waiman Long :
> On 01/19/2016 03:52 AM, Ling Ma wrote:
>>
>> Is the performance improvement acceptable, or are there further
>> comments on this patch?
>>
>> Thanks
>> Ling
>>
>>
>
> Your alispinlock patchset should also include a use case where the lock is
> used by some code within the kernel, with a test that can show a performance
> improvement so that the reviewers can independently try it out and play
> around with it. The kernel community will not accept any patch without a use
> case in the kernel.
>
> Your lock_test.tar file is not good enough as it is not a performance test
> of the patch that you sent out.
>
> Cheers,
> Longman


ali_work_queue.tar.bz2
Description: BZip2 compressed data


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-06 Thread One Thousand Gnomes
On Wed, 6 Jan 2016 09:21:06 +0100
Peter Zijlstra  wrote:

> On Wed, Jan 06, 2016 at 09:16:43AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 05, 2016 at 09:42:27PM +0000, One Thousand Gnomes wrote:
> > > > It suffers the typical problems all those constructs do; namely it
> > > > wrecks accountability.
> > > 
> > > That's "government thinking" ;-) - for most real users throughput is
> > > more important than accountability. With the right API it ought to also
> > > be compile time switchable.
> > 
> > It's to do with having been involved with -rt. RT wants to do
> > accountability for such things because of PI and sorts.
> 
> Also, real people really do care about latency too; very bad worst-case
> spikes upset things.

Some, yes - I'm familiar with the way some of the big financial number-
crunching jobs need this. There are also people who instead care a lot
about throughput. Anything like this needs to end up with an external API
which looks the same whether the work is done via one thread or the other.

Alan


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-06 Thread Peter Zijlstra
On Wed, Jan 06, 2016 at 09:16:43AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2016 at 09:42:27PM +0000, One Thousand Gnomes wrote:
> > > It suffers the typical problems all those constructs do; namely it
> > > wrecks accountability.
> > 
> > That's "government thinking" ;-) - for most real users throughput is
> > more important than accountability. With the right API it ought to also
> > be compile time switchable.
> 
> It's to do with having been involved with -rt. RT wants to do
> accountability for such things because of PI and sorts.

Also, real people really do care about latency too; very bad worst-case
spikes upset things.


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-06 Thread Peter Zijlstra
On Tue, Jan 05, 2016 at 09:42:27PM +0000, One Thousand Gnomes wrote:
> > It suffers the typical problems all those constructs do; namely it
> > wrecks accountability.
> 
> That's "government thinking" ;-) - for most real users throughput is
> more important than accountability. With the right API it ought to also
> be compile time switchable.

It's to do with having been involved with -rt. RT wants to do
accountability for such things because of PI and sorts.

> > But here that is compounded by the fact that you inject other people's
> > work into 'your' lock region, thereby bloating lock hold times. Worse,
> > afaict (from a quick reading) there really isn't a bound on the amount
> > of work you inject.
> 
> That should be relatively easy to fix but for this kind of lock you
> normally get the big wins from stuff that is only a short amount of
> executing code. The fairness you trade in the cases where it is useful should
> be tiny except under extreme load, where the "accountability first"
> behaviour would be to fall over in a heap.
> 
> If your "lock" involves a lot of work then it probably should be a work
> queue or not using this kind of locking.

Sure, but the fact that it was not even mentioned/considered doesn't
give me a warm fuzzy feeling.

> > And while it's a cute collapse of an MCS lock and lockless list style
> > work queue (MCS after all is a lockless list), saving a few cycles from
> > the naive spinlock+llist implementation of the same thing, I really
> > do not see enough justification for any of this.
> 
> I've only personally dealt with such locks in the embedded space but
> there it was a lot more than a few cycles because you go from

Nah, what I meant was that you can do the same callback style construct
with a llist and a spinlock.
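
Roughly this, i.e. an untested sketch with invented names, nothing more:

#include <linux/spinlock.h>
#include <linux/llist.h>

struct combine_work {
	struct llist_node node;
	void (*fn)(void *);
	void *para;
};

struct combine_lock {
	struct llist_head pending;
	spinlock_t lock;
};

/* Post a callback; whoever holds (or grabs) the spinlock runs the whole
 * pending list, so the shared data stays in one CPU's cache.  The work
 * may run after this returns, so 'w' must stay live until its fn has
 * run (a per-work done flag would give the synchronous flavour). */
static void combine_run(struct combine_lock *c, struct combine_work *w)
{
	struct llist_node *batch;
	struct combine_work *p, *n;

	llist_add(&w->node, &c->pending);

	/* If trylock fails the current holder re-checks the list after
	 * unlocking, so no posted work is ever left behind. */
	while (!llist_empty(&c->pending) && spin_trylock(&c->lock)) {
		batch = llist_del_all(&c->pending);
		llist_for_each_entry_safe(p, n, batch, node)
			p->fn(p->para);		/* note: LIFO order */
		spin_unlock(&c->lock);
	}
}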

> The claim in the original post is 3x performance but doesn't explain
> performance doing what, or which kernel locks were switched and what
> patches were used. I don't find the numbers hard to believe for a big big
> box, but I'd like to see the actual use case patches so it can be benched
> with other workloads and also for latency and the like.

Very much agreed, those claims need to be substantiated with actual
patches using this thing and independently verified.


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-05 Thread One Thousand Gnomes
> It suffers the typical problems all those constructs do; namely it
> wrecks accountability.

That's "government thinking" ;-) - for most real users throughput is
more important than accountability. With the right API it ought to also
be compile time switchable.

> But here that is compounded by the fact that you inject other people's
> work into 'your' lock region, thereby bloating lock hold times. Worse,
> afaict (from a quick reading) there really isn't a bound on the amount
> of work you inject.

That should be relatively easy to fix but for this kind of lock you
normally get the big wins from stuff that is only a short amount of
executing code. The fairness you trade in the cases where it is useful should
be tiny except under extreme load, where the "accountability first"
behaviour would be to fall over in a heap.

If your "lock" involves a lot of work then it probably should be a work
queue or not using this kind of locking.

> And while it's a cute collapse of an MCS lock and lockless list style
> work queue (MCS after all is a lockless list), saving a few cycles from
> the naive spinlock+llist implementation of the same thing, I really
> do not see enough justification for any of this.

I've only personally dealt with such locks in the embedded space but
there it was a lot more than a few cycles because you go from


take lock
spins
pull things into cache
do stuff
cache lines go write/exclusive
unlock

take lock
move all the cache
do stuff
etc

to

take lock
queue work
pull things into cache
do work 1
cache lines go write/exclusive
do work 2

unlock
done

and for the kind of stuff you apply those locks to, you get big improvements.
Even on crappy little embedded processors cache bouncing hurts. Even
better, work-merging locks like this tend to improve throughput more the
higher the contention, unlike most other lock types.

The claim in the original post is 3x performance but doesn't explain
performance doing what, or which kernel locks were switched and what
patches were used. I don't find the numbers hard to believe for a big big
box, but I'd like to see the actual use case patches so it can be benched
with other workloads and also for latency and the like.

Alan


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-05 Thread Peter Zijlstra
On Thu, Dec 31, 2015 at 04:09:34PM +0800, ling.ma.prog...@gmail.com wrote:
> +void alispinlock(struct ali_spinlock *lock, struct ali_spinlock_info *ali)
> +{
> + struct ali_spinlock_info *next, *old;
> +
> + ali->next = NULL;
> + ali->locked = 1;
> + old = xchg(&lock->lock_p, ali);
> +
> + /* If NULL we are the first one */
> + if (old) {
> + WRITE_ONCE(old->next, ali);
> + if(ali->flags & ALI_LOCK_FREE)
> + return;
> + while((READ_ONCE(ali->locked)))
> + cpu_relax_lowlatency();
> + return;
> + }
> + old = READ_ONCE(lock->lock_p);
> +
> + /* Handle all pending works */
> +repeat:  
> + if(old == ali)
> + goto end;
> +
> + while (!(next = READ_ONCE(ali->next)))
> + cpu_relax();
> + 
> + ali->fn(ali->para);
> + ali->locked = 0;
> +
> + if(old != next) {
> + while (!(ali = READ_ONCE(next->next)))
> + cpu_relax();
> + next->fn(next->para);
> + next->locked = 0;
> + goto repeat;
> + 
> + } else
> + ali = next;

So I have a whole bunch of problems with this thing. For one, I object
to this being called a lock. It's much more like an async work-queue-like
thing.

It suffers the typical problems all those constructs do; namely it
wrecks accountability.

But here that is compounded by the fact that you inject other people's
work into 'your' lock region, thereby bloating lock hold times. Worse,
afaict (from a quick reading) there really isn't a bound on the amount
of work you inject.

This will completely wreck scheduling latency. At the very least the
callback loop should have a need_resched() test in it, but even that will
not work if this has IRQs disabled.


And while it's a cute collapse of an MCS lock and lockless list style
work queue (MCS after all is a lockless list), saving a few cycles from
the naive spinlock+llist implementation of the same thing, I really
do not see enough justification for any of this.


Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-01-05 Thread Waiman Long

On 12/31/2015 03:09 AM, ling.ma.prog...@gmail.com wrote:

From: Ma Ling

Hi ALL,

Wire latency (RC delay) dominates modern computer performance;
conventional serialized work causes serious cache-line ping-pong,
and the process spends a lot of time and power to complete,
especially on a multi-core platform.

However, if the serialized work is sent to one core and executed
there when lock contention happens, much time and power can be saved,
because all the shared data stay in the private cache of that one core.
We call this mechanism Acceleration from Lock Integration
(ali spinlock).

Usually when requests are queued, we have to wait for the work to be
submitted one by one; to improve overall throughput further, we
introduce LOCK_FREE. When a request is sent to the lock owner, the
requester may do other work in parallel, and the ali_spin_is_completed
function tells it whether the work has been completed.
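
For example, a caller could queue its critical-section work and overlap
other work like this (a sketch only: update_shared_list, do_other_work
and submit are invented here, and the exact ali_spin_is_completed
signature is assumed from the description above):

	static void update_shared_list(void *para)
	{
		/* the serialized work; runs on whichever CPU owns the lock */
	}

	void submit(struct ali_spinlock *lock, void *item)
	{
		struct ali_spinlock_info ali = {
			.fn    = update_shared_list,
			.para  = item,
			.flags = ALI_LOCK_FREE,	/* don't spin inside alispinlock() */
		};

		alispinlock(lock, &ali);	/* queue the work; returns at once */

		do_other_work();		/* overlaps with the lock owner */

		/* 'ali' lives on our stack, so wait before returning */
		while (!ali_spin_is_completed(&ali))
			cpu_relax();
	}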

The new code is based on qspinlock and implements Lock Integration;
it improves performance up to 3x on an Intel platform with 72 cores
(18 cores x 2 HT x 2 sockets, HSW) and up to 2x on an ARM platform with
96 cores. Additional trivial changes to Makefile/Kconfig enable compiling
this feature on the x86 platform.
(We would like to do further experiments according to your requirements.)

Happy New Year 2016!
Ling

Signed-off-by: Ma Ling
---
  arch/x86/Kconfig |1 +
  include/linux/alispinlock.h  |   41 ++
  kernel/Kconfig.locks |7 +++
  kernel/locking/Makefile  |1 +
  kernel/locking/alispinlock.c |   97 ++
  5 files changed, 147 insertions(+), 0 deletions(-)
  create mode 100644 include/linux/alispinlock.h
  create mode 100644 kernel/locking/alispinlock.c




You should include additional patches that illustrate the possible use 
cases and performance improvement before and after the patches. This 
will allow the reviewers to actually try it out and play with it.


Cheers,
Longman

